Hadoop Basics

This article covers the basics of Hadoop. Hadoop is an open source project used for processing large datasets in parallel on clusters of low-cost commodity machines. Hadoop consists of two main parts:

  1. Hadoop Distributed File System (HDFS): a file system optimized for the distributed processing of very large datasets on commodity hardware.
  2. MapReduce framework: the MapReduce framework processes the data in two main phases.

Phase 1: Map phase

Phase 2: Reduce phase

Next we discuss how to create a sample Hadoop application. This application takes different dictionaries from English to other languages (English-Spanish, English-Italian, English-French) and creates a single dictionary file that has each English word followed by all of its translations, pipe-separated. For example, if the word cat appears in the three dictionaries as gato, gatto and chat, the combined file will contain a line with cat followed by gato|gatto|chat. First we download Hadoop.

After that we go to the directory where we want to install Hadoop and download it: wget http://apache.favoritelinks.net//hadoop/core/stable/hadoop-0.20.2.tar.gz.

Then we unzip it: tar zxvf hadoop-0.20.2.tar.gz. We downloaded the dictionary files from http://www.ilovelanguages.com/IDP/files/.txt.

After getting the dictionary files, the next step is to put them into HDFS. To accomplish this we first need to format the HDFS filesystem. To do that, we go to Hadoop's bin directory and execute ./hadoop namenode -format. By default this formats the directory /tmp/hadoop-username/dfs/name. After formatting the filesystem we put our dictionary files into it.

We could use a PutMerge operation to merge the files and copy them into HDFS in a single step, but because our example files are small it is easier to merge them locally first and then copy the result to HDFS (a sketch of a PutMerge-style helper is shown after the commands below).

  1. cat French.txt >> fulldictionary.txt
  2. cat Italian.txt >> fulldictionary.txt
  3. cat Spanish.txt >> fulldictionary.txt
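For reference, the following is a minimal, hedged sketch of what a PutMerge-style helper could look like using Hadoop's FileSystem API: it reads every file in a local directory and writes their contents into a single file in HDFS. The class name, argument handling and buffer size are illustrative assumptions, not part of the original example.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutMerge
{
    public static void main(String[] args) throws IOException
    {
        Configuration conf = new Configuration();
        FileSystem local = FileSystem.getLocal(conf); // the local file system
        FileSystem hdfs = FileSystem.get(conf);       // HDFS, as configured in core-site.xml

        Path inputDir = new Path(args[0]);  // local directory holding the dictionary files
        Path hdfsFile = new Path(args[1]);  // target file in HDFS

        FileStatus[] inputFiles = local.listStatus(inputDir);
        FSDataOutputStream out = hdfs.create(hdfsFile);
        byte[] buffer = new byte[4096];
        for (FileStatus file : inputFiles)
        {
            FSDataInputStream in = local.open(file.getPath());
            int bytesRead;
            while ((bytesRead = in.read(buffer)) > 0)
            {
                out.write(buffer, 0, bytesRead); // append this file's bytes to the merged HDFS file
            }
            in.close();
        }
        out.close();
    }
}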

To copy the file to HDFS we execute the following command:

./hadoop fs -put /home/cscarioni/Documentos/hadooparticlestuff/fulldictionary.txt /tmp/hadoop-cscarioni/dfs/name/file
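If we want to confirm the file is in place, we can list the target directory with ./hadoop fs -ls /tmp/hadoop-cscarioni/dfs/name (the exact listing will depend on your setup).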

The actual MapReduce program, shown below, is completely contained in a single Java file.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class Dictionary
{
    // The mapper receives (word, translations) pairs from KeyValueTextInputFormat
    // and emits one (word, translation) pair per comma-separated translation.
    public static class WordMapper extends Mapper<Text, Text, Text, Text>
    {
        private Text word = new Text();

        public void map(Text key, Text value, Context context) throws IOException, InterruptedException
        {
            StringTokenizer itr = new StringTokenizer(value.toString(), ",");
            while (itr.hasMoreTokens())
            {
                word.set(itr.nextToken());
                context.write(key, word);
            }
        }
    }

    // The reducer receives a word together with all of its translations and
    // concatenates them into a single pipe-separated string.
    public static class AllTranslationsReducer extends Reducer<Text, Text, Text, Text>
    {
        private Text result = new Text();

        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException
        {
            String translations = "";
            for (Text val : values)
            {
                translations += "|" + val.toString();
            }
            result.set(translations);
            context.write(key, result);
        }
    }

    // The driver: configures the job and submits it to the framework.
    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "dictionary");
        job.setJarByClass(Dictionary.class);
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(AllTranslationsReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/tmp/hadoop-cscarioni/dfs/name/file"));
        FileOutputFormat.setOutputPath(job, new Path("output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The above program consists of three parts.

1) A static class that holds the mapper:

The mapper's job is to produce a list of key-value pairs to be processed later. The ideal structure of this list is one in which keys are repeated across many of its elements (produced by this same mapper or by another one whose results will be combined with this one's), so that the following phases of the MapReduce algorithm can make use of them. A mapper receives a key-value pair as parameters and, as said, produces a list of new key-value pairs.

The key-value pair the mapper receives depends on the InputFormat implementation used. In this example we use KeyValueTextInputFormat. This implementation takes, for each line of the input file, everything up to the first separator (a tab character by default) as the key and the rest of the line as the value. So if a line contains aaa, a tab, and then bbb,ccc,ddd, we get aaa as the key and bbb,ccc,ddd as the value. For each input pair, the mapper generates one pair per comma-separated value, combining the key with that value. For the input aaa / bbb,ccc,ddd the output will be the list (aaa bbb), (aaa ccc), (aaa ddd), and likewise for every other input pair.
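To make this concrete, the following is a minimal, self-contained sketch (plain Java, no Hadoop required) of the pairs WordMapper emits for one input record; the key aaa and value bbb,ccc,ddd are purely illustrative.

import java.util.StringTokenizer;

public class MapperSketch
{
    public static void main(String[] args)
    {
        String key = "aaa";           // what KeyValueTextInputFormat would pass as the key
        String value = "bbb,ccc,ddd"; // the rest of the line, passed as the value
        StringTokenizer itr = new StringTokenizer(value, ",");
        while (itr.hasMoreTokens())
        {
            // One emitted pair per token: (aaa, bbb), (aaa, ccc), (aaa, ddd)
            System.out.println(key + " " + itr.nextToken());
        }
    }
}

Running it prints aaa bbb, aaa ccc and aaa ddd, which is exactly the list of pairs the real mapper writes to the context.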

2) Another static class that holds the reducer:

Between the mapper and the reducer, the shuffle and combine phases take place. The shuffle phase ensures that every key-value pair with the same key goes to the same reducer; the combining part converts all the key-value pairs sharing a key into the grouped form key, list(values), which is what the reducer ultimately receives. The reducer's job is to take the key, list(values) pair, operate on the grouped values and store the result somewhere. Our reducer takes the key, list(values) pair, loops through the values concatenating them into a pipe-separated string, and sends the new key-value pair to the output. So the pair aaa, list(aaa, bbb) is converted to aaa |aaa|bbb (the leading pipe comes from the simple concatenation in the loop) and stored.

To run our program, simply run it as a normal Java main class with the Hadoop libraries on the classpath (all the jars in the Hadoop home directory and all the jars in the Hadoop lib directory; you can also run the hadoop command with the classpath option to get the full classpath needed). For this first test I used the DrJava IDE. Running the program in my case generated a file called part-r-00000 with the expected result.
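Again as a minimal, self-contained sketch (plain Java, no Hadoop required), this is what AllTranslationsReducer does with one grouped key; the key and values are illustrative.

import java.util.Arrays;
import java.util.List;

public class ReducerSketch
{
    public static void main(String[] args)
    {
        String key = "aaa";                                 // the grouped key
        List<String> values = Arrays.asList("aaa", "bbb");  // all values shuffled to this key
        String translations = "";
        for (String val : values)
        {
            translations += "|" + val; // the same concatenation the reducer performs
        }
        // Prints: aaa |aaa|bbb
        System.out.println(key + " " + translations);
    }
}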

3) The main method works as the driver of our application: it configures the job (the mapper and reducer classes, the input format, and the input and output paths) and submits it to the framework.

Distributing it:

The main reason the MapReduce framework exists is to process large amounts of data in a distributed manner on commodity machines. Running it on a single machine, as we did here, has little utility beyond teaching us how it works.

Distributing the application can be the subject of another, more advanced post.

The following are the main reasons for using Hadoop.

  1. The need to process huge datasets on large clusters of computers.
  2. It is very expensive to build reliability into each application.
  3. Nodes fail every day. Failure is expected rather than exceptional, and the number of nodes in a cluster is not constant.
  4. The need for a common infrastructure that is efficient, reliable and easy to use.
  5. It is open source, under the Apache License.