Word Count Example in Detail

WordCount.java

The full Java code file is listed below. This example uses the new API (you will see many examples using the older API online; try to stick with this style). It contains one outer class with a main method, a run method, and two inner classes named Map and Reduce. Some things to note:

  • The package given to this file is wc (short for word count). You will change the package name for other problems that you work on with other data.
  • lines 6-13 show the hadoop Java libraries that you typically import.
  • line 15 shows how you define any hadoop MapReduce class, which we happened to call WordCount. For other work, you would change the name of the class.
  • lines 17-20 show the typical main method, which indicates that you want to run this class and its mapper and reducer within the hadoop framework, via a hadoop class called ToolRunner.
  • The run method in lines 22-46 shows how you set up parameters for the job. This is a typical minimum set of job parameters. You may find that you need others, such as requesting how many reduce tasks to use (see the short sketch after this list). See the hadoop documentation for the Job class for more information.
  • The inner class called Map starts on line 48. It extends the Mapper class defined in the hadoop API. Within the < and > brackets are listed the data types for the input key and value and the emitted key and value. You override a method called map that defines the work of the mapper function.
  • The inner class called Reduce starts on line 64. It extends the Reducer class defined in the hadoop API. Within the < and > brackets are listed the data types for the input key and value and the emitted key and value. You override a method called reduce that defines the work of the reducer function.
  • Emitting values is done with a Context class that is defined within the hadoop API. A Context object is passed by the framework to the map method of the Mapper class and the reduce method of the Reducer class.
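
As an example of adding a job parameter beyond the minimum set, the run method could also request a specific number of reduce tasks. The fragment below is only a sketch: job is the same Job object created in the run method, and the value 4 is an arbitrary example.

    // Sketch only: ask the framework for four reduce tasks.
    // (4 is an arbitrary example; choose a value that suits your data and cluster.)
    job.setNumReduceTasks(4);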
 1  package wc;
 2
 3  import java.io.IOException;
 4  import java.util.*;
 5
 6  import org.apache.hadoop.conf.*;
 7  import org.apache.hadoop.fs.*;
 8  import org.apache.hadoop.conf.*;
 9  import org.apache.hadoop.io.*;
10  import org.apache.hadoop.mapreduce.*;
11  import org.apache.hadoop.mapreduce.lib.input.*;
12  import org.apache.hadoop.mapreduce.lib.output.*;
13  import org.apache.hadoop.util.*;
14
15  public class WordCount extends Configured implements Tool {
16
17    public static void main(String args[]) throws Exception {
18      int res = ToolRunner.run(new WordCount(), args);
19      System.exit(res);
20    }
21
22    public int run(String[] args) throws Exception {
23      Path inputPath = new Path(args[0]);
24      Path outputPath = new Path(args[1]);
25
26      Configuration conf = getConf();
27      Job job = new Job(conf, this.getClass().toString());
28
29      FileInputFormat.setInputPaths(job, inputPath);
30      FileOutputFormat.setOutputPath(job, outputPath);
31
32      job.setJobName("WordCount");
33      job.setJarByClass(WordCount.class);
34      job.setInputFormatClass(TextInputFormat.class);
35      job.setOutputFormatClass(TextOutputFormat.class);
36      job.setMapOutputKeyClass(Text.class);
37      job.setMapOutputValueClass(IntWritable.class);
38      job.setOutputKeyClass(Text.class);
39      job.setOutputValueClass(IntWritable.class);
40
41      job.setMapperClass(Map.class);
42      job.setCombinerClass(Reduce.class);
43      job.setReducerClass(Reduce.class);
44
45      return job.waitForCompletion(true) ? 0 : 1;
46    }
47
48    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
49      private final static IntWritable one = new IntWritable(1);
50      private Text word = new Text();
51
52      @Override
53      public void map(LongWritable key, Text value,
54                      Mapper.Context context) throws IOException, InterruptedException {
55        String line = value.toString();
56        StringTokenizer tokenizer = new StringTokenizer(line);
57        while (tokenizer.hasMoreTokens()) {
58          word.set(tokenizer.nextToken());
59          context.write(word, one);
60        }
61      }
62    }
63
64    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
65
66      @Override
67      public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
68        int sum = 0;
69        for (IntWritable value : values) {
70          sum += value.get();
71        }
72
73        context.write(key, new IntWritable(sum));
74      }
75    }
76
77  }
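
To see how data flows through these classes, suppose one line of the input is "the cat sat on the mat" (an arbitrary example). The map method tokenizes that line and emits the pairs (the, 1), (cat, 1), (sat, 1), (on, 1), (the, 1), (mat, 1). The framework then groups the emitted pairs by key, so the reduce method receives, for instance, the key the with the values [1, 1], sums them, and writes (the, 2) to the output.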

The ant build file

Hadoop launches jobs from a jar file containing the compiled Java code. In addition, we typically send two command line arguments through to the Java program: the input data file or directory, and an output directory for the results from the reduce tasks. Using a tool called ant makes it pretty quick to create a jar file from the above code.
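
Once the jar file exists, launching the job typically looks something like the command below. The jar name wc.jar comes from the build file shown later on this page, and the input and output paths are placeholders for your own HDFS locations; because the build file also records wc.WordCount as the jar's Main-Class, the class name can usually be omitted.

    hadoop jar wc.jar wc.WordCount <input path> <output directory>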

The ant tool uses an xml file that describes what needs to be compiled and packaged into a jar file. Here is the one you used for the above WordCount example:

 1  <project name="hadoopCompile" default="jar" basedir=".">
 2     <target name="init">
 3        <property name="sourceDir" value="."/>
 4        <property name="outputDir" value="classes" />
 5        <property name="buildDir" value="jar" />
 6        <property name="lib.dir"     value="/usr/lib/hadoop"/>
 7
 8        <path id="classpath">
 9          <fileset dir="${lib.dir}" includes="**/*.jar"/>
10        </path>
11     </target>
12     <target name="clean" depends="init">
13        <delete dir="${outputDir}" />
14        <delete dir="${buildDir}" />
15     </target>
16     <target name="prepare" depends="clean">
17        <mkdir dir="${outputDir}" />
18        <mkdir dir="${buildDir}"/>
19     </target>
20     <target name="compile" depends="prepare">
21       <javac srcdir="${sourceDir}" destdir="${outputDir}" classpathref="classpath" />
22     </target>
23     <target name="jar" depends="compile">
24
25          <jar destfile="${buildDir}/wc.jar" basedir="${outputDir}">
26              <manifest>
27                  <attribute name="Main-Class" value="wc.WordCount"/>
28              </manifest>
29          </jar>
30     </target>
31  </project>

Line 6 contains the path to the hadoop directory that holds the hadoop library jars; the classpath defined in lines 8-10 picks up every jar file found beneath it. This is installation dependent and may be different on your system.

Classes get compiled in a subdirectory called classes, which ant will create when you run it using this xml file (line 4).

The fully built jar goes into a subdirectory called jar, which ant will also create (line 5).

Much like a makefile, the various <target> ... </target> entries in this file designate dependencies on other targets. For example, the target named jar in line 23 depends on the target named compile above it. When you execute ant, it behaves much like make: starting from the default target (jar), it follows the depends attributes back through the chain and then runs the targets in dependency order: init, clean, prepare, compile, and finally jar (which, in this particular file, is also the order in which they appear).

Line 21 is where the compilation is defined, using javac. Line 25 is where the creation of the jar file takes place, using the jar command.
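
To build, run the command ant in the directory that holds this file; ant looks for a file named build.xml by default, so save it under that name. Assuming WordCount.java sits in a wc subdirectory of that directory (matching its package declaration), the finished jar appears at jar/wc.jar, ready to be launched with the hadoop jar command shown earlier.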
