Hadoop TDG 2 -- introduction

Introduction:

First of all, why do we need Hadoop?

The good news is that Big Data is here. The bad news is that we are struggling to store and analyze it.

Faced with massive volumes of data, we need to store and analyze them efficiently, and that is exactly what Hadoop provides.

This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis system. The storage is provided by HDFS, and analysis by MapReduce.


How does Hadoop differ from existing technologies?

Differences from an RDBMS

  1. MapReduce is a good fit for problems that need to analyze the whole dataset, in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times of a relatively small amount of data.
  2. MapReduce suits applications where the data is written once, and read many times, whereas a relational database is good for datasets that are continually updated.
  3. MapReduce works well on unstructured or semistructured data, since it is designed to interpret the data at processing time. In other words, the input keys and values for MapReduce are not an intrinsic property of the data, but they are chosen by the person analyzing the data.

Over time, however, the differences between the two are likely to blur, both as relational databases start incorporating some of the ideas from MapReduce (such as Aster Data’s and Greenplum’s databases), and, from the other direction, as higher-level query languages built on MapReduce (such as Pig and Hive) make MapReduce systems more approachable to traditional database programmers.

 

Differences from Grid Computing

MapReduce tries to colocate the data with the compute node, so data access is fast because it is local. This feature, known as data locality, is at the heart of MapReduce and is the reason for its good performance.

Grid computing distributes the work across a cluster of machines that access a shared filesystem, hosted by a SAN. This works well for predominantly compute-intensive jobs, but it becomes a problem when nodes need to access larger data volumes (hundreds of gigabytes, the point at which MapReduce really starts to shine), because the network bandwidth becomes the bottleneck and compute nodes sit idle.

 

The Origins of Hadoop

Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.


A MapReduce Example

Let's walk through a simple example to understand the MapReduce process.

The example: given a weather station's temperature records, find the maximum temperature for each year.

As you can see, the raw input data is unstructured:

Example

(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)

(424, 0043012650999991949032418004...0500001N9+00781+99999999999...) 
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...) 
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...) 
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)

The map task's job is to extract the small amount of data we need from the large volume of raw input and shape it into (key, value) pairs. Because the map runs locally on the node that holds the data, it is very efficient: only the small amount of extracted data has to be sent over the network.

The map function merely extracts the year and the air temperature from each record and emits them as its output:

(1950, 0)

(1949, 78) 
(1950, 22) 
(1950, -11) 
(1949, 111)

Before the framework passes the map output to the reduce, it does some preprocessing: the (key, value) pairs are sorted and grouped by key. 
The output from the map function is processed by the MapReduce framework before being sent to the reduce function. This processing sorts and groups the key-value pairs by key. 
(1949, [111, 78]) 
(1950, [0, 22, -11])

The grouped form above is only a conceptual view; what is actually passed to the reduce is just a key-sorted sequence, as follows:

(1949, 111)

(1949, 78)

(1950, 0) 
(1950, 22) 
(1950, -11)

The reduce applies whatever logic you define to the map output to produce the final answer; here the logic is to find the maximum value.

All the reduce function has to do now is iterate through the list and pick up the maximum reading. 
(1949, 111) 
(1950, 22)

That is a complete MapReduce run, and it is fairly simple and easy to understand.

 

Java MapReduce

Having run through how the MapReduce program works, the next step is to express it in code. We need three things: a map function, a reduce function, and some code to run the job.

Now let's see how to write the example above in Java.

 

Map function

The Mapper interface is a generic type, with four formal type parameters that specify the input key, input value, output key, and output value types of the map function. 
For the present example, the input key is a long integer offset, the input value is a line of text, the output key is a year, and the output value is an air temperature (an integer). 
Rather than use built-in Java types, Hadoop provides its own set of basic types that are optimized for network serialization. These are found in the org.apache.hadoop.io package. 
Here we use LongWritable, which corresponds to a Java Long, Text (like Java String), and IntWritable (like Java Integer).
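
To get a feel for these box types, here is a minimal standalone sketch (not from the book; the class name is mine) showing how they wrap plain Java values:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class WritableSketch {
    public static void main(String[] args) {
        LongWritable offset = new LongWritable(106L);      // corresponds to java.lang.Long
        Text line = new Text("0043011990999991950...");    // corresponds to java.lang.String
        IntWritable temp = new IntWritable(22);            // corresponds to java.lang.Integer

        System.out.println(offset.get() + " " + line.toString() + " " + temp.get());

        // Writables are mutable, which lets Hadoop reuse a single object across many records.
        temp.set(-11);
        System.out.println(temp.get());
    }
}

With those types in hand, the Mapper implementation looks like this: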

 

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class MaxTemperatureMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> 
{
   private static final int MISSING = 9999;
   public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
     throws IOException {
       String line = value.toString();
       String year = line.substring(15, 19);
       int airTemperature;
       if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
           airTemperature = Integer.parseInt(line.substring(88, 92));
       } else {
           airTemperature = Integer.parseInt(line.substring(87, 92));
       }
       String quality = line.substring(92, 93);
       if (airTemperature != MISSING && quality.matches("[01459]")) {
           output.collect(new Text(year), new IntWritable(airTemperature));
       }
   }
}

Reduce function

The reduce function is similarly defined using a Reducer. 
Again, four formal type parameters are used to specify the input and output types, this time for the reduce function. 
The input types of the reduce function must match the output types of the map function: Text and IntWritable. 
And in this case, the output types of the reduce function are Text and IntWritable, for a year and its maximum temperature, which we find by iterating through the temperatures and comparing each 
with a record of the highest found so far.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class MaxTemperatureReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
        int maxValue = Integer.MIN_VALUE;
        while (values.hasNext()) {
            maxValue = Math.max(maxValue, values.next().get());
        }
        output.collect(key, new IntWritable(maxValue));
     }
}

Run the job

A JobConf object forms the specification of the job. It gives you control over how the job is run. 
1.  Specify the JAR file containing the job code. You don't give the JAR file name itself; instead you pass a class name, and Hadoop locates the JAR file containing that class on its own. Why go to this trouble? Presumably to keep the driver independent of the JAR file's name.

When we run this job on a Hadoop cluster, we will package the code into a JAR file (which Hadoop will distribute around the cluster). 
Rather than explicitly specify the name of the JAR file, we can pass a class in the JobConf constructor, which Hadoop will use to locate the relevant JAR file by looking for the JAR file containing 
this class.

2.  Specify the job's input and output paths 
Having constructed a JobConf object, we specify the input and output paths. 
An input path is specified by calling the static addInputPath() method on FileInputFormat, and it can be a single file, a directory (in which case, the input forms all the files in that directory), or a file pattern. As the name suggests, addInputPath() can be called more than once to use input from multiple paths.

The output path (of which there is only one) is specified by the static setOutputPath() method on FileOutputFormat. It specifies a directory where the output files from the reducer functions are written. The directory shouldn’t exist before running the job, as Hadoop will complain and not run the job. This precaution is to prevent data loss (it can be very annoying to accidentally overwrite the output of a long job with another).
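
Because of this check, a common pattern is for the driver to delete a stale output directory left over from an earlier run before submitting the job. A minimal sketch using the FileSystem API (the helper name is mine; only do this when you are sure the old output is disposable):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputCleanup {
    // Recursively deletes the output directory if it already exists.
    // Use with care: this deliberately bypasses Hadoop's protection against overwriting results.
    static void deleteOutputIfExists(Configuration conf, String dir) throws IOException {
        Path output = new Path(dir);
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(output)) {
            fs.delete(output, true); // true = recursive
        }
    }
}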

3.  Specify the Mapper and Reducer classes 
Next, we specify the map and reduce types to use via the setMapperClass() and setReducerClass() methods.

4.  Specify the output types of map and reduce 
The setOutputKeyClass() and setOutputValueClass() methods control the output types for the map and the reduce functions, which are often the same, as they are in our case. 
If they are different, then the map output types can be set using the methods setMapOutputKeyClass() and setMapOutputValueClass(). 
The input types are controlled via the input format, which we have not explicitly set since we are using the default TextInputFormat. 
After setting the classes that define the map and reduce functions, we are ready to run the job.
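
For example, if (hypothetically) the map emitted LongWritable values while the reduce emitted IntWritable values, the driver would set the two sets of output types separately; a sketch, not part of this example:

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);       // final (reduce) output types
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(LongWritable.class);   // map output types, needed only because they differ
conf.setInputFormat(TextInputFormat.class);        // the default input format, shown here only for completeness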

5.  Run 
The static runJob() method on JobClient submits the job and waits for it to finish, writing information about its progress to the console.

 

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
public class MaxTemperature {
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }
        JobConf conf = new JobConf(MaxTemperature.class);
        conf.setJobName("Max temperature");
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setMapperClass(MaxTemperatureMapper.class);
        conf.setReducerClass(MaxTemperatureReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        JobClient.runJob(conf);
    }
}

 

The new Java MapReduce API

Release 0.20.0 of Hadoop included a new Java MapReduce API, sometimes referred to as “Context Objects,” designed to make the API easier to evolve in the future. 
The new API is type-incompatible with the old, however, so applications need to be rewritten to take advantage of it. 
There are several notable differences between the two APIs: 
• The new API favors abstract classes over interfaces, since these are easier to evolve. 
For example, you can add a method (with a default implementation) to an abstract class without breaking old implementations of the class. In the new API, the Mapper and Reducer interfaces are now abstract classes. 
• The new API is in the org.apache.hadoop.mapreduce package (and subpackages). The old API can still be found in org.apache.hadoop.mapred. 
• The new API makes extensive use of context objects that allow the user code to communicate with the MapReduce system. The MapContext, for example, essentially unifies the role of the JobConf, the OutputCollector, and the Reporter. 
• The new API supports both a “push” and a “pull” style of iteration. 
In both APIs, key-value record pairs are pushed to the mapper, but in addition, the new API allows a mapper to pull records from within the map() method. 
The same goes for the reducer. An example of how the “pull” style can be useful is processing records in batches, rather than one by one (see the sketch after this list).

• Configuration has been unified. The old API has a special JobConf object for job configuration, which is an extension of Hadoop’s vanilla Configuration object (used for configuring daemons; see “The Configuration API” on page 130). In the new API, this distinction is dropped, so job configuration is done through a Configuration. 
• Job control is performed through the Job class, rather than JobClient, which no longer exists in the new API. 
• Output files are named slightly differently: part-m-nnnnn for map outputs, and part-r-nnnnn for reduce outputs (where nnnnn is an integer designating the part number, starting from zero).
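
The sketch promised above: in the new API, a Mapper can override run() and pull records from the Context itself, for example to process them in batches. This is only an illustration of the idea; the class name and batch size are mine, not from the book.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BatchingMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        List<String> batch = new ArrayList<String>();
        // Pull records from the framework instead of having them pushed to map().
        while (context.nextKeyValue()) {
            batch.add(context.getCurrentValue().toString());
            if (batch.size() == 100) { // hypothetical batch size
                processBatch(batch, context);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            processBatch(batch, context);
        }
        cleanup(context);
    }

    private void processBatch(List<String> batch, Context context)
        throws IOException, InterruptedException {
        // Placeholder: emit one record per batch just to show the shape of the idea.
        context.write(new Text("batch"), new Text(String.valueOf(batch.size())));
    }
}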

 

Here is the whole application rewritten to use the new API:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NewMaxTemperature {
    static class NewMaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final int MISSING = 9999;
        public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
            String line = value.toString();
            String year = line.substring(15, 19);
            int airTemperature;
            if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
                airTemperature = Integer.parseInt(line.substring(88, 92));
            } else {
                airTemperature = Integer.parseInt(line.substring(87, 92));
            }
            String quality = line.substring(92, 93);
            if (airTemperature != MISSING && quality.matches("[01459]")) {
                context.write(new Text(year), new IntWritable(airTemperature));
            }
       }
   }

   static class NewMaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
       public void reduce(Text key, Iterable<IntWritable> values, Context context)
         throws IOException, InterruptedException {
             int maxValue = Integer.MIN_VALUE;
             for (IntWritable value : values) {
                 maxValue = Math.max(maxValue, value.get());
             }
             context.write(key, new IntWritable(maxValue));
      }
   }

   public static void main(String[] args) throws Exception {
       if (args.length != 2) {
           System.err.println("Usage: NewMaxTemperature <input path> <output path>");
           System.exit(-1);
       }
       Job job = new Job();
       job.setJarByClass(NewMaxTemperature.class);
       FileInputFormat.addInputPath(job, new Path(args[0]));
       FileOutputFormat.setOutputPath(job, new Path(args[1]));
       job.setMapperClass(NewMaxTemperatureMapper.class);
       job.setReducerClass(NewMaxTemperatureReducer.class);
       job.setOutputKeyClass(Text.class);
       job.setOutputValueClass(IntWritable.class);
       System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

 

Scaling Out

You’ve seen how MapReduce works for small inputs; now it’s time to take a bird’s-eye view of the system and look at the data flow for large inputs. 
For simplicity, the examples so far have used files on the local filesystem. 
However, to scale out, we need to store the data in a distributed filesystem, typically HDFS (which you’ll learn about in the next chapter), to allow Hadoop to move the MapReduce computation to each machine hosting a part of the data. Let’s see how this works.

 

MapReduce Data Flow

A MapReduce job is a unit of work that the client wants to be performed: it consists of the input data, the MapReduce program, and configuration information. 
Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks. 
In other words, a MapReduce job is made up of the input data, the MapReduce code, and configuration information, and the job is in turn divided into two kinds of tasks: map and reduce.

 

There are two types of nodes that control the job execution process: a jobtracker and a number of tasktrackers. The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker. 
This is a centralized design: a master node, the jobtracker, plus a set of worker nodes, the tasktrackers; the master manages and coordinates the workers.

 

Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. 
Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split. 
Having many splits means the time taken to process each split is small compared to the time to process the whole input. 
Distributed processing with MapReduce starts by splitting the data; only once it is split can it be processed in parallel. So how large should a split be?

So if we are processing the splits in parallel, the processing is better load-balanced if the splits are small, since a faster machine will be able to process proportionally more splits over the course of the job than a slower machine. Even if the machines are identical, failed processes or other jobs running concurrently make load balancing desirable, and the quality of the load balancing increases as the splits become more fine-grained. 
On the other hand, if splits are too small, then the overhead of managing the splits and of map task creation begins to dominate the total job execution time. 
Finer-grained splits improve load balancing and make it easier to absorb failed processes, 
but if the splits are too fine, the overhead of managing them and of creating map tasks starts to dominate the total job execution time.

For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default, although this can be changed for the cluster (for all newly created files), or specified when each file is created. 
The compromise, then, is that for most jobs the default split size of 64 MB, one HDFS block, is a reasonable choice.
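
As a rough sketch of the two knobs mentioned above (the property name and file path are illustrative, and exact names vary between Hadoop releases):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SplitSizeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Per-file: request a 128 MB block size when the file is created
        // (signature: create(path, overwrite, bufferSize, replication, blockSize)).
        FSDataOutputStream out =
            fs.create(new Path("/data/records.txt"), true, 4096, (short) 3, 128L * 1024 * 1024);
        out.close();

        // Per-job: raise the minimum split size (old-API property name).
        conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);
    }
}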

Hadoop does its best to run the map task on a node where the input data resides in HDFS. This is called the data locality optimization.
It should now be clear why the optimal split size is the same as the block size: it is the largest size of input that can be guaranteed to be stored on a single node. If the split spanned two blocks, it would be unlikely that any HDFS node stored both blocks, so some of the split would have to be transferred across the network to the node running the map task, which is clearly less efficient than running the whole map task using local data. 
Why is 64 MB reasonable? Because the HDFS block size is also 64 MB and the block is the unit of storage in HDFS. To preserve the data locality optimization, a split must not exceed one block, since a split spanning multiple blocks has no guarantee that those blocks live on the same node. 

Map tasks write their output to the local disk, not to HDFS. Why is this? 
Map output is intermediate output: it’s processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. So storing it in HDFS, with replication, would be overkill. If the node running the map task fails before the map output has been consumed by the reduce task, then Hadoop will automatically rerun the map task on another node to re-create the map output. 

Reduce tasks don’t have the advantage of data locality—the input to a single reduce task is normally the output from all mappers. 
In the present example, we have a single reduce task that is fed by all of the map tasks. Therefore, the sorted map outputs have to be transferred across the network to the node where the reduce task is running, where they are merged and then passed to the user-defined reduce function. The output of the reduce is normally stored in HDFS for reliability. 
As explained in Chapter 3, for each HDFS block of the reduce output, the first replica is stored on the local node, with other replicas being stored on off-rack nodes. Thus, writing the reduce output does consume network bandwidth, but only as much as a normal HDFS write pipeline consumes. 
There are only a few reducers, and the output of every mapper has to be transferred to them over the network, whereas mappers enjoy the data locality optimization: the data they process is local. So the mappers' job is to extract the needed data from the large volume of raw input and preprocess it, keeping what gets sent to the reducers as small as possible. 
The reducers' output is the final result, so it needs to be stored in HDFS.

 

The whole data flow with a single reduce task is illustrated in Figure 2-2. The dotted boxes indicate nodes, the light arrows show data transfers on a node, and the heavy arrows show data transfers between nodes.

 

When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. 
There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition; a simple hash of the key is enough to achieve this. 
The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner—which buckets keys using a hash function—works very well.
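
As a sketch of what the default behavior looks like, a user-defined partitioner with the same hash-bucketing logic might be written like this in the old API (the class name is mine):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class HashLikePartitioner implements Partitioner<Text, IntWritable> {
    public void configure(JobConf job) {
        // nothing to configure for this sketch
    }

    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Bucket by the key's hash code; masking with Integer.MAX_VALUE keeps the result non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

It would be plugged in with conf.setPartitionerClass(HashLikePartitioner.class), but for most jobs the built-in default is exactly what you want.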

 

The data flow for the general case of multiple reduce tasks is illustrated in Figure 2-3. 
This diagram makes it clear why the data flow between map and reduce tasks is colloquially known as “the shuffle,” as each reduce task is fed by many map tasks. 
The shuffle is more complicated than this diagram suggests, and tuning it can have a big impact on job execution time, as you will see in “Shuffle and Sort” on page 177.

Finally, it’s also possible to have zero reduce tasks. 
This can be appropriate when you don’t need the shuffle since the processing can be carried out entirely in parallel (a few examples are discussed in “NLineInputFormat” on page 211). In this case, the only off-node data transfer is when the map tasks write to HDFS (see Figure 2-4).
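
In the old API used in this chapter, a map-only job is requested simply by setting the number of reduce tasks to zero in the driver:

conf.setNumReduceTasks(0); // no reduce phase, so no shuffle; each map task writes its output straight to HDFS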

 

Combiner Functions

Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. 
Hadoop allows the user to specify a combiner function to be run on the map output—the combiner function’s output forms the input to the reduce function. 
Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all. In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer.

The purpose is purely optimization: to further reduce the amount of data transferred from map to reduce, we can add a combiner after the map, which in effect runs the reduce logic locally on each map node first, shrinking the data that needs to cross the network. Consider the following example.

The first map produced the output: 
(1950, 0) 
(1950, 20)   
(1950, 10) 
And the second produced: 
(1950, 25) 
(1950, 15) 
The reduce function input would then be: 
(1950, [0, 20, 10, 25, 15]) 
With a combiner, each mapper's output is preprocessed: the maximum is picked out on each map node before anything is sent to the reducer, which greatly cuts the network transfer. 
The reduce function's input shrinks from (1950, [0, 20, 10, 25, 15]) to (1950, [20, 25]).

Not all functions possess this property. For example, if we were calculating mean temperatures, then we couldn’t use the mean as our combiner function. 
The combiner function doesn’t replace the reduce function. (How could it? The reduce function is still needed to process records with the same key from different maps.) 
But it can help cut down the amount of data shuffled between the maps and the reduces, and for this reason alone it is always worth considering whether you can use a combiner function in your MapReduce job.
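
To see concretely why max qualifies as a combiner but mean does not, using the numbers above:

max(0, 20, 10, 25, 15) = 25, and max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25, the same answer. 
mean(0, 20, 10, 25, 15) = 14, but mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15, a different (wrong) answer.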

Setting it up is also simple:

conf.setMapperClass(MaxTemperatureMapper.class);
conf.setCombinerClass(MaxTemperatureReducer.class);
conf.setReducerClass(MaxTemperatureReducer.class);

Hadoop Streaming

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. 
Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program. 
Streaming is naturally suited for text processing (although, as of version 0.21.0, it can handle binary streams, too), and when used in text mode, it has a line-oriented view of data. Map input data is passed over standard input to your map function, which processes it line by line and writes lines to standard output. A map output key-value pair is written as a single tab-delimited line. Input to the reduce function is in the same format—a tab-separated key-value pair—passed over standard input. The reduce function reads lines from standard input, which the framework guarantees are sorted by key, and writes its results to standard output. 

I prefer Python, so here is the example written in Python. 
Example 2-10. Map function for maximum temperature in Python 

#!/usr/bin/env python 
import re 
import sys 
for line in sys.stdin: 
    val = line.strip() 
    (year, temp, q) = (val[15:19], val[87:92], val[92:93]) 
    if (temp != "+9999" and re.match("[01459]", q)): 
        print "%s\t%s" % (year, temp) 

    
Example 2-11. Reduce function for maximum temperature in Python 

#!/usr/bin/env python 
import sys 
(last_key, max_val) = (None, 0) 
for line in sys.stdin: 
    (key, val) = line.strip().split("\t") 
    if last_key and last_key != key: 
        print "%s\t%s" % (last_key, max_val)  # valid because the framework sorts the reducer's input by key
        (last_key, max_val) = (key, int(val)) 
    else: 
        (last_key, max_val) = (key, max(max_val, int(val))) 
if last_key: 
    print "%s\t%s" % (last_key, max_val)

You can test that your map and reduce functions work with a simple local pipeline: 

% cat input/ncdc/sample.txt | src/main/ch02/python/max_temperature_map.py | \ 
sort | src/main/ch02/python/max_temperature_reduce.py 
1949 111 
1950 22

To run the MapReduce job with Hadoop Streaming: 

% hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \ 
-input input/ncdc/sample.txt \ 
-output output \ 
-mapper src/main/ch02/python/max_temperature_map.py \ 
-reducer src/main/ch02/python/max_temperature_reduce.py

For a more detailed example of writing MapReduce programs in Python, see this blog post:

http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python

 

There are now Python Hadoop frameworks built on top of the Streaming API that are very convenient to use directly.

mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.

mrjob fully supports Amazon's Elastic MapReduce (EMR) service, which allows you to buy time on a Hadoop cluster on an hourly basis. It also works with your own Hadoop cluster.

https://github.com/Yelp/mrjob

 

Dumbo is a project that allows you to easily write and run Hadoop programs in Python (it’s named after Disney’s flying circus elephant, since the logo of Hadoop is an elephant and Python was named after the BBC series “Monty Python’s Flying Circus”). More generally, Dumbo can be considered to be a convenient Python API for writing MapReduce programs.

https://github.com/klbostee/dumbo/wiki


This article is excerpted from 博客园 (cnblogs); the original was published on 2011-07-04.
