MapReduce

MapReduce is a programming model and distributed computing framework that processes large-scale data with a divide-and-conquer approach.

Overview

MapReduce breaks a complex computation into two main phases: a Map phase and a Reduce phase.

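Execution Flow

Split: Divide the input data into multiple splits.
Map: Apply the map function to each split, producing intermediate key-value pairs.
Shuffle: Sort and group the Map output by key.
Reduce: Aggregate the values in each group.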
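
The four steps above can be sketched in plain Java, without any cluster or Hadoop dependency. This is only an in-memory illustration: the class name `PipelineSketch`, its method names, and the `splitSize` parameter are hypothetical, and a real framework runs each step on different machines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration only; names and parameters are assumptions, not a framework API.
class PipelineSketch {

    // Split: cut the input lines into chunks of at most splitSize lines.
    static List<List<String>> split(List<String> lines, int splitSize) {
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += splitSize) {
            splits.add(lines.subList(i, Math.min(i + splitSize, lines.size())));
        }
        return splits;
    }

    // Map: emit a (word, 1) pair for every whitespace-separated token in one split.
    static List<Map.Entry<String, Integer>> map(List<String> split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : split) {
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    pairs.add(Map.entry(token, 1));
                }
            }
        }
        return pairs;
    }

    // Shuffle: group values by key; the TreeMap keeps keys in sorted order.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: sum the grouped values for each key.
    static TreeMap<String, Integer> reduce(TreeMap<String, List<Integer>> grouped) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    // Chain the four steps: split -> map -> shuffle -> reduce.
    static TreeMap<String, Integer> run(List<String> lines, int splitSize) {
        List<Map.Entry<String, Integer>> allPairs = new ArrayList<>();
        for (List<String> s : split(lines, splitSize)) {
            allPairs.addAll(map(s));
        }
        return reduce(shuffle(allPairs));
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "b c", "a"), 2)); // prints {a=3, b=2, c=1}
    }
}
```

Each method takes only the previous step's output, which is what lets a framework run the Map calls for different splits in parallel.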

WordCount Example

java

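import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// The driver that configures and submits the job is not shown here.
public class WordCount {
    // Mapper: receives <byte offset, line> and emits <word, 1> for every token.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split the line on whitespace; the offset key is not used.
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: receives <word, [counts]> and emits <word, total>.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}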

Map Phase

The Map phase converts input data into key-value pairs. For WordCount:

Input: <offset, line content>
Output: <word, 1>
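
To make the input and output shapes concrete, here is a minimal sketch of the same map logic with plain Java types in place of the Hadoop `Writable` classes. The class `MapPhaseSketch` and its `map` method are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Illustration only: plain Java stand-in for a Mapper, not the Hadoop API.
class MapPhaseSketch {

    // Input: <offset, line content>. Output: one <word, 1> pair per token.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            // The offset key is ignored, as in the WordCount mapper.
            out.add(Map.entry(tokenizer.nextToken(), 1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map(0L, "hello world hello")); // prints [hello=1, world=1, hello=1]
    }
}
```

Note that the mapper does no counting of its own: a repeated word simply produces a repeated pair, and the totals are left to the Reduce phase.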

Reduce Phase

The Reduce phase aggregates the Map output. For WordCount:

Input: <word, [1, 1, 1, ...]>
Output: <word, count>
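
The reduce step can be sketched the same way, again with plain Java types rather than the Hadoop classes. `ReducePhaseSketch` and its `reduce` method are hypothetical names for illustration.

```java
import java.util.List;

// Illustration only: plain Java stand-in for a Reducer, not the Hadoop API.
class ReducePhaseSketch {

    // Input: <word, [1, 1, 1, ...]>. Output: the count for that word.
    static int reduce(String word, Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        String word = "hello";
        System.out.println(word + "\t" + reduce(word, List.of(1, 1, 1))); // prints hello	3
    }
}
```

The function is called once per key, so it never needs to know about any other word.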

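Shuffle Phase

The Shuffle phase is the bridge between Map and Reduce:

Partition: Assign each Map output record to a Reduce task.
Sort: Sort the Map output by key.
Merge: Combine the values that share a key, so that each Reduce call receives one key together with all of its values.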
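
The three shuffle operations can be sketched in memory as follows. This assumes a hash-modulo partitioning rule, which is the usual default idea but is configurable in real frameworks; the class `ShuffleSketch`, its methods, and the `numReduceTasks` parameter are hypothetical names for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: an in-memory shuffle, not how a cluster moves data over the network.
class ShuffleSketch {

    // Partition: pick a reduce task for a key by hashing it (assumed hash-modulo rule).
    static int partition(String key, int numReduceTasks) {
        // Masking with Integer.MAX_VALUE keeps the result non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Sort and merge: one TreeMap per reduce task keeps keys sorted,
    // and values that share a key are collected into a single list.
    static List<TreeMap<String, List<Integer>>> shuffle(
            List<Map.Entry<String, Integer>> mapOutput, int numReduceTasks) {
        List<TreeMap<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            partitions.add(new TreeMap<>());
        }
        for (Map.Entry<String, Integer> pair : mapOutput) {
            partitions.get(partition(pair.getKey(), numReduceTasks))
                    .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                    .add(pair.getValue());
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("b", 1), Map.entry("a", 1), Map.entry("c", 1), Map.entry("a", 1));
        System.out.println(shuffle(mapOutput, 2)); // prints [{b=[1]}, {a=[1, 1], c=[1]}]
    }
}
```

Because the partition depends only on the key, every record for a given word lands in the same reduce task no matter which Map task produced it.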

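Advantages and Limitations

Advantages

Scalability: Scales out to large clusters.
Fault tolerance: Handles node failures automatically by re-running failed tasks.
Ease of use: The programming model is simple.

Limitations

High latency: Suited to batch processing, not real-time computation.
Large volume of intermediate data: Intermediate results are written to disk, which requires heavy disk I/O.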
