MapReduce Usage and Principles (Part 3)

Published: 2026/5/20 15:09:56

Combiner Pre-aggregation

A Combiner is an optional optimization step that runs after a Map task has produced its output and before that output reaches the Reduce side. It merges the Map output locally, collapsing key-value pairs that share a key into a single pair. This reduces the amount of data shipped to the Reduce nodes, lowers network overhead, and improves overall performance. In effect, a Combiner is a lightweight Reduce operation that eases the load on the network.

A Combiner is not mandatory; the developer decides whether to use one for a given job. Some scenarios are unsuitable, for example computing a mean, because the mean of partial means is generally not the mean of the whole data set.

Using a Combiner in MapReduce takes two steps:

1. Define a class that extends Reducer and implement the reduce method with the aggregation logic.
2. Set "job.setCombinerClass(YourCombiner.class)" in the Driver.

Using a Combiner on the Map side

The WordCount example is reworked below so that identical words are pre-aggregated on the Map side.

1) Define the WordCountCombiner class, which extends Reducer

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Reusable output value
    IntWritable total = new IntWritable();

    // Called once per key group
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
                          Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // Accumulate the counts
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Set the result for the current key
        total.set(sum);
        // Write the result
        context.write(key, total);
    }
}
```

Custom Reduce-side Grouping Comparator

By default, on the Reduce side each key corresponds to one group of data, and a single Reduce Task can process many groups. Which records fall into the same group is decided by whether their keys compare as equal. A custom grouping comparator lets us decide which records are treated as one group, that is, handled as if they had the same key.

Using a custom Reduce-side grouping comparator takes two steps:

1) Define the Reduce-side grouping comparator.
2) Register it in the Driver with "job.setGroupingComparatorClass(YourGroupingComparator.class)".

Case requirement

As the comments in the reducer below indicate, the goal is to output, for each year and month, the two highest-temperature records, which must fall on different days.

Implementation without a custom grouping comparator

1) Temperature

```java
import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/**
 * Temperature entity class
 */
public class Temperature implements WritableComparable<Temperature> {
    private String year;
    private String month;
    private String day;
    private Integer temp;

    // No-arg constructor (required for deserialization)
    public Temperature() {
    }

    // Full constructor
    public Temperature(String year, String month, String day, Integer temp) {
        this.year = year;
        this.month = month;
        this.day = day;
        this.temp = temp;
    }

    // Getters and setters
    public String getYear() { return year; }
    public void setYear(String year) { this.year = year; }
    public String getMonth() { return month; }
    public void setMonth(String month) { this.month = month; }
    public String getDay() { return day; }
    public void setDay(String day) { this.day = day; }
    public Integer getTemp() { return temp; }
    public void setTemp(Integer temp) { this.temp = temp; }

    @Override
    public String toString() {
        return year + "-" + month + "-" + day + "\t" + temp;
    }

    // Serialization
    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeUTF(this.year);
        dataOutput.writeUTF(this.month);
        dataOutput.writeUTF(this.day);
        dataOutput.writeInt(this.temp);
    }

    // Deserialization (fields must be read in the order they were written)
    @Override
    public void readFields(DataInput dataInput) throws IOException {
        year = dataInput.readUTF();
        month = dataInput.readUTF();
        day = dataInput.readUTF();
        temp = dataInput.readInt();
    }

    // Sort order: year and month ascending, then temperature descending
    @Override
    public int compareTo(Temperature o) {
        int yearCompare = this.getYear().compareTo(o.getYear());
        int monthCompare = this.getMonth().compareTo(o.getMonth());
        if (yearCompare == 0) {
            if (monthCompare == 0) {
                // Higher temperature sorts first; never returns 0 within a year-month
                return this.temp > o.temp ? -1 : 1;
            }
            return monthCompare;
        }
        return yearCompare;
    }
}
```

4) TemperatureReducer

```java
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.util.*;

public class TemperatureReducer extends Reducer<Temperature, Temperature, Temperature, NullWritable> {
    int cnt;
    String year;
    String month;
    String day;

    // Tracks which year-months have been handled in this partition:
    // key is "year-month", value is "day,count"
    HashMap<String, String> flagMap = new HashMap<>();

    // Records with equal keys form one group; here all records in the partition
    // are collected so they can be compared together
    ArrayList<Temperature> list = new ArrayList<>();

    @Override
    protected void reduce(Temperature key, Iterable<Temperature> values,
                          Reducer<Temperature, Temperature, Temperature, NullWritable>.Context context)
            throws IOException, InterruptedException {
        Iterator<Temperature> iterator = values.iterator();
        while (iterator.hasNext()) {
            Temperature next = iterator.next();
            // Hadoop reuses the value object across iterations, so store a copy
            list.add(new Temperature(next.getYear(), next.getMonth(), next.getDay(), next.getTemp()));
        }

        // Pick the two highest-temperature records, which must be on different days
        for (Temperature temperature : list) {
            year = temperature.getYear();
            month = temperature.getMonth();
            day = temperature.getDay();

            // First record seen for this year-month
            if (!flagMap.containsKey(year + "-" + month)) {
                cnt = 1;
                context.write(temperature, NullWritable.get());
                flagMap.put(year + "-" + month, day + "," + cnt);
            }

            // Year-month already seen: skip a record for the same day, consider one for a different day
            if (flagMap.containsKey(year + "-" + month)
                    && !day.equals(flagMap.get(year + "-" + month).split(",")[0])) {
                // Number of records counted so far for this year-month
                cnt = Integer.valueOf(flagMap.get(year + "-" + month).split(",")[1]);
                cnt += 1;
                // Fewer than 2 records have been written for this year-month so far
                if (cnt <= 2) {
                    context.write(temperature, NullWritable.get());
                }
                flagMap.put(year + "-" + month, day + "," + cnt);
            }
        }
    }
}
```

Implementation with a custom grouping comparator

Compared with the code above, using a custom grouping comparator first requires a class that extends the WritableComparator abstract class and provides a constructor and a compare method. The constructor must call the parent constructor, passing the type of the key being compared and a flag for whether to create instances. The compare method implements the logic that decides which records are placed in the same group.

Custom Output Format

When Reduce writes data in MapReduce, the OutputFormat class in use determines how the data is written. An OutputFormat obtains a RecordWriter through its getRecordWriter method, and the data is then written to the external system through RecordWriter.write(). The default output format is TextOutputFormat, which extends the abstract class FileOutputFormat, which in turn extends the top-level OutputFormat abstract class. TextOutputFormat writes records line by line to external text files named part-r-00000, part-r-00001, and so on. To change the output file names, define a class that extends FileOutputFormat and implement the corresponding methods.

Defining and using a custom OutputFormat

The steps are as follows:

1) Define a class that extends FileOutputFormat and implement the getRecordWriter method.
2) Return a custom RecordWriter from getRecordWriter. That class must extend RecordWriter and implement the logic for writing the data.
3) Set "job.setOutputFormatClass(YourOutputFormat.class)" in the Driver to use the custom OutputFormat.

Case

The case uses student score data from studentscore.txt. The StudentInfo entity class is as follows:

```java
import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/**
 * Student information
 */
public class StudentInfo implements WritableComparable<StudentInfo> {
    private String name;
    private int score;

    // No-arg constructor
    public StudentInfo() {
    }

    // Full constructor
    public StudentInfo(String name, int score) {
        this.name = name;
        this.score = score;
    }

    // Getters and setters
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getScore() { return score; }
    public void setScore(int score) { this.score = score; }

    @Override
    public String toString() {
        return "StudentInfo{" + "name='" + name + '\'' + ", score=" + score + '}';
    }

    // Serialization
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);
        out.writeInt(score);
    }

    // Deserialization
    @Override
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();
        score = in.readInt();
    }

    // The comparison operators were lost in the source text;
    // descending order by score (higher first) is assumed here
    @Override
    public int compareTo(StudentInfo o) {
        if (this.score > o.score) {
            return -1;
        } else if (this.score < o.score) {
            return 1;
        } else {
            return 0;
        }
    }
}
```

4) MyOutputFormat

MyOutputFormat must extend FileOutputFormat and implement the getRecordWriter method, returning a RecordWriter that performs the custom output.

```java
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class MyOutputFormat extends FileOutputFormat<StudentInfo, NullWritable> {
    // Return the RecordWriter
    @Override
    public RecordWriter<StudentInfo, NullWritable> getRecordWriter(TaskAttemptContext job)
            throws IOException, InterruptedException {
        // The job is passed in so the writer can create its output streams
        MyRecordWriter myRecordWriter = new MyRecordWriter(job);
        return myRecordWriter;
    }
}

class MyRecordWriter extends RecordWriter<StudentInfo, NullWritable> {
    private FSDataOutputStream passOutputStream;
    private FSDataOutputStream failOutputStream;

    // Create the output streams from the job configuration
    public MyRecordWriter(TaskAttemptContext job) throws IOException {
        FileSystem fileSystem = FileSystem.get(job.getConfiguration());
        // Output stream for passing scores
        passOutputStream = fileSystem.create(new Path("D:\\mapreduce\\pass.txt"));
        // Output stream for failing scores
        failOutputStream = fileSystem.create(new Path("D:\\mapreduce\\fail.txt"));
    }

    // Write a record; the comparison operator was lost in the source text, >= is assumed
    @Override
    public void write(StudentInfo key, NullWritable value) throws IOException, InterruptedException {
        int score = key.getScore();
        if (score >= 80) {
            passOutputStream.writeBytes(score + "\n");
        } else {
            failOutputStream.writeBytes(score + "\n");
        }
    }

    // Release resources
    @Override
    public void close(TaskAttemptContext context) throws IOException, InterruptedException {
        // Close both output streams
        IOUtils.closeStreams(passOutputStream, failOutputStream);
    }
}
```

5) Driver

In the Driver, "job.setOutputFormatClass(MyOutputFormat.class)" selects the custom OutputFormat, whose implementation specifies the files the data is written to. The path given to FileOutputFormat.setOutputPath(...) still receives the "_SUCCESS" marker file.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class ScoreDriver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // 1. Get the configuration and the job object
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // 2. Set the jar/class for the Driver program
        job.setJarByClass(ScoreDriver.class);

        // 3. Set the Mapper and Reducer classes
        job.setMapperClass(ScoreMapper.class);
        job.setReducerClass(ScoreReducer.class);

        // 4. Set the Mapper output key and value types
        job.setMapOutputKeyClass(StudentInfo.class);
        job.setMapOutputValueClass(Text.class);

        // 5. Set the final output key and value types
        job.setOutputKeyClass(StudentInfo.class);
        job.setOutputValueClass(NullWritable.class);

        // Use the custom OutputFormat
        job.setOutputFormatClass(MyOutputFormat.class);

        // 6. Set the input path and the output path
        FileInputFormat.setInputPaths(job, new Path("data/studentscore.txt"));
        // With the custom output class, results go to the paths it specifies;
        // the directory set here only receives the final _SUCCESS marker file
        FileOutputFormat.setOutputPath(job, new Path("output6/"));

        // 7. Run the job; returns true on success
        boolean success = job.waitForCompletion(true);
        if (success) {
            // Logic for a successful run
            System.out.println("Job succeeded");
        } else {
            // Logic for a failed run
            System.out.println("Job failed");
        }
    }
}
```
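The Combiner discussion earlier notes that averaging is a poor fit for a Combiner. The following is a minimal plain-Java sketch of why, not Hadoop code: the class `CombinerMeanDemo`, its helper names, and the sample numbers are all hypothetical. It compares averaging per-split means (what a naive "mean" Combiner would produce) with combining (sum, count) pairs and dividing once at the end.

```java
import java.util.List;

final class CombinerMeanDemo {

    /** A partial aggregate that is safe to combine: a running sum and a count. */
    static final class SumCount {
        final long sum;
        final long count;

        SumCount(long sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        SumCount merge(SumCount other) {
            return new SumCount(sum + other.sum, count + other.count);
        }

        double mean() {
            return (double) sum / count;
        }
    }

    /** Plain arithmetic mean of one split's values. */
    static double mean(List<Integer> values) {
        long sum = 0;
        for (int v : values) {
            sum += v;
        }
        return (double) sum / values.size();
    }

    /** What a naive "mean" combiner leads to: the mean of the per-split means. */
    static double meanOfMeans(List<List<Integer>> splits) {
        double total = 0;
        for (List<Integer> split : splits) {
            total += mean(split);
        }
        return total / splits.size();
    }

    /** Combine (sum, count) per split, then compute the mean once at the end. */
    static double meanViaSumCount(List<List<Integer>> splits) {
        SumCount acc = new SumCount(0, 0);
        for (List<Integer> split : splits) {
            long sum = 0;
            for (int v : split) {
                sum += v;
            }
            acc = acc.merge(new SumCount(sum, split.size()));
        }
        return acc.mean();
    }

    public static void main(String[] args) {
        // Two hypothetical map-side splits of unequal size
        List<List<Integer>> splits = List.of(List.of(10, 20, 30), List.of(50));
        System.out.println(meanOfMeans(splits));     // 35.0, not the true mean
        System.out.println(meanViaSumCount(splits)); // 27.5, the true mean of 10, 20, 30, 50
    }
}
```

The two results differ because the splits have different sizes. Summation and counting are associative and commutative, so they survive partial aggregation, which is why WordCount works with a Combiner while a direct mean does not.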
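The grouping-comparator section describes a compare method that decides which records share a group, separately from the sort order. The sketch below imitates that idea in plain Java without any Hadoop types, so it is only an illustration under assumptions: `GroupingDemo`, `Reading`, `SORT`, `GROUP`, and the sample data are hypothetical names, not part of the article's code. Records are sorted by year, month, and descending temperature, then a second comparator on year and month alone marks the group boundaries.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class GroupingDemo {

    /** Hypothetical stand-in for the Temperature key (plain Java, no Hadoop types). */
    static final class Reading {
        final String year;
        final String month;
        final String day;
        final int temp;

        Reading(String year, String month, String day, int temp) {
            this.year = year;
            this.month = month;
            this.day = day;
            this.temp = temp;
        }

        @Override
        public String toString() {
            return year + "-" + month + "-" + day + "\t" + temp;
        }
    }

    /** Sort order: year and month ascending, then temperature descending. */
    static final Comparator<Reading> SORT = (a, b) -> {
        int c = a.year.compareTo(b.year);
        if (c != 0) return c;
        c = a.month.compareTo(b.month);
        if (c != 0) return c;
        return Integer.compare(b.temp, a.temp);
    };

    /** Grouping order: only year and month decide whether two records share a group. */
    static final Comparator<Reading> GROUP = (a, b) -> {
        int c = a.year.compareTo(b.year);
        return c != 0 ? c : a.month.compareTo(b.month);
    };

    /** Emits up to two hottest readings per year-month, on different days. */
    static List<String> topTwoPerMonth(List<Reading> input) {
        List<Reading> sorted = new ArrayList<>(input);
        sorted.sort(SORT);
        List<String> out = new ArrayList<>();
        Reading groupHead = null; // first record of the current group
        String firstDay = null;   // day of the first record emitted in this group
        int emitted = 0;
        for (Reading r : sorted) {
            if (groupHead == null || GROUP.compare(groupHead, r) != 0) {
                groupHead = r; // a new group starts here
                firstDay = null;
                emitted = 0;
            }
            if (emitted == 0) {
                out.add(r.toString());
                firstDay = r.day;
                emitted = 1;
            } else if (emitted == 1 && !r.day.equals(firstDay)) {
                out.add(r.toString());
                emitted = 2;
            }
        }
        return out;
    }

    /** Hypothetical sample data. */
    static List<Reading> sample() {
        return List.of(
                new Reading("2020", "01", "05", 30),
                new Reading("2020", "01", "05", 28),
                new Reading("2020", "01", "09", 25),
                new Reading("2020", "02", "01", 10),
                new Reading("2019", "12", "31", -3));
    }

    public static void main(String[] args) {
        topTwoPerMonth(sample()).forEach(System.out::println);
    }
}
```

Because each group arrives already sorted by descending temperature, the per-group logic only needs the first record plus the next record from a different day. In a real job the same split would presumably be expressed as the key's compareTo for sorting and a WritableComparator subclass for grouping, which removes the need to buffer a whole partition in a list.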
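The custom output format above routes each score to one of two files. The routing decision itself can be sketched and exercised without a cluster by writing to in-memory `java.io.Writer` objects instead of HDFS streams. This is a hedged illustration only: `RoutingWriterDemo`, `ScoreRouter`, and `PASS_THRESHOLD` are hypothetical names, and treating a score of exactly 80 as a pass is an assumption.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

final class RoutingWriterDemo {

    /** Assumed pass mark; scores at or above it go to the "pass" sink. */
    static final int PASS_THRESHOLD = 80;

    /** Mirrors the shape of a record writer: write each record, then close both sinks. */
    static final class ScoreRouter implements AutoCloseable {
        private final Writer pass;
        private final Writer fail;

        ScoreRouter(Writer pass, Writer fail) {
            this.pass = pass;
            this.fail = fail;
        }

        void write(int score) throws IOException {
            Writer target = score >= PASS_THRESHOLD ? pass : fail;
            target.write(score + "\n");
        }

        @Override
        public void close() throws IOException {
            pass.close();
            fail.close();
        }
    }

    /** Routes the scores and returns {passContent, failContent}. */
    static String[] route(int... scores) {
        StringWriter pass = new StringWriter();
        StringWriter fail = new StringWriter();
        try (ScoreRouter router = new ScoreRouter(pass, fail)) {
            for (int score : scores) {
                router.write(score);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String[] {pass.toString(), fail.toString()};
    }

    public static void main(String[] args) {
        String[] result = route(95, 42, 80, 79);
        System.out.print("pass:\n" + result[0]);
        System.out.print("fail:\n" + result[1]);
    }
}
```

Keeping the routing rule separate from the file handling makes it easy to check in isolation. In the Hadoop version, one possible refinement is to derive the two file paths from FileOutputFormat.getOutputPath(job) rather than hard-coding a local drive path, so the results land next to the "_SUCCESS" marker.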
