[Further Study] [Big data] Map reduce question

1#
Posted on 8-4-2016 08:04:02

Good morning,

I read the article at http://www.thegeekstuff.com/2014/05/map-reduce-algorithm/ and have a question about its first example.
------------------------------------------------------------------------
Mapping Phase

So our map phase of our algorithm will be as follows:

1. Declare a function “Map”
2. Loop: For each word equal to “football”
3. Increment counter
4. Return key value “football”=>counter

Reducing Phase

The reducing function will accept the input from all these mappers in the form of key-value pairs and then process it. So the input to the reduce function will look like the following:

reduce(“football”=>2)
reduce(“Olympics”=>3)
Our algorithm will continue with the following steps:

5. Declare a function reduce to accept the values from map function.
6. For each key-value pair, add the value to the counter.
7. Return “games”=> counter.

At the end, we will get the output like “games”=>5.
------------------------------------------------------------------------
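To make the two phases concrete, here is a minimal Python sketch of the article's pseudocode (a simulation only, not the Hadoop implementation; the function and variable names are my own):

```python
# Minimal sketch of the article's two-phase word count.
# Each mapper counts occurrences of one target word in its input split
# and emits a key-value pair; the reducer sums all values under "games".

def map_phase(words, target):
    """Count occurrences of `target` and emit (target, counter)."""
    counter = 0
    for word in words:
        if word == target:
            counter += 1
    return (target, counter)

def reduce_phase(pairs):
    """Accept key-value pairs from all mappers and fold them into "games"."""
    counter = 0
    for _key, value in pairs:
        counter += value
    return ("games", counter)

# Two mapper outputs, matching the article's reduce("football"=>2),
# reduce("Olympics"=>3) example:
pairs = [map_phase(["football"] * 2, "football"),
         map_phase(["Olympics"] * 3, "Olympics")]
print(reduce_phase(pairs))  # ('games', 5)
```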

My question is: why not map directly to “games”? That would be more efficient:

Mapping Phase

So our map phase of our algorithm will be as follows:

1. Declare a function “Map”
2. Loop: For each word equal to “football”
3. Increment counter
4. Return key value “games”=>counter

TIA


2#
Posted on 8-4-2016 13:56:20
For this specific problem, mapping directly to GAME does save space, but the author's point is to get you to appreciate the (K1,V1) -> (K2,V2) -> (K3,V3) transformation, which is the standard MapReduce pattern. The simplified version is of course (K1,V1) -> (K3,V3), i.e. mapping straight to GAME.
Also, keep in mind that MAP is meant to preserve the original semantics of the data; the output semantics are only introduced after the SHUFFLE, in the COMBINE and REDUCE phases, so the example's arrangement follows the conventional practice. The intermediate MAP results often need to be JOINed or OUTER JOINed, so keeping the intermediate results keyed by K2 makes data processing more flexible.

A quiz for you: do you know what K1, K2, and K3 are in this example?
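One way to see the (K1,V1) -> (K2,V2) -> (K3,V3) chain end to end is a small Python simulation (a sketch only; with Hadoop's default TextInputFormat, K1 would be the byte offset of the line and V1 the line text, which is what the toy `lines` dict below stands in for):

```python
# Sketch of the key-value transformations in the word-count example.
# K1 = byte offset of the input line, V1 = line text   (map input)
# K2 = the matched word, V2 = a count                  (map output)
# K3 = "games", V3 = total count                       (reduce output)
from collections import defaultdict

lines = {0: "football football", 18: "Olympics Olympics Olympics"}

# map: (K1, V1) -> list of (K2, V2)
intermediate = []
for offset, text in lines.items():
    for word in text.split():
        intermediate.append((word, 1))

# shuffle: group all V2 values by K2
grouped = defaultdict(list)
for key, value in intermediate:
    grouped[key].append(value)

# reduce: (K2, [V2, ...]) -> (K3, V3)
total = sum(sum(values) for values in grouped.values())
result = ("games", total)
print(result)  # ('games', 5)
```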

3#
OP | Posted on 8-4-2016 14:32:47
Last edited by DDD888 on 8-4-2016 13:59
michaelsusu posted on 8-4-2016 12:56:
For this specific problem, mapping directly to GAME does save space, but the author's point is to get you to appreciate (K1,V1) -> (K2,V2) -> (K3 ...


The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.

The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
Input and Output types of a MapReduce job:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html

4#
OP | Posted on 8-4-2016 21:00:28
Last edited by DDD888 on 8-4-2016 20:02

A follow-up question: what if I want to compute an average?

Your Mappers read the text file and apply the following map function on every line

map: (key, value)
  time = value[2]
  emit("1", time)
All map calls emit the key "1" which will be processed by one single reduce function

reduce: (key, value)
  result = sum(value) / n
  emit("1", result)
Since you're using Hadoop, you have probably seen the use of StringTokenizer in the map function; you can use it to extract just the time from each line. You can also think about ways to compute n (the number of processes); for example, you could use a Counter in another job that just counts lines.
http://stackoverflow.com/questio ... job-to-find-average
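Run outside Hadoop, the quoted pseudocode could be simulated like this (a hedged sketch; here n is supplied externally, since the quoted answer suggests computing it in a separate counting job, and the time is assumed to be the third whitespace-separated field, matching `value[2]`):

```python
# Simulation of the quoted map/reduce average, outside Hadoop.
# map emits ("1", time) for every line; the single reducer divides by n.

def map_fn(value):
    """value is one whitespace-separated log line; time is field index 2."""
    return ("1", float(value.split()[2]))

def reduce_fn(values, n):
    """values: all times emitted under key "1"; n comes from a separate job."""
    return ("1", sum(values) / n)

lines = ["p1 start 2.0", "p2 start 4.0", "p3 start 9.0"]
emitted = [map_fn(line) for line in lines]
n = len(lines)  # in the answer's design, a separate counting job produces n
print(reduce_fn([v for _k, v in emitted], n))  # ('1', 5.0)
```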

Based on the answer at that link, I would need to write two jobs: one job computes the running total, the other counts how many records there are, and then I divide the two. But I wonder: two jobs reading the same file, isn't that too inefficient? What if the file were several thousand TB?

My test data file is only 69 KB, so I can't observe any slowness either way.
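For what it's worth, the two-pass design can usually be avoided: each mapper can emit a (time, 1) pair under a single key, and the reducer can accumulate the sum and the count together, so the file is read only once. A minimal Python sketch of that idea (names are my own, not Hadoop API calls; the time is again assumed to be field index 2):

```python
# One-pass average: each mapper emits (time, 1) under a single key,
# and one reducer folds the pairs into (total, n) and returns total / n.

def mapper(line):
    """Parse one log line and emit ("1", (time, 1))."""
    time = float(line.split()[2])
    return ("1", (time, 1))

def reducer(values):
    """Fold (time, 1) pairs into the average in a single pass."""
    total = 0.0
    n = 0
    for time, count in values:
        total += time
        n += count
    return ("1", total / n)

log_lines = ["p1 ok 4.0", "p2 ok 6.0", "p3 ok 8.0"]
pairs = [mapper(line) for line in log_lines]
print(reducer(value for _key, value in pairs))  # ('1', 6.0)
```

The same trick scales to a real cluster because (sum, count) pairs can also be merged in a combiner before the shuffle.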

5#
Posted on 8-4-2016 23:27:36
DDD888 posted on 8-4-2016 20:00:
A follow-up question: what if I want to compute an average?

Your Mappers read the text file and apply the following map function ...


After all that, you're testing with a 69 KB file? Not very convincing...

6#
OP | Posted on 9-4-2016 07:49:28
ubuntuhk posted on 8-4-2016 22:27:
After all that, you're testing with a 69 KB file? Not very convincing...

That's the only log file of this size I have available to test with.

It should be enough for learning to write programs. The demos online use the /etc/passwd file for testing, which is even smaller; compared with that, my 69 KB file counts as large.
