Hadoop Professionals

A Community for Hadoop Users

Do all mappers finish before the reducers start?

I have a conceptual question. My understanding is that all the mappers have to complete before the reducers can start working: mappers don't know about each other, so the values for a given key can come from any of them, and we have to wait until all mappers have collectively emitted every value for a key before that key can be passed to a reducer.
But when I run these jobs, the reducers almost always start working before the mappers are all done, so the output says something like map 60% reduce 30%. How does this work?
Does it find all possible values for a single key from all mappers, pass them to the reducer, and then work on the other keys?
Any help is appreciated.

Tags: hadoop

Replies to This Discussion

A reducer cannot start until all of the data that will be its input is fully ordered.
The ordering cannot complete until all of the map tasks have finished, as any map may have data that will go to any reducer (reduce task).

The reduce task often starts at job start, but the first call to the user's reduce method will only happen after all of the map tasks have completed.
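To illustrate why the reduce method has to wait, here is a minimal sketch in plain Java. It is not Hadoop code: the class `ShuffleSketch` and its methods are hypothetical names, and each mapper's output is modeled as a simple map from key to one value. It groups values by key across mappers and shows that a sum computed before the last mapper has reported is incomplete.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, framework-free model of the shuffle; not the Hadoop API.
class ShuffleSketch {

    // Collect the values for every key across the given mapper outputs.
    // The TreeMap keeps keys sorted, mimicking the ordered reducer input.
    static Map<String, List<Integer>> groupByKey(List<Map<String, Integer>> mapOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> oneMapper : mapOutputs) {
            for (Map.Entry<String, Integer> e : oneMapper.entrySet()) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        return grouped;
    }

    // Stand-in for a user's reduce method: sum all values of one key.
    static int reduceSum(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> mapper1 = Map.of("apple", 2, "pear", 1);
        Map<String, Integer> mapper2 = Map.of("apple", 1);
        Map<String, Integer> mapper3 = Map.of("apple", 4, "plum", 5);

        // Grouping after only two of three mappers misses mapper3's value for "apple".
        Map<String, List<Integer>> partial = groupByKey(List.of(mapper1, mapper2));
        Map<String, List<Integer>> complete = groupByKey(List.of(mapper1, mapper2, mapper3));

        System.out.println("apple after 2 of 3 maps: " + reduceSum(partial.get("apple")));
        System.out.println("apple after 3 of 3 maps: " + reduceSum(complete.get("apple")));
    }
}
```

Any mapper can emit any key, so the only safe moment to call reduce for a key is after every mapper's output has been merged.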
Thanks for your reply, Jason, that makes perfect sense. The only thing is that when the job is running, the reduce task starts running before the map tasks reach 100%, so you see something like this in the terminal:
***** map 0% reduce 0%
***** map 30% reduce 0%
***** map 60% reduce 20%
***** map 80% reduce 40%
***** map 100% reduce 60%
***** map 100% reduce 80%
***** map 100% reduce 100%

So, as you can see, the reducer started working when the map tasks were still at 60%.
My other question: I noticed that, for the same amount of data, Hadoop runs faster if the data is merged into one big file instead of a bunch of small files. What exactly is the reason for that?

Thanks again for your help


Jason Venner said:
A reducer cannot start until all of the data that will be its input is fully ordered.
The ordering cannot complete until all of the map tasks have finished, as any map may have data that will go to any reducer (reduce task).

The reduce task often starts at job start, but the first call to the user's reduce method will only happen after all of the map tasks have completed.
For all intents and purposes, your reduce doesn't start until the reduce % hits 60%.
The parts that run prior to that are involved in preparing the data for your reduce tasks.
The job output presents this information in a confusing way.
The reduce operation has three stages: copy, sort, and then the actual reduce, which is performed on the copied and sorted data. Copy and sort can start even before the mappers have finished.
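As a rough sketch of how those stages could map onto the reduce percentage, the plain Java below assumes that a single reduce task weights copy, sort, and reduce equally, one third each. That weighting matches commonly described Hadoop behavior but is an assumption here, and `ReduceProgressSketch`, `overallProgress`, and `phaseAt` are hypothetical names, not Hadoop APIs.

```java
// Hypothetical model of one reduce task's progress; not the Hadoop API.
class ReduceProgressSketch {

    enum Phase { COPY, SORT, REDUCE }

    // Overall task progress in [0, 1], assuming three equally weighted phases.
    static double overallProgress(Phase phase, double fractionOfPhaseDone) {
        double clamped = Math.max(0.0, Math.min(1.0, fractionOfPhaseDone));
        return (phase.ordinal() + clamped) / 3.0;
    }

    // Phase implied by an overall progress value under the same assumption.
    static Phase phaseAt(double overall) {
        if (overall < 1.0 / 3.0) {
            return Phase.COPY;
        }
        if (overall < 2.0 / 3.0) {
            return Phase.SORT;
        }
        return Phase.REDUCE;
    }

    public static void main(String[] args) {
        // Percentages like the ones printed in the job output.
        for (int percent : new int[] {20, 40, 60, 80}) {
            System.out.println("reduce " + percent + "% -> " + phaseAt(percent / 100.0));
        }
    }
}
```

Under this assumption the user's reduce method only runs in the final third, which is in line with the "about 60%" rule of thumb. The percentage printed for a whole job is aggregated over all reduce tasks, so it only approximates the phase of any single task.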

Jason Venner said:
For all intents and purposes, your reduce doesn't start until the reduce % hits 60%.
The parts that run prior to that are involved in preparing the data for your reduce tasks.
The job output presents this information in a confusing way.
Based on my understanding and experience, the reducers start before the mappers finish,
because there are four stages in "reduce": the copy phase, append phase, sort phase, and reduce phase.
Strictly speaking, the append phase starts after all mappers finish, but the copy phase starts at the same time as the mappers.
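How early the copy phase can begin is configurable. Hadoop has a slow-start setting (`mapreduce.job.reduce.slowstart.completedmaps` in newer releases, `mapred.reduce.slowstart.completed.maps` in older ones) that gives the fraction of map tasks that must complete before reduce tasks are scheduled. The plain Java sketch below only models that idea: `SlowstartSketch` and `reducersMayLaunch` are hypothetical names, and the rounding rule is an assumption rather than a copy of the scheduler's logic.

```java
// Hypothetical model of the reduce slow-start threshold; not the Hadoop API.
class SlowstartSketch {

    // True once enough map tasks have completed for reduce tasks to be scheduled.
    // Assumption: the required count is the fraction of all maps, rounded up.
    static boolean reducersMayLaunch(int completedMaps, int totalMaps, double slowstartFraction) {
        int required = (int) Math.ceil(slowstartFraction * totalMaps);
        return completedMaps >= required;
    }

    public static void main(String[] args) {
        int totalMaps = 10;
        for (double fraction : new double[] {0.5, 1.0}) {
            // Find the first completed-map count at which reducers may launch.
            for (int done = 0; done <= totalMaps; done++) {
                if (reducersMayLaunch(done, totalMaps, fraction)) {
                    System.out.println("slowstart " + fraction + ": reducers may launch after "
                            + done + " of " + totalMaps + " maps");
                    break;
                }
            }
        }
    }
}
```

With a fraction of 1.0 in this model, no reduce task is scheduled until every map has finished, at the cost of losing the overlap between the map phase and the copy phase.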
I am writing a temporary file from the map function and would like to read it in the configure function of the Reduce class after all the maps are finished.

My question is: how do I make configure run after all the maps?

© 2012   Created by Jason Venner.
