As I understand it, when the map function generates intermediate output, the reduce
function pulls that data directly from the local disks of all the map nodes. Although we
can use a combiner function to reduce the amount of data, when we have many mappers,
say 10,000, that will be a crazy I/O headache. And that doesn't seem
right.
I regularly run jobs with 20k map tasks.
The shuffle can take quite a while, and if the jobs pass a lot of data to the reduce phase, it loads down the networking layer substantially.
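
To put rough numbers on this, here is a minimal Python sketch (not Hadoop code) of what a combiner saves and how the number of shuffle fetches grows. It assumes a word-count-style job where values can simply be summed, and it assumes every reducer fetches one partition from every map output. The function names and the figure of 100 reducers are made up for illustration.

```python
from collections import Counter

def map_words(line):
    """Map step: emit one (word, 1) pair per word in the line."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combiner: sum the values per key on the map side, before the shuffle."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

def shuffle_fetches(num_maps, num_reduces):
    """Assumed model: each reducer fetches its partition from every map output."""
    return num_maps * num_reduces

raw = map_words("a b a c a b")
combined = combine(raw)
print(len(raw), "records before combining,", len(combined), "after")
print(shuffle_fetches(10_000, 100), "fetches for 10,000 maps and 100 reducers")
```

The combiner shrinks each map's output, but under this model it does not change the number of fetches, which is the product of the map and reduce task counts.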
Hmmm... so I think that point is worth some optimization. Maybe the combiner can be extended a bit to, say, rack level, so that intermediate output produced by nodes on the same rack can be merged and stored (somewhere?) before being pulled by reducers.
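
A rough Python sketch of what I mean by a rack-level combiner, purely as a thought experiment. This is not an existing Hadoop feature as far as this thread establishes. The function `merge_by_rack` and the node and rack names are hypothetical, and the sketch assumes the combine operation is a plain sum (associative and commutative), so merging a second time is safe.

```python
from collections import Counter, defaultdict

def merge_by_rack(map_outputs, rack_of):
    """Merge per-node combined outputs into one output per rack.

    map_outputs: {node: {key: count}}  (already combined on each node)
    rack_of:     {node: rack}          (hypothetical topology mapping)
    """
    per_rack = defaultdict(Counter)
    for node, counts in map_outputs.items():
        # Counter.update adds counts for keys that are already present.
        per_rack[rack_of[node]].update(counts)
    return {rack: dict(counts) for rack, counts in per_rack.items()}

map_outputs = {
    "n1": {"a": 3, "b": 1},
    "n2": {"a": 2},
    "n3": {"b": 4},
}
rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}

merged = merge_by_rack(map_outputs, rack_of)
print(len(map_outputs), "sources before merging,", len(merged), "after")
```

In this toy case a reducer would pull from 2 rack-level outputs instead of 3 node-level ones. Where the merged output would live is still the open question.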