Hadoop Professionals

A Community for Hadoop Users

What happens when there is a large number of maps?

Hello everyone,

I know that when the map function generates intermediate output, the reduce function pulls that data directly from the local disks of all the maps. Although we can use
a combiner function to reduce the amount of data, when we have many mappers,
say 10,000, that will be a crazy I/O headache. And that doesn't seem
right.
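
To put a rough number on the concern, here is a back-of-the-envelope sketch. It assumes every reducer fetches one partition from every map task's output; the reducer count of 100 and the function name `shuffle_fetches` are made up for illustration and are not from any real job.

```python
def shuffle_fetches(num_maps: int, num_reduces: int) -> int:
    """Count map-output fetches, assuming each reducer pulls
    its own partition from every map task (hypothetical model)."""
    return num_maps * num_reduces


# 10,000 maps as in the question; 100 reducers is an assumed figure.
print(shuffle_fetches(10_000, 100))  # 1000000
```

Under that assumption the number of fetches grows with maps times reducers, which is why the map count alone already looks worrying.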


Can anyone enlighten me on this?

Regards,
Elton


Replies to This Discussion

I regularly run jobs with 20k map tasks.
The shuffle can take quite a while, and if the jobs pass a lot of data to the reduce phase, they load down the networking layer substantially.

It does just work though.
Thanks for the reply, Jason,

Hmmm... so I think that point is worth some optimization. Maybe the combiner could be extended a bit to, say, the rack level, so that intermediate output produced by nodes on the same rack is merged and stored (somewhere?) before being pulled by the reducers.
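
A toy simulation of what I mean, not an existing Hadoop feature. It assumes a word-count style job where the combiner is a plain sum (so merging is associative and commutative); the node names, rack names, and the helper `rack_level_combine` are all hypothetical.

```python
from collections import Counter, defaultdict


def rack_level_combine(node_outputs, node_to_rack):
    """Merge per-node (key -> count) map outputs into one Counter per rack.

    Only valid when the combiner is a sum-like operation.
    """
    per_rack = defaultdict(Counter)
    for node, counts in node_outputs.items():
        # Counter.update adds the counts for keys that already exist.
        per_rack[node_to_rack[node]].update(counts)
    return dict(per_rack)


# Hypothetical combined map output on three nodes spread over two racks.
node_outputs = {
    "node1": Counter({"a": 3, "b": 1}),
    "node2": Counter({"a": 2, "c": 4}),
    "node3": Counter({"b": 5}),
}
node_to_rack = {"node1": "rack1", "node2": "rack1", "node3": "rack2"}

merged = rack_level_combine(node_outputs, node_to_rack)

# Records the reducers would have to pull, before and after the rack merge.
records_before = sum(len(c) for c in node_outputs.values())
records_after = sum(len(c) for c in merged.values())
print(records_before, records_after)
```

In this sketch the reducers would also pull from one place per rack instead of one place per node, at the cost of an extra merge step and somewhere to store its result.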

What do you think?

Elton


© 2012   Created by Jason Venner.
