We have lots of servers but have limited storage pool. My map jobs are handle lots of small input files (approx 300Mb Compressed) but the reduce input is huge ( about 100Gb) requiring lots of temporary and local storage. I would like to divide my server pool into two kinds - one set with a small disks ( for map jobs) and a few with big storage ( for the combine and reduce jobs).
Is there something I can do that lets me force the reduce job to run on a specific nodes?
I have done google searching and searching through some forums but not found.
There does not appear to be any simple way to split the pools.
I have had similar job constraints - building solr indexes in the reduce step.
There may be a way to do this with the scheduling system.
The first solution that comes to mind is to run 2 mapreduce clusters on the same set of machines - this will require reconfiguring the port space.
What I might suggest, if you have full control over your cluster configuration is to run break your reduce into 2 parts.
The first part simply outputs the ordered files ala identity reduce, this job runs across all of your machines.
The job shouldn't have a large on disk foot print, the reduce output is written to HDFS.
The portion of your original reduce that actually did the work, becomes a stand alone map task, which you launch through the separate mapreduce cluster that only runs on the nodes that have space.
You can use the mapred.local.dir.minspacestart parameter which appears to still be present in 0.20.1 to prevent tasks from starting on nodes that have insufficient local space.
Thanks Jason. I do have full control over my cluster , so I did do what you suggested, split the cluster into two and run two map/reduce jobs. It was a nuisance but worked.