Hadoop Professionals

A Community for Hadoop Users

Elton Tian

How does Hadoop MR delete map output files on the reducer's local disk during cleanup?

Hello all,

I am going through the source code of the MapReduce part. For experimental purposes, I am trying to retain the temporary directories created on a node's local file system while a MapReduce job is running, i.e. "map.local.dir" + mapred/local/tasktracker/jobcache/job_xxxxxxxx/ .

So I commented out some of the cleanup functions, such as TaskTracker.TaskInProgress.cleanup(), TaskTracker.startcleanupThreads() and Task.taskcleanup(). With those disabled, I can retain all the attempt folders, the jars folder, the work folder and job.xml.

The problem is that the output folder in each reduce attempt folder is always empty when the job finishes. That folder is supposed to contain all the map outputs pulled by the reduce task. I dug into the source code and found that the problem comes from the execution of the reduce function. In ReduceTask.runOldReducer(), there is a while loop that goes through all the keys in the ReduceValuesIterator and executes the reduce function I defined. If I comment this loop out, the map output files stay in the output folder. Otherwise they are deleted...

This seems weird. I have no idea how this folder gets cleaned while reduce is running rather than in the cleanup phase, and I couldn't find any code that does this. Does anyone have a better idea?

Cheers,
Elton


Tags: cleanup, map, output


Replies to This Discussion

PS: I have tried setting "keep.task.files.pattern" to "attemp_*" in order to keep all the attempt folders in the local/tasktracker/jobcache folder. Still the same...
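
A side note on the pattern itself: as far as I know, keep.task.files.pattern is documented as a regular expression matched against task names, not as a shell-style glob. The sketch below only illustrates that difference with plain java.util.regex. It assumes the whole task attempt ID has to match the pattern, and the attempt ID used here is a made-up example, not one taken from a real job. It does not show how the TaskTracker applies the setting, and it may well be unrelated to the map outputs disappearing during reduce.

```java
import java.util.regex.Pattern;

public class KeepPatternCheck {

    // Assumption: the configured pattern is treated as a Java regex and
    // must match the entire task attempt ID, not just a prefix of it.
    public static boolean keeps(String pattern, String taskAttemptId) {
        return Pattern.matches(pattern, taskAttemptId);
    }

    public static void main(String[] args) {
        // Hypothetical attempt ID, for illustration only.
        String id = "attempt_201201010000_0001_r_000000_0";

        // As a regex, "_*" means "zero or more underscores", so neither of
        // these glob-style patterns matches the full ID.
        System.out.println(keeps("attemp_*", id));
        System.out.println(keeps("attempt_*", id));

        // ".*" is the regex form of "anything", so these match the full ID.
        System.out.println(keeps("attempt_.*", id));
        System.out.println(keeps(".*_r_.*", id));
    }
}
```

If the setting really is matched this way, a value such as "attempt_.*" or ".*_r_.*" would be the regex equivalent of what the glob was meant to express.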


© 2012   Created by Jason Venner.
