Hadoop Professionals

A Community for Hadoop Users

wang zhengkui

Wang zhengkui's Friends

  • Sunil Kumar Singh
  • Paulo Henrique Ramos
  • Jason Venner

wang zhengkui's Groups

wang zhengkui's Discussions

Failed to report status error solution

Hi, I am facing a problem during job execution. The exception is: "Task attempt_20110404183_0009_r_0000_0 failed to report status for 600 seconds. Killing!" I think the exception is because I…

Started Apr 18, 2011

Is there any difference between the following two cases?
1 Reply

My application is computation-heavy: the data to be processed is not large, but the computation is complicated. What I would like to ask is if I set one…

Started this discussion. Last reply by Jason Venner May 15, 2010.

An error occurs when did long time processing

Dear all, when I use my program to process a small amount of data, it works well. But when I use it for longer-running processing, it throws errors. Does anyone have any idea why this…

Started May 13, 2010

How many reducers should I set suitably?
1 Reply

For a job where I know the data size, how many reducers should I set? How do I calculate a suitable number of reducers in my program? Is there a formula for it? Thanks!

Started this discussion. Last reply by Jason Venner Oct 30, 2009.

wang zhengkui's Page

Latest Activity

Failed to report status error solution

Hi, I am facing a problem during job execution. The exception is: "Task attempt_20110404183_0009_r_0000_0 failed to report status for 600 seconds. Killing!" I think the exception occurs because I have very time-consuming operations in the close() function of the reducers. One solution, I think, is to raise "mapred.task.timeout" to a larger value. But I am still wondering: is there any other way to solve this problem? Thanks. My application requirement is: I need to get all the…
Discussion posted by wang zhengkui Apr 18, 2011
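
Besides raising mapred.task.timeout, a common way to avoid this kill is to report progress from inside the long-running work, so the framework keeps seeing the task as alive. The sketch below is plain Java, not Hadoop code: `Progressable` is a local stand-in for the framework's progress callback, and `processAll`, `reportEvery`, and the class name are hypothetical names chosen for illustration. In a real old-API reducer the callback would presumably be the Reporter handed to reduce(), saved in a field so that close() can still use it; treat that as an assumption to check against your Hadoop version.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

final class CloseWithProgress {
    /** Hypothetical stand-in for the framework's progress callback. */
    public interface Progressable {
        void progress();
    }

    /**
     * Applies work to every item and calls reporter.progress() after each
     * reportEvery items. Returns how many progress reports were sent.
     */
    static <T> int processAll(List<T> items, Consumer<T> work,
                              Progressable reporter, int reportEvery) {
        if (reportEvery <= 0) {
            throw new IllegalArgumentException("reportEvery must be positive");
        }
        int done = 0;
        int reports = 0;
        for (T item : items) {
            work.accept(item);          // the expensive per-item step
            done++;
            if (done % reportEvery == 0) {
                reporter.progress();    // tell the framework we are still alive
                reports++;
            }
        }
        return reports;
    }

    public static void main(String[] args) {
        AtomicInteger beats = new AtomicInteger();
        List<Integer> buffered = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        // The no-op lambda stands in for the real time-consuming operation.
        int reports = processAll(buffered, x -> { }, beats::incrementAndGet, 3);
        System.out.println("progress reports sent: " + reports);
    }
}
```

Reporting every N items, rather than from a background timer, means a task that truly hangs still times out, which is the point of the timeout.
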
wang zhengkui is now friends with Paulo Henrique Ramos and Sunil Kumar Singh Apr 4, 2011
Jason Venner replied to wang zhengkui's discussion 'Is there any difference between the following two cases?'
Tuning the reduce phase on a cluster is not a trivial problem. The common case is that the reduce phase is primarily disk I/O bound, and you run roughly one reduce per seek arm on a machine. If disk I/O is not the limiting factor, you have a…
May 15, 2010
Discussions posted by wang zhengkui May 13, 2010

HBase Users

A group for HBase users to share use cases, solutions and problems.
wang zhengkui joined Jason Venner's group Feb 3, 2010
Jason Venner replied to wang zhengkui's discussion 'How many reducers should I set suitably?'
Generally speaking, the number of reducers you choose depends on what you are going to do with the final output, the reduce capacity of your cluster, the amount of data needing to be reduced, and the time needed to perform the reduce. For me I…
Oct 30, 2009

How many reducers should I set suitably?

Discussion posted by wang zhengkui Oct 22, 2009
Jason Venner replied to wang zhengkui's discussion 'How to disable sort in hadoop'
If your number of reduce tasks is not 0, the Hadoop framework will sort your results. There is no way around it.
Oct 9, 2009
wang zhengkui replied to wang zhengkui's discussion 'How to disable sort in hadoop'
Thanks Jason. Does this mean that if my numReduceTask is not 0, Hadoop must sort the intermediate result? If my number of reducers is not 0, is there any way to stop it from sorting my intermediate result?
Oct 9, 2009
Jason Venner replied to wang zhengkui's discussion 'How to disable sort in hadoop'
If you set the number of reduce tasks to 0, there will be no sorting. There will also be no reduce phase. In Hadoop through 0.19, the JobConf object provides a method setNumReduceTasks, and the parameter behind it is mapred.reduce.tasks. I do not…
Oct 8, 2009

How to disable sort in hadoop

Dear all, in my application I do not need Hadoop to sort the intermediate result for me. How can I disable the sort in the application? Sorting takes time, and I actually don't want the data to be sorted. Thanks!
Discussion posted by wang zhengkui Oct 8, 2009
Jason Venner left a comment for wang zhengkui
Is there a reason you want to get multiple map outputs to a single reduce task? Do you want the data to be fully sorted and grouped by key? The simplest way is to change the partitioner class so that you get all of the data you want in one single…
Oct 6, 2009
amogh vasekar replied to wang zhengkui's discussion 'Two requirements for Hadoop'
To write into multiple partitions, please look at Pig's skewed join implementation of the partitioner. I believe they do something pretty similar. However, from 0.20 onwards reducers will have to be set. Hence, it might break your implementation. Coming…
Sep 23, 2009
Jason Venner replied to wang zhengkui's discussion 'Two requirements for Hadoop'
What you can do is, in your mapper, open additional files beyond the input, which you may write anywhere. As an alternative, you could write all of your map outputs via a MultipleFileOutput format in the map task, and only output the filenames to the…
Sep 22, 2009

Two requirements on Hadoop

There are two requirements that I want to implement on Hadoop, but as far as I can tell Hadoop does not support them yet. I am looking forward to your suggestions on how to implement them. Firstly, can the reducers fetch more than one partition file from the map output? For instance, reducer 1 currently fetches all of partition 1 from the mappers; how can I make reducer 1 fetch all of partition 1 and also partition 2? If that is possible, how could I implement…
Blog post by wang zhengkui Sep 16, 2009

Two requirements for Hadoop

Discussion posted by wang zhengkui Sep 16, 2009

Profile Information

Hadoop Experience Level
Intermediate
Available for Consulting
Yes

Wang zhengkui's Blog

wang zhengkui

Two requirements on Hadoop

There are two requirements that I want to implement on Hadoop, but as far as I can tell Hadoop does not support them yet. I am looking forward to your suggestions on how to implement them.

Firstly, can the reducers fetch more than one partition file from the map output? For instance, reducer 1 currently fetches all of partition 1 from the mappers; how can I make reducer 1 fetch all of partition 1 and also partition 2? If that is possible, how could I…

Posted on September 16, 2009 at 7:04am

Comment Wall (1 comment)

At 11:50pm on October 5, 2009, Jason Venner said…
Is there a reason you want to get multiple map outputs to a single reduce task?
Do you want the data to be fully sorted and grouped by key?

The simplest way is to change the partitioner class so that you get all of the data you want in one single map output.

There is nothing stopping you from creating multiple output files in HDFS or another shared file system in your map tasks, and passing the names of these files to your reducer via the output collector, or some other mechanism.

You would lose out on the framework handling the sorting for you.

Another alternative, somewhat I/O expensive, is to have two map/reduce jobs, one of which has only a map output and the other only a reduce, where the partitioner assigns the reduce task based on the output file from the previous job, such that you get all of the outputs you want in each of your reduces.

The downside of this is that the output data comes from HDFS through the map task back to HDFS, then back into the identity mapper, through the local disk, then over HTTP to the reducers, so you have two extra passes through HDFS and an extra pass through the identity mapper.
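
The partitioner suggestion above can be sketched as a lookup table from logical partition to reduce task. This is plain Java under stated assumptions, not the Hadoop Partitioner API: `reducerFor` and the `partitionToReducer` table are hypothetical names, and the hash step mirrors the usual hash-and-modulo scheme. In a real job the same logic would go inside a custom partitioner's getPartition method, and the number of reduce tasks would need to be at least the largest table entry plus one.

```java
final class GroupingPartitionSketch {
    /**
     * Maps a key to a reduce task in two steps: hash the key into a logical
     * partition, then look up which reduce task owns that partition.
     * Several logical partitions may share one reduce task.
     */
    static int reducerFor(Object key, int[] partitionToReducer) {
        // Mask the sign bit so the modulo result is never negative.
        int logical = (key.hashCode() & Integer.MAX_VALUE) % partitionToReducer.length;
        return partitionToReducer[logical];
    }

    public static void main(String[] args) {
        // Hypothetical table: logical partitions 1 and 2 both go to reduce task 1.
        int[] table = {0, 1, 1, 2};
        for (int key = 0; key < 4; key++) {
            System.out.println("key " + key + " -> reduce task "
                    + reducerFor(Integer.valueOf(key), table));
        }
    }
}
```

Because the grouping happens in the partitioner, the framework still sorts and groups the merged data by key within each reduce task.
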
 
 
 




© 2012   Created by Jason Venner.
