For a job where I already know the data size, how many reducers should I set? How do I work out a suitable number of reducers for my program? Is there a formula for it? Thanks!
Generally speaking, the number of reducers you choose depends on what you are going to do with the final output, the reduce capacity of your cluster, the amount of data that needs to be reduced, and the time needed to perform the reduce.
I usually set the number of reducers to the reduce capacity of my cluster, unless the cluster is very large or I need a very specific number of output files; the most common specific number is 1.
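As a rough illustration of that rule of thumb, here is a minimal Python sketch. The function name, its parameters, and the idea of a fixed cap for very large clusters are my own assumptions for the example, not anything defined by Hadoop. You would pass the result to your job however you normally set the reducer count, for example `Job.setNumReduceTasks` or the `mapreduce.job.reduces` property.

```python
from typing import Optional


def choose_num_reducers(
    reduce_capacity: int,
    required_output_files: Optional[int] = None,
    large_cluster_cap: Optional[int] = None,
) -> int:
    """Pick a reducer count using the rule of thumb described above.

    All names here are hypothetical and exist only for this sketch.
    reduce_capacity: how many reduce tasks the cluster can run at once.
    required_output_files: set this when a specific number of output
        files is needed (with the default output format, each reducer
        typically writes one file).
    large_cluster_cap: optional upper bound for very large clusters;
        the right value is a judgment call, not a Hadoop setting.
    """
    if reduce_capacity < 1:
        raise ValueError("reduce_capacity must be at least 1")

    # A required number of output files overrides everything else.
    if required_output_files is not None:
        if required_output_files < 1:
            raise ValueError("required_output_files must be at least 1")
        return required_output_files

    # On a very large cluster, avoid using every available reduce slot.
    if large_cluster_cap is not None:
        return min(reduce_capacity, large_cluster_cap)

    # Default: match the cluster's reduce capacity.
    return reduce_capacity
```

For example, `choose_num_reducers(40)` gives 40, while `choose_num_reducers(40, required_output_files=1)` gives 1 for the single-output-file case. Treat the numbers as a starting point and tune them against your own data and run times.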
At the moment there are only rough guidelines, because people's hardware and data flows vary so substantially.