Setting the number of map tasks and reduce tasks

Question

I am currently running a job. I fixed the number of map tasks at 20, but I am getting a higher number. I also set the number of reduce tasks to zero, but I am still getting a number other than zero. The total time for the MapReduce job to complete is also not displayed. Can someone tell me what I am doing wrong? I am using this command:

    hadoop jar Test_Parallel_for.jar Test_Parallel_for Matrix/test4.txt Result 3 \
        -D mapred.map.tasks = 20 \
        -D mapred.reduce.tasks =0

Output:

    11/07/30 19:48:56 INFO mapred.JobClient: Job complete: job_201107291018_0164
    11/07/30 19:48:56 INFO mapred.JobClient: Counters: 18
    11/07/30 19:48:56 INFO mapred.JobClient:   Job Counters
    11/07/30 19:48:56 INFO mapred.JobClient:     Launched reduce tasks=13
    11/07/30 19:48:56 INFO mapred.JobClient:     Rack-local map tasks=12
    11/07/30 19:48:56 INFO mapred.JobClient:     Launched map tasks=24
    11/07/30 19:48:56 INFO mapred.JobClient:     Data-local map tasks=12
    11/07/30 19:48:56 INFO mapred.JobClient:   FileSystemCounters
    11/07/30 19:48:56 INFO mapred.JobClient:     FILE_BYTES_READ=4020792636
    11/07/30 19:48:56 INFO mapred.JobClient:     HDFS_BYTES_READ=1556534680
    11/07/30 19:48:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=6026699058
    11/07/30 19:48:56 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1928893942
    11/07/30 19:48:56 INFO mapred.JobClient:   Map-Reduce Framework
    11/07/30 19:48:56 INFO mapred.JobClient:     Reduce input groups=40000000
    11/07/30 19:48:56 INFO mapred.JobClient:     Combine output records=0
    11/07/30 19:48:56 INFO mapred.JobClient:     Map input records=40000000
    11/07/30 19:48:56 INFO mapred.JobClient:     Reduce shuffle bytes=1974162269
    11/07/30 19:48:56 INFO mapred.JobClient:     Reduce output records=40000000
    11/07/30 19:48:56 INFO mapred.JobClient:     Spilled Records=120000000
    11/07/30 19:48:56 INFO mapred.JobClient:     Map output bytes=1928893942
    11/07/30 19:48:56 INFO mapred.JobClient:     Combine input records=0
    11/07/30 19:48:56 INFO mapred.JobClient:     Map output records=40000000
    11/07/30 19:48:56 INFO mapred.JobClient:     Reduce input records=40000000
    [hcrc1425n30]s0907855:

User · Answer

The number of map tasks depends on the file size. If you want n map tasks, divide the file size by n and use the result as the split size, as follows:

    conf.set("mapred.max.split.size", "41943040"); // maximum split file size in bytes
    conf.set("mapred.min.split.size", "20971520"); // minimum split file size in bytes
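
The effect of those two bounds can be sketched in a few lines of Python. This is only a model of the arithmetic, assuming the rule used by the newer FileInputFormat API, split size = max(min size, min(max size, block size)). The function names are made up for illustration, and the sketch ignores the small slack Hadoop allows on the last split.

```python
def compute_split_size(block_size, min_size, max_size):
    # Assumed rule: clamp the HDFS block size between the min and max split size.
    return max(min_size, min(max_size, block_size))


def estimate_map_tasks(file_size, block_size, min_size, max_size):
    # One map task per split; ceiling division because a partial
    # split at the end of the file still needs its own task.
    split_size = compute_split_size(block_size, min_size, max_size)
    return -(-file_size // split_size)


MB = 1024 * 1024

# A hypothetical 400 MB file on 64 MB blocks, with the bounds from the answer above.
print(estimate_map_tasks(400 * MB, 64 * MB, 20971520, 41943040))
```

With a 40 MB maximum, the split size drops below the block size, so the same file is cut into more and smaller splits than the block count alone would give.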

User · Answer

As Praveen mentions above, when using the basic FileInputFormat classes, the number of map tasks is just the number of input splits that constitute the data. The number of reducers is controlled by mapred.reduce.tasks, specified in the way you have it: -D mapred.reduce.tasks=10 would specify 10 reducers. Note that the space after -D is required; if you omit the space, the configuration property is passed along to the relevant JVM, not to Hadoop.

Are you specifying 0 because there is no reduce work to do? In that case, if you're having trouble with the run-time parameter, you can also set the value directly in code. Given a JobConf instance job, call

    job.setNumReduceTasks(0);

inside, say, your implementation of Tool.run. That should produce output directly from the mappers. If your job actually produces no output whatsoever (because you're using the framework just for side-effects like network calls or image processing, or if the results are entirely accounted for in Counter values), you can disable output by also calling

    job.setOutputFormat(NullOutputFormat.class);

User · Answer

Folks, from this theory it seems we cannot run MapReduce jobs in parallel. Let's say I configured a total of 5 mapper tasks to run on a particular node. I also want to use them in such a way that JOB1 can use 3 mappers and JOB2 can use 2 mappers, so that the jobs can run in parallel. But if the above properties are ignored, how can I execute jobs in parallel?

User · Answer

To explain it with an example:

Assume your Hadoop input file size is 2 GB and you set the block size to 64 MB, so 32 mapper tasks are set to run, and each mapper processes one 64 MB block to complete the map phase of your Hadoop job.

The number of mappers set to run depends completely on 1) file size and 2) block size.

Assume you are running Hadoop on a cluster of 4 nodes, and you set the per-node task limits mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum in each node's conf file as follows:

    Node 1: mapred.tasktracker.map.tasks.maximum = 4 and mapred.tasktracker.reduce.tasks.maximum = 4
    Node 2: mapred.tasktracker.map.tasks.maximum = 2 and mapred.tasktracker.reduce.tasks.maximum = 2
    Node 3: mapred.tasktracker.map.tasks.maximum = 4 and mapred.tasktracker.reduce.tasks.maximum = 4
    Node 4: mapred.tasktracker.map.tasks.maximum = 1 and mapred.tasktracker.reduce.tasks.maximum = 1

Assume you set the above parameters for the 4 nodes in this cluster. Notice that Node 2 is set to only 2 and 2, because the processing resources of Node 2 might be lower (e.g. 2 processors, 2 cores), and Node 4 is set even lower, to just 1 and 1, perhaps because that node has 1 processor with 2 cores and so can't run more than 1 mapper and 1 reducer task.

So when you run the job, Node 1, Node 2, Node 3 and Node 4 are configured to run a maximum total of 4 + 2 + 4 + 1 = 11 mapper tasks simultaneously, out of the 32 mapper tasks that need to be completed by the job. After each node completes its map tasks, it takes up more of the mapper tasks remaining from the 32.

Now coming to reducers: as you set mapred.reduce.tasks = 0, we only get the mapper output, in 32 files (1 file for each mapper task), and no reducer output.
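
The arithmetic in this example (a 2 GB file on 64 MB blocks gives 32 map tasks, and the four nodes offer 11 slots between them) can be checked with a small Python sketch. It is an idealized model that assumes every map task takes about the same time, so the tasks run in clean "waves"; real schedulers hand out tasks as slots free up. The function name is hypothetical.

```python
def map_waves(total_tasks, slots_per_node):
    # Total tasks that can run at once across the cluster.
    concurrent = sum(slots_per_node)
    # Ceiling division: a final, partly filled wave still counts as a wave.
    return -(-total_tasks // concurrent)


total_tasks = (2 * 1024) // 64   # 2 GB of input in 64 MB blocks
slots = [4, 2, 4, 1]             # per-node map slots from the example

print(total_tasks, sum(slots), map_waves(total_tasks, slots))
```

Under these assumptions the job needs three rounds: two full rounds of 11 tasks and a last round with the remaining 10.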

User · Answer

It's important to keep in mind that the MapReduce framework in Hadoop allows us only to suggest the number of map tasks for a job, which, like Praveen pointed out above, will correspond to the number of input splits for the task. This is unlike its behavior for the number of reducers (which is directly related to the number of files output by the MapReduce job), where we can demand that it provide n reducers.

User · Answer

The first part has already been answered: it is just a suggestion. The second part has also been answered: remove the extra spaces around =. If neither of these worked, are you sure you have implemented ToolRunner?

User · Answer

From what I understand reading the above, it depends on the input files. If there are 100 input files, Hadoop will create 100 map tasks (assuming each file fits in a single split). However, how many can run at one time depends on the node configuration. If a node is configured to run 10 map tasks, only 10 map tasks will run in parallel, picking 10 different input files out of the 100 available. Map tasks will continue to fetch more files as and when they complete processing of a file.
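
A short Python sketch of why the file count matters: with the plain FileInputFormat family, a split never spans two files, so every file contributes at least one map task no matter how small it is. The function name is illustrative, the sketch assumes a fixed split size, and it does not cover CombineFileInputFormat, which exists precisely to pack small files together.

```python
def count_map_tasks(file_sizes, split_size):
    total = 0
    for size in file_sizes:
        # Splits are computed per file; each file yields at least one
        # split (assumption: this also holds for very small files).
        total += max(1, -(-size // split_size))
    return total


MB = 1024 * 1024

# 100 small files of 1 MB each versus one 100 MB file, 64 MB splits.
print(count_map_tasks([1 * MB] * 100, 64 * MB))
print(count_map_tasks([100 * MB], 64 * MB))
```

The same 100 MB of data produces far more map tasks when it arrives as many small files than as one large file.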

User · Answer

I agree that the number of map tasks depends on the input splits, but in some scenarios I see slightly different behavior.

Case 1: I created a simple map-only job, and it creates 2 duplicate output files (the data is the same). The command I gave is below:

    bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -D mapred.reduce.tasks=0 -input /home/sample.csv -output /home/sample.csv112.txt -mapper /home/amitav/workpython/readcsv.py

Case 2: So I restricted the number of map tasks to 1. The output came out correctly, with one output file, but one reducer was also launched according to the UI screen, although I had restricted the reducer count. The command is given below:

    bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -D mapred.map.tasks=1 mapred.reduce.tasks=0 -input /home/sample.csv -output /home/sample.csv115.txt -mapper /home/amitav/workpython/readcsv.py

User · Answer

The number of map tasks for a given job is driven by the number of input splits and not by the mapred.map.tasks parameter. For each input split a map task is spawned. So, over the lifetime of a MapReduce job, the number of map tasks is equal to the number of input splits. mapred.map.tasks is just a hint to the InputFormat for the number of maps.

In your example Hadoop has determined there are 24 input splits and will spawn 24 map tasks in total. But you can control how many map tasks can be executed in parallel by each of the task trackers.

Also, removing a space after -D might solve the problem for reduce.

For more information on the number of map and reduce tasks, please look at the URL below:

https://cwiki.apache.org/confluence/display/HADOOP2/HowManyMapsAndReduces
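
What "just a hint" means can be sketched in Python. The model below assumes the rule used by the older org.apache.hadoop.mapred FileInputFormat: the requested map count is turned into a goal size (total input size divided by the requested count), and the split size is max(min size, min(goal size, block size)). The function name and the input sizes are hypothetical.

```python
def maps_from_hint(total_size, requested_maps, block_size, min_size=1):
    # The hint only sets a target size per split...
    goal_size = total_size // max(requested_maps, 1)
    # ...which is then capped by the block size and floored by the minimum.
    split_size = max(min_size, min(goal_size, block_size))
    return -(-total_size // split_size)  # ceiling division


MB = 1024 * 1024
total = 1280 * MB  # hypothetical single input file on 64 MB blocks

print(maps_from_hint(total, 10, 64 * MB))  # hint below the block count
print(maps_from_hint(total, 40, 64 * MB))  # hint above the block count
```

Under this rule a hint smaller than the number of blocks has no effect, because the block size caps the split size, while a larger hint shrinks the splits and so raises the number of map tasks.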

User · Answer

In newer versions of Hadoop, there are the much more granular mapreduce.job.running.map.limit and mapreduce.job.running.reduce.limit, which let you cap how many map and reduce tasks of a job run at the same time, irrespective of the HDFS file split size. This is helpful if you are under a constraint not to take up large resources in the cluster.

JIRA

User · Answer

Use -D property=value rather than -D property = value (eliminate the extra whitespace). Thus -D mapred.reduce.tasks=value would work fine.

Setting the number of map tasks doesn't always reflect the value you have set, since it depends on the split size and the InputFormat used.

Setting the number of reduces will definitely override the number of reduces set in the cluster/client-side configuration.

User · Answer

One way you can increase the number of mappers is to give your input in the form of split files (you can use the Linux split command). Hadoop streaming usually assigns as many mappers as there are input files if there is a large number of files; if not, it will try to split the input into equal-sized parts.

User · Answer

In your example, the -D parts are not picked up:

    hadoop jar Test_Parallel_for.jar Test_Parallel_for Matrix/test4.txt Result 3 \
        -D mapred.map.tasks = 20 \
        -D mapred.reduce.tasks =0

They should come after the classname part, like this:

    hadoop jar Test_Parallel_for.jar Test_Parallel_for -Dmapred.map.tasks=20 -Dmapred.reduce.tasks=0 Matrix/test4.txt Result 3

A space after -D is allowed though.

Also note that changing the number of mappers is probably a bad idea, as other people have mentioned here.
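
Why the position matters can be illustrated with a toy parser in Python. This is a simplified model, not Hadoop's GenericOptionsParser itself: it assumes that option parsing stops at the first argument that is not an option, so any -D that follows the job's own arguments is never seen. The function name and the argument values are hypothetical.

```python
def parse_generic_options(args):
    """Collect leading -D key=value options; return (properties, remaining args)."""
    props = {}
    i = 0
    while i < len(args):
        arg = args[i]
        if arg == "-D" and i + 1 < len(args):
            pair = args[i + 1]        # form: -D key=value
            i += 2
        elif arg.startswith("-D") and len(arg) > 2:
            pair = arg[2:]            # form: -Dkey=value
            i += 1
        else:
            break                     # first positional argument ends option parsing
        key, _, value = pair.partition("=")
        props[key] = value
    return props, args[i:]


# Options placed after the positional arguments are left untouched.
late = ["in.txt", "out", "3", "-D", "mapred.reduce.tasks=0"]
print(parse_generic_options(late))

# Options placed first are picked up, with or without a space after -D.
early = ["-Dmapred.map.tasks=20", "-D", "mapred.reduce.tasks=0", "in.txt", "out", "3"]
print(parse_generic_options(early))
```

In the first call the property dictionary stays empty and the -D tokens are handed to the job as ordinary arguments, which matches the symptom described in the question.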

User · Answer

The number of map tasks is directly defined by the number of chunks your input is split into. The size of a data chunk (i.e. the HDFS block size) is controllable and can be set for an individual file, a set of files, or directory(-s). So, setting a specific number of map tasks in a job is possible, but it involves setting a corresponding HDFS block size for the job's input data. mapred.map.tasks can be used for that too, but only if its provided value is greater than the number of splits for the job's input data.

Controlling the number of reducers via mapred.reduce.tasks is correct. However, setting it to zero is a rather special case: the job's output is a concatenation of the mappers' outputs (non-sorted). In Matt's answer one can see more ways to set the number of reducers.

User · Answer

From your log I understand that you have 12 input files, as there are 12 data-local maps generated. Rack-local maps are spawned for the same file if some of the blocks of that file are on another data node. How many data nodes do you have?
