PIG how to count a number of rows in alias

Question

I did something like this to count the number of rows in an alias in Pig:

    logs = LOAD 'log';
    logs_w_one = FOREACH logs GENERATE 1 AS one;
    logs_group = GROUP logs_w_one ALL;
    logs_count = FOREACH logs_group GENERATE SUM(logs_w_one.one);
    DUMP logs_count;

This seems too inefficient. Please enlighten me if there is a better way.

User · Answer

Here is a version with an optimization. The other solutions require Pig to read and write the full tuple when counting; the macro below writes only 1s:

    DEFINE row_count(inBag, name) RETURNS result {
        -- replace every tuple with the constant 1 so no real data is carried along
        X = FOREACH $inBag GENERATE 1;
        $result = FOREACH (GROUP X ALL PARALLEL 1) GENERATE '$name', COUNT(X);
    };

Then use it like this:

    xxx = row_count(rows, 'rows_count');

User · Answer

Basic counting is done as stated in the other answers and in the Pig documentation:

    logs = LOAD 'log';
    all_logs_in_a_bag = GROUP logs ALL;
    log_count = FOREACH all_logs_in_a_bag GENERATE COUNT(logs);
    DUMP log_count;

You are right that counting is inefficient, even when using Pig's builtin COUNT, because this will use one reducer. However, I had a revelation today that one way to speed it up is to reduce the RAM utilization of the relation we're counting.

In other words, when counting a relation we don't actually care about the data itself, so let's use as little RAM as possible. You were on the right track with the first iteration of your count script:

    logs = LOAD 'log';
    ones = FOREACH logs GENERATE 1 AS one:int;
    counter_group = GROUP ones ALL;
    log_count = FOREACH counter_group GENERATE COUNT(ones);
    DUMP log_count;

This will work on much larger relations than the previous script and should be much faster. The main difference between this and your original script is that we don't need to sum anything.

This also doesn't have the problem some other solutions have, where null values affect the count. Because every tuple is replaced by the constant 1, this counts all the rows, regardless of whether the first column is null.

User · Answer

COUNT is part of Pig; see the manual.

    LOGS = LOAD 'log';
    LOGS_GROUP = GROUP LOGS ALL;
    LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT(LOGS);

User · Answer

What you want is to count all the lines in a relation (a dataset, in Pig Latin terms). This is very easy if you follow these steps:

    logs = LOAD 'log'; --relation called logs, using PigStorage with tab as field delimiter
    logs_grouped = GROUP logs ALL; --gives a relation with one row, with logs as a bag
    number = FOREACH logs_grouped GENERATE COUNT_STAR(logs); --show me the number

I have to say Kevin's point is important: using COUNT instead of COUNT_STAR, we would get only the number of lines whose first field is not null.

I also like Jerome's one-line syntax, which is more concise, but to be didactic I prefer to split it in two and add some comments.

In general I prefer:

    numerito = FOREACH (GROUP CARGADOS3 ALL) GENERATE COUNT_STAR(CARGADOS3);

over:

    name = GROUP CARGADOS3 ALL;
    number = FOREACH name GENERATE COUNT_STAR(CARGADOS3);

User · Answer

Be careful: COUNT only counts tuples whose first field is not null. To count all rows regardless of nulls, use the function COUNT_STAR instead.
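The difference between the two functions can be seen side by side in a minimal sketch. The file name `sample.csv`, its contents, and the field names `label` and `hits` are all hypothetical, chosen only for illustration; the sketch assumes PigStorage loads an empty field as null.

```pig
-- Hypothetical input file 'sample.csv' with three comma-separated lines:
--   a,1
--   ,2
--   b,3
-- The second line has an empty first field, which is assumed to load as null.
rows = LOAD 'sample.csv' USING PigStorage(',') AS (label:chararray, hits:int);

-- Put every tuple into a single bag so the aggregates see the whole relation.
grouped = GROUP rows ALL;

counts = FOREACH grouped GENERATE
    COUNT(rows) AS non_null_first_field,  -- skips tuples whose first field is null
    COUNT_STAR(rows) AS all_rows;         -- counts every tuple

DUMP counts;
```

Under those assumptions the two columns should differ by the number of rows with a null first field, which is why COUNT_STAR is the safer choice when the goal is a plain row count.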

User · Answer

Arnon Rotem-Gal-Oz already answered this question a while ago, but I thought some may like this slightly more concise version:

    LOGS = LOAD 'log';
    LOG_COUNT = FOREACH (GROUP LOGS ALL) GENERATE COUNT(LOGS);

User · Answer

Use COUNT_STAR:

    LOGS = LOAD 'log';
    LOGS_GROUP = GROUP LOGS ALL;
    LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT_STAR(LOGS);
