HBase: quickly count number of rows

Question

Right now I implement the row count over a ResultScanner like this:

    for (Result rs = scanner.next(); rs != null; rs = scanner.next()) {
        number++;
    }

When the data reaches millions of rows, the computation takes a long time. I want to compute the count in real time, so I don't want to use MapReduce.

How can I quickly count the number of rows?

User · Answer

You can find a sample example here:

    /**
     * Used to get the number of rows of the table
     * @param tableName
     * @param familyNames
     * @return the number of rows
     * @throws IOException
     */
    public long countRows(String tableName, String... familyNames) throws IOException {
        long rowCount = 0;
        Configuration configuration = connection.getConfiguration();
        // Increase RPC timeout, in case of a slow computation
        configuration.setLong("hbase.rpc.timeout", 600000);
        // Default is 1, set to a higher value for faster scanner.next(..)
        configuration.setLong("hbase.client.scanner.caching", 1000);

        AggregationClient aggregationClient = new AggregationClient(configuration);
        try {
            Scan scan = new Scan();
            if (familyNames != null && familyNames.length > 0) {
                for (String familyName : familyNames) {
                    scan.addFamily(Bytes.toBytes(familyName));
                }
            }
            rowCount = aggregationClient.rowCount(TableName.valueOf(tableName), new LongColumnInterpreter(), scan);
        } catch (Throwable e) {
            throw new IOException(e);
        }
        return rowCount;
    }

User · Answer

Use RowCounter in HBase. RowCounter is a MapReduce job that counts all the rows of a table. It is a good utility to use as a sanity check to ensure that HBase can read all the blocks of a table if there are any concerns about metadata inconsistency. It will run the MapReduce job in a single process, but it will run faster if you have a MapReduce cluster in place for it to exploit.

    $ hbase org.apache.hadoop.hbase.mapreduce.RowCounter <tablename>

    Usage: RowCounter [options]
        <tablename> [
            --starttime=[start]
            --endtime=[end]
            [--range=[startKey],[endKey]]
            [<column1> <column2>...]
        ]

User · Answer

If you cannot use RowCounter for whatever reason, then a combination of these two filters should be an optimal way to get a count:

    FirstKeyOnlyFilter() AND KeyOnlyFilter()

The FirstKeyOnlyFilter will result in the scanner only returning the first column qualifier it finds in each row, as opposed to the scanner returning all of the column qualifiers in the table, which will minimize the network bandwidth. What about simply picking one column qualifier to return? This would work if you could guarantee that the column qualifier exists for every row, but if that is not true then you would get an inaccurate count.

The KeyOnlyFilter will result in the scanner returning only the key of each cell (row, column family, qualifier, timestamp), without the value. This further reduces the network bandwidth, which in the general case wouldn't account for much of a reduction, but there can be an edge case where the first column picked by the previous filter just happens to be an extremely large value.

I tried playing around with scan.setCaching, but the results were all over the place. Perhaps it could help.

I had 16 million rows in between a start and stop row, on which I did the following pseudo-empirical testing.

With FirstKeyOnlyFilter and KeyOnlyFilter activated:

- With caching not set (i.e., the default value), it took 188 seconds.
- With caching set to 1, it took 188 seconds.
- With caching set to 10, it took 200 seconds.
- With caching set to 100, it took 187 seconds.
- With caching set to 1000, it took 183 seconds.
- With caching set to 10000, it took 199 seconds.
- With caching set to 100000, it took 199 seconds.

With FirstKeyOnlyFilter and KeyOnlyFilter disabled:

- With caching not set (i.e., the default value), it took 309 seconds.

I didn't bother to do proper testing on this, but it seems clear that the FirstKeyOnlyFilter and KeyOnlyFilter are good. Moreover, the cells in this particular table are very small, so I think the filters would have been even better on a different table.

Here is a Java code sample:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
    import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
    import org.apache.hadoop.hbase.filter.FilterList;

    public class HBaseCount {
        public static void main(String[] args) throws IOException {
            Configuration config = HBaseConfiguration.create();

            HTable table = new HTable(config, "my_table");

            // Scan the rows between a start row (inclusive) and a stop row (exclusive).
            Scan scan = new Scan(
                Bytes.toBytes("foo"),
                Bytes.toBytes("foo~")
            );

            // Optional first argument: the scanner caching value to test.
            if (args.length == 1) {
                scan.setCaching(Integer.valueOf(args[0]));
            }
            System.out.println("scan's caching is " + scan.getCaching());

            // A FilterList requires every filter to pass by default (MUST_PASS_ALL).
            FilterList allFilters = new FilterList();
            allFilters.addFilter(new FirstKeyOnlyFilter());
            allFilters.addFilter(new KeyOnlyFilter());

            scan.setFilter(allFilters);

            ResultScanner scanner = table.getScanner(scan);

            int count = 0;

            long start = System.currentTimeMillis();

            try {
                for (Result rr = scanner.next(); rr != null; rr = scanner.next()) {
                    count += 1;
                    if (count % 100000 == 0) System.out.println(count);
                }
            } finally {
                scanner.close();
                table.close();
            }

            long end = System.currentTimeMillis();

            long elapsedTime = end - start;

            System.out.println("Elapsed time was " + (elapsedTime / 1000F));
        }
    }

Here is a pychbase code sample:

    from pychbase import Connection
    c = Connection()
    t = c.table('my_table')
    # Under the hood this applies the FirstKeyOnlyFilter and KeyOnlyFilter
    # similar to the happybase example below
    print(t.count(row_prefix="foo"))

Here is a Happybase code sample:

    from happybase import Connection
    c = Connection(...)
    t = c.table('my_table')
    count = 0
    for _ in t.scan(filter='FirstKeyOnlyFilter() AND KeyOnlyFilter()'):
        count += 1

    print(count)

Thanks to @Tuckr and @KennyCason for the tip.
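To put the timings quoted above in perspective, the arithmetic can be written out as a small Java sketch. The class and method names are made up for illustration; the only inputs are the figures reported in the answer (16 million rows, 188 seconds with both filters, 309 seconds without).

```java
// Hypothetical helper for the arithmetic only; it does not talk to HBase.
final class ScanTimings {
    // Average scan throughput in rows per second.
    static double rowsPerSecond(long rows, double seconds) {
        return rows / seconds;
    }

    // Relative reduction in elapsed time, as a percentage of the slower run.
    static double percentReduction(double before, double after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        long rows = 16_000_000L;      // rows between the start and stop row
        double withFilters = 188.0;   // seconds, both filters, default caching
        double withoutFilters = 309.0; // seconds, no filters, default caching

        System.out.printf("with filters:    %.0f rows/s%n", rowsPerSecond(rows, withFilters));
        System.out.printf("without filters: %.0f rows/s%n", rowsPerSecond(rows, withoutFilters));
        System.out.printf("time saved:      %.1f%%%n", percentReduction(withoutFilters, withFilters));
    }
}
```

On these numbers the filtered scan handles roughly 85,000 rows per second against roughly 52,000 without the filters, a reduction in elapsed time of about 39%. Even so, a plain scan of this kind stays in the range of minutes for tens of millions of rows.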

User · Answer

To count the records of an HBase table on a proper YARN cluster, you also have to set the MapReduce job queue name:

    hbase org.apache.hadoop.hbase.mapreduce.RowCounter -Dmapreduce.job.queuename=<Your Q Name which you have SUBMIT access> <TABLE_NAME>
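If the job has to be launched from a Java program rather than typed by hand, the command line above could be assembled along these lines. This is only a sketch: `RowCounterCommand` and `build` are hypothetical names, it assumes the `hbase` launcher is on the PATH, and the queue and table names in the usage note are placeholders.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that assembles the RowCounter command line shown above.
final class RowCounterCommand {
    // queueName may be null or empty when no YARN queue has to be specified.
    static List<String> build(String queueName, String tableName) {
        List<String> cmd = new ArrayList<>();
        cmd.add("hbase"); // assumption: the hbase launcher script is on the PATH
        cmd.add("org.apache.hadoop.hbase.mapreduce.RowCounter");
        if (queueName != null && !queueName.isEmpty()) {
            // The -D option goes before the table name, as in the command above.
            cmd.add("-Dmapreduce.job.queuename=" + queueName);
        }
        cmd.add(tableName);
        return cmd;
    }
}
```

The resulting list can be handed to `new ProcessBuilder(RowCounterCommand.build("my_queue", "my_table")).inheritIO().start()`, which runs the job and forwards its output to the console.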

User · Answer

Go to the HBase home directory and run this command:

    ./bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'namespace:tablename'

This will launch a MapReduce job, and the output will show the number of records existing in the HBase table.

User · Answer

You can use the count method in the HBase shell to count the number of rows. But yes, counting the rows of a large table can be slow.

    count 'tablename' [interval]

Return value is the number of rows.

This operation may take a LONG time (run '$HADOOP_HOME/bin/hadoop jar hbase.jar rowcount' to run a counting MapReduce job). Current count is shown every 1000 rows by default. Count interval may be optionally specified. Scan caching is enabled on count scans by default. Default cache size is 10 rows. If your rows are small in size, you may want to increase this parameter.

Examples:

    hbase> count 't1'
    hbase> count 't1', INTERVAL => 100000
    hbase> count 't1', CACHE => 1000
    hbase> count 't1', INTERVAL => 10, CACHE => 1000

The same commands can also be run on a table reference. Suppose you had a reference to table 't1'; the corresponding commands would be:

    hbase> t.count
    hbase> t.count INTERVAL => 100000
    hbase> t.count CACHE => 1000
    hbase> t.count INTERVAL => 10, CACHE => 1000

User · Answer

You could try the HBase API class org.apache.hadoop.hbase.client.coprocessor.AggregationClient. Note that it only works if the AggregateImplementation coprocessor is loaded on the table.

User · Answer

A simple, effective, and efficient way to count rows in HBase:

Whenever you insert a row, trigger this API, which will increment that particular cell:

    Htable.incrementColumnValue(Bytes.toBytes("count"), Bytes.toBytes("details"), Bytes.toBytes("count"), 1);

To check the number of rows present in that table, just use the "Get" or "scan" API for that particular row 'count'.

By using this method you can get the row count in less than a millisecond.

User · Answer

If you're using a scanner, try to have it return as few qualifiers as possible. In fact, the qualifier(s) that you do return should be the smallest (in byte size) that you have available. This will speed up your scan tremendously.

Unfortunately this will only scale so far (millions to billions?). To take it further, you can do this in real time, but you will first need to run a MapReduce job to count all rows.

Store the MapReduce output in a cell in HBase. Every time you add a row, increment the counter by 1. Every time you delete a row, decrement the counter.

When you need to access the number of rows in real time, you read that field in HBase.

There is no fast way to count the rows otherwise in a way that scales. You can only count so fast.
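The bookkeeping described in this answer can be sketched as follows. This is an in-memory stand-in, not HBase code: `MaintainedRowCount` is a hypothetical class, and in a real deployment the `AtomicLong` would be a counter cell in HBase, updated with something like `Table.incrementColumnValue(row, family, qualifier, amount)`.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical in-memory stand-in for a counter cell stored in HBase.
final class MaintainedRowCount {
    private final AtomicLong count;

    // Seed the counter with the result of a one-off full count,
    // for example the output of a RowCounter MapReduce run.
    MaintainedRowCount(long initialCount) {
        this.count = new AtomicLong(initialCount);
    }

    // Call once for every row that is newly added to the table.
    long onRowAdded() {
        return count.incrementAndGet();
    }

    // Call once for every row that is deleted from the table.
    long onRowDeleted() {
        return count.decrementAndGet();
    }

    // Reading the count is a single lookup, independent of table size.
    long current() {
        return count.get();
    }
}
```

One caveat with this design: a put to a row key that already exists overwrites the row rather than adding one, so the counter stays accurate only if the writer increments it for genuinely new rows.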

User · Answer

Two ways worked for me to get the row count of an HBase table quickly.

Scenario #1

If the HBase table is small, log in to the HBase shell as a valid user and execute:

    > count '<tablename>'

Example:

    > count 'employee'
    6 row(s) in 0.1110 seconds

Scenario #2

If the HBase table is large, execute the built-in RowCounter MapReduce job. Log in to the Hadoop machine as a valid user and execute:

    $HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter '<tablename>'

Example:

    $HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'employee'
        ...
                    Virtual memory (bytes) snapshot=22594633728
                    Total committed heap usage (bytes)=5093457920
            org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters
                    ROWS=6
            File Input Format Counters
                    Bytes Read=0
            File Output Format Counters
                    Bytes Written=0
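When the RowCounter job is run from a script, the count has to be picked out of the job output shown above. A minimal Java sketch of that step, assuming the output has been captured as a string and contains a `ROWS=<n>` counter line as in the example; `RowCounterOutput` and `parseRows` are hypothetical names, not part of HBase.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the ROWS counter from captured RowCounter output.
final class RowCounterOutput {
    // Matches a line consisting of "ROWS=<digits>", ignoring surrounding whitespace.
    private static final Pattern ROWS =
            Pattern.compile("^\\s*ROWS=(\\d+)\\s*$", Pattern.MULTILINE);

    // Returns the row count, or an empty value if no ROWS line is present.
    static OptionalLong parseRows(String jobOutput) {
        Matcher m = ROWS.matcher(jobOutput);
        return m.find()
                ? OptionalLong.of(Long.parseLong(m.group(1)))
                : OptionalLong.empty();
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
                "        org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters",
                "                ROWS=6",
                "        File Input Format Counters",
                "                Bytes Read=0");
        System.out.println(parseRows(sample).getAsLong());
    }
}
```

Returning an `OptionalLong` rather than a bare number keeps a missing counter line (for example, after a failed job) distinguishable from a table with zero rows.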

User · Answer

You can use a coprocessor, which has been available since HBase 0.92. See Coprocessor and AggregateProtocol, and the example.

User · Answer

Use the rowcount MapReduce job that's included with HBase.
