Pandas aggregate count distinct

Question

Let's say I have a log of user activity and I want to generate a report of total duration and the number of unique users per day.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'date': ['2013-04-01', '2013-04-01', '2013-04-01', '2013-04-02', '2013-04-02'],
                       'user_id': ['0001', '0001', '0002', '0002', '0002'],
                       'duration': [30, 15, 20, 15, 30]})

Aggregating duration is pretty straightforward:

    group = df.groupby('date')
    agg = group.aggregate({'duration': np.sum})
    agg

                duration
    date
    2013-04-01        65
    2013-04-02        45

What I'd like to do is sum the duration and count distincts at the same time, but I can't seem to find an equivalent for count_distinct:

    agg = group.aggregate({'duration': np.sum, 'user_id': count_distinct})

This works, but surely there's a better way, no?

    group = df.groupby('date')
    agg = group.aggregate({'duration': np.sum})
    agg['uv'] = df.groupby('date').user_id.nunique()
    agg

                duration  uv
    date
    2013-04-01        65   2
    2013-04-02        45   1

I'm thinking I just need to provide a function that returns the count of distinct items of a Series object to the aggregate function, but I don't have a lot of exposure to the various libraries at my disposal. Also, it seems that the groupby object already knows this information, so wouldn't I just be duplicating effort?

User · Answer

Just adding to the answers already given: the solution using the string "nunique" seems much faster. Tested here on a ~21M-row dataframe, then grouped to ~2M.

    %time g.agg({"id": lambda x: x.nunique()})
    CPU times: user 3min 3s, sys: 2.94 s, total: 3min 6s
    Wall time: 3min 20s

    %time g.agg({"id": pd.Series.nunique})
    CPU times: user 3min 2s, sys: 2.44 s, total: 3min 4s
    Wall time: 3min 18s

    %time g.agg({"id": "nunique"})
    CPU times: user 14 s, sys: 4.76 s, total: 18.8 s
    Wall time: 24.4 s
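To get a feel for this comparison on your own machine, here is a rough sketch that times the three variants on synthetic data. The frame `df_bench`, its column names, the row and group counts, and the helper `timed_agg` are all assumptions made for illustration; the data is far smaller than the frame described in this answer, and absolute timings will differ with pandas version and hardware.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data (an assumption): 100k rows spread over about 5k groups.
rng = np.random.default_rng(0)
n_rows = 100_000
df_bench = pd.DataFrame({
    "key": rng.integers(0, 5_000, size=n_rows),
    "id": rng.integers(0, 1_000, size=n_rows),
})
g = df_bench.groupby("key")


def timed_agg(func):
    """Run g.agg on the 'id' column with func; return (result, seconds)."""
    start = time.perf_counter()
    out = g.agg({"id": func})
    return out, time.perf_counter() - start


# The three spellings compared in this answer.
variants = {
    "lambda": lambda x: x.nunique(),
    "pd.Series.nunique": pd.Series.nunique,
    "'nunique' string": "nunique",
}

results = {}
for label, func in variants.items():
    results[label], seconds = timed_agg(func)
    print(f"{label:>18}: {seconds:.3f} s")
```

All three calls should return the same counts; only the time taken to compute them differs.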

User · Answer

'nunique' is an option for .agg() since pandas 0.20.0, so:

    df.groupby('date').agg({'duration': 'sum', 'user_id': 'nunique'})
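For a self-contained run on the question's sample data, the sketch below shows the dict form next to named aggregation. Named aggregation is not mentioned in this thread; it is available in pandas 0.25 and later, and it lets you name the output column directly (here `uv`, matching the question's workaround). The variable names `by_dict` and `by_name` are illustrative only.

```python
import pandas as pd

# Sample data from the question; user_id stays a string to keep leading zeros.
df = pd.DataFrame({
    "date": ["2013-04-01", "2013-04-01", "2013-04-01", "2013-04-02", "2013-04-02"],
    "user_id": ["0001", "0001", "0002", "0002", "0002"],
    "duration": [30, 15, 20, 15, 30],
})

# Dict form: one aggregation per column; output keeps the input column names.
by_dict = df.groupby("date").agg({"duration": "sum", "user_id": "nunique"})

# Named aggregation (pandas >= 0.25): output_name=(input_column, function).
by_name = df.groupby("date").agg(
    duration=("duration", "sum"),
    uv=("user_id", "nunique"),
)

print(by_dict)
print(by_name)
```

Both results carry the same numbers; the named form just avoids a separate rename step when the distinct count should not be called `user_id`.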

User · Answer

How about either of:

    >>> df
             date  duration user_id
    0  2013-04-01        30    0001
    1  2013-04-01        15    0001
    2  2013-04-01        20    0002
    3  2013-04-02        15    0002
    4  2013-04-02        30    0002
    >>> df.groupby("date").agg({"duration": np.sum, "user_id": pd.Series.nunique})
                duration  user_id
    date
    2013-04-01        65        2
    2013-04-02        45        1
    >>> df.groupby("date").agg({"duration": np.sum, "user_id": lambda x: x.nunique()})
                duration  user_id
    date
    2013-04-01        65        2
    2013-04-02        45        1
