GroupBy pandas DataFrame and select most common value

Question

I have a data frame with three string columns. I know that only one value in the 3rd column is valid for every combination of the first two. To clean the data, I have to group the data frame by the first two columns and select the most common value of the third column for each combination.

My code:

    import pandas as pd
    from scipy import stats

    source = pd.DataFrame({'Country': ['USA', 'USA', 'Russia', 'USA'],
                           'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                           'Short name': ['NY', 'New', 'Spb', 'NY']})

    print(source.groupby(['Country', 'City']).agg(lambda x: stats.mode(x['Short name'])[0]))

The last line of code doesn't work: it says "Key error 'Short name'", and if I try to group only by City, I get an AssertionError. What can I do to fix it?

User · Answer

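The problem here is performance: if you have a lot of rows, it will be a problem.

If that is your case, please try this:

    import pandas as pd

    source = pd.DataFrame({'Country': ['USA', 'USA', 'Russia', 'USA'],
                           'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                           'Short name': ['NY', 'New', 'Spb', 'NY']})

    # Per-group lambda (slow on many groups)
    source.groupby(['Country', 'City']).agg(lambda x: x.value_counts().index[0])

    # Vectorized counting
    source.groupby(['Country', 'City'])['Short name'].value_counts().groupby(['Country', 'City']).first()

Note that the second expression returns the count of the most common value in each group. The value itself is the last index level of the value_counts() result, before the final groupby.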
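A vectorized variant that returns the most common value itself, rather than its count, can be sketched as follows. This is a minimal sketch and not part of the answer above: it relies on SeriesGroupBy.value_counts sorting by frequency within each group (its default), and the names `keys`, `counts`, `top` and `most_common` are illustrative only.

```python
import pandas as pd

source = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA'],
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY'],
})
keys = ['Country', 'City']

# Frequencies of each 'Short name' within every (Country, City) group,
# sorted from most to least common inside each group.
counts = source.groupby(keys)['Short name'].value_counts()

# Keep only the first (most frequent) row of every group.
top = counts.groupby(level=keys).head(1)

# Move the winning value out of the index into a regular column.
# The rename avoids a name clash on pandas versions where the counts
# Series is itself called 'Short name'.
most_common = top.rename('count').reset_index(level='Short name')
print(most_common)
```

Ties are resolved by whichever value value_counts happens to list first, so this is only suitable when any one of the tied values is acceptable.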

User · Answer

Pandas >= 0.16

pd.Series.mode is available!

Use groupby, GroupBy.agg, and apply the pd.Series.mode function to each group:

    source.groupby(['Country', 'City'])['Short name'].agg(pd.Series.mode)

    Country  City
    Russia   Sankt-Petersburg    Spb
    USA      New-York             NY
    Name: Short name, dtype: object

If this is needed as a DataFrame, use

    source.groupby(['Country', 'City'])['Short name'].agg(pd.Series.mode).to_frame()

                             Short name
    Country City
    Russia  Sankt-Petersburg        Spb
    USA     New-York                 NY

The useful thing about Series.mode is that it always returns a Series, making it very compatible with agg and apply, especially when reconstructing the groupby output. It is also faster.

    # Accepted answer.
    %timeit source.groupby(['Country', 'City']).agg(lambda x: x.value_counts().index[0])
    # Proposed in this post.
    %timeit source.groupby(['Country', 'City'])['Short name'].agg(pd.Series.mode)

    5.56 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    2.76 ms ± 387 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Dealing with Multiple Modes

Series.mode also does a good job when there are multiple modes:

    source2 = pd.concat(
        [source,
         pd.DataFrame([{'Country': 'USA', 'City': 'New-York', 'Short name': 'New'}])],
        ignore_index=True)

    # Now `source2` has two modes for the
    # ("USA", "New-York") group, they are "NY" and "New".
    source2

      Country              City Short name
    0     USA          New-York         NY
    1     USA          New-York        New
    2  Russia  Sankt-Petersburg        Spb
    3     USA          New-York         NY
    4     USA          New-York        New

    source2.groupby(['Country', 'City'])['Short name'].agg(pd.Series.mode)

    Country  City
    Russia   Sankt-Petersburg          Spb
    USA      New-York            [NY, New]
    Name: Short name, dtype: object

Or, if you want a separate row for each mode, you can use GroupBy.apply:

    source2.groupby(['Country', 'City'])['Short name'].apply(pd.Series.mode)

    Country  City
    Russia   Sankt-Petersburg  0    Spb
    USA      New-York          0     NY
                               1    New
    Name: Short name, dtype: object

If you don't care which mode is returned as long as it's one of them, then you will need a lambda that calls mode and extracts the first result.

    source2.groupby(['Country', 'City'])['Short name'].agg(
        lambda x: pd.Series.mode(x)[0])

    Country  City
    Russia   Sankt-Petersburg    Spb
    USA      New-York             NY
    Name: Short name, dtype: object

Alternatives to (not) consider

You can also use statistics.mode from Python, but...

    source.groupby(['Country', 'City'])['Short name'].apply(statistics.mode)

    Country  City
    Russia   Sankt-Petersburg    Spb
    USA      New-York             NY
    Name: Short name, dtype: object

...it does not work well when having to deal with multiple modes: on Python versions before 3.8 a StatisticsError is raised (from 3.8 on, the first mode encountered is returned instead). This is mentioned in the docs for those versions:

    If data is empty, or if there is not exactly one most common value,
    StatisticsError is raised.

But you can see for yourself...

    statistics.mode([1, 2])
    # ---------------------------------------------------------------------------
    # StatisticsError                           Traceback (most recent call last)
    # ...
    # StatisticsError: no unique mode; found 2 equally common values
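On Python 3.8 and later, the standard library also provides statistics.multimode, which returns every mode as a list and does not raise on ties. The following is a minimal sketch of using it per group, not something proposed in the answer above. The frame mirrors `source2`, and `modes_by_group` is an illustrative name.

```python
import statistics

import pandas as pd

# Same rows as `source2`: the ("USA", "New-York") group has two modes.
source2 = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA', 'USA'],
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY', 'New'],
})

# Iterating a grouped column yields (group key, Series) pairs.
# multimode returns all modes in order of first appearance.
modes_by_group = {
    key: statistics.multimode(values)
    for key, values in source2.groupby(['Country', 'City'])['Short name']
}
print(modes_by_group)
```

A plain dict is used here to keep the sketch simple. It loops over groups in Python, so it is not meant for large numbers of groups.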

User · Answer

Formally, the correct answer is the @eumiro solution. The problem with @HYRY's solution is that when you have a sequence of numbers like [1, 2, 3, 4], the solution is wrong, i.e., you don't get the mode. Example:

    >>> import pandas as pd
    >>> df = pd.DataFrame(
    ...     {
    ...         'client': ['A', 'B', 'A', 'B', 'B', 'C', 'A', 'D', 'D', 'E', 'E', 'E', 'E', 'E', 'A'],
    ...         'total': [1, 4, 3, 2, 4, 1, 2, 3, 5, 1, 2, 2, 2, 3, 4],
    ...         'bla': [10, 40, 30, 20, 40, 10, 20, 30, 50, 10, 20, 20, 20, 30, 40]
    ...     }
    ... )

If you compute it like @HYRY, you obtain:

    >>> print(df.groupby(['client']).agg(lambda x: x.value_counts().index[0]))
            total  bla
    client
    A           4   30
    B           4   40
    C           1   10
    D           3   30
    E           2   20

Which is clearly wrong (see the A value, which should be 1 and not 4), because it can't handle unique values: every value of total in group A occurs exactly once, so they all tie, and value_counts().index[0] simply returns whichever one is listed first. scipy.stats.mode returns the smallest of the tied values. Thus, the other solution is correct:

    >>> import scipy.stats
    >>> print(df.groupby(['client']).agg(lambda x: scipy.stats.mode(x)[0][0]))
            total  bla
    client
    A           1   10
    B           4   40
    C           1   10
    D           3   30
    E           2   20
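Whether a group has a single mode at all can be checked directly. The following is a minimal sketch using the `client` and `total` columns from this answer; `count_top_ties` is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

df = pd.DataFrame({
    'client': ['A', 'B', 'A', 'B', 'B', 'C', 'A', 'D', 'D', 'E', 'E', 'E', 'E', 'E', 'A'],
    'total': [1, 4, 3, 2, 4, 1, 2, 3, 5, 1, 2, 2, 2, 3, 4],
})


def count_top_ties(values):
    """Return how many distinct values share the highest frequency."""
    counts = values.value_counts()
    return int((counts == counts.max()).sum())


# A result of 1 means the group has a unique mode; anything larger is a tie.
ties = {
    client: count_top_ties(group)
    for client, group in df.groupby('client')['total']
}
print(ties)
```

Groups with more than one tied value are exactly the ones where different mode implementations may legitimately disagree.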

User · Answer

The two top answers here suggest:

    df.groupby(cols).agg(lambda x: x.value_counts().index[0])

or, preferably,

    df.groupby(cols).agg(pd.Series.mode)

However, both of these fail in simple edge cases, as demonstrated here:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'client_id': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
        'date': ['2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01',
                 '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01'],
        'location': ['NY', 'NY', 'LA', 'LA', 'DC', 'DC', 'LA', np.nan]
    })

The first:

    df.groupby(['client_id', 'date']).agg(lambda x: x.value_counts().index[0])

yields IndexError (because of the empty Series returned by group C). The second:

    df.groupby(['client_id', 'date']).agg(pd.Series.mode)

returns ValueError: Function does not reduce, since the first group returns a list of two (there are two modes). (As documented, if the first group returned a single mode, this would work!)

Two possible solutions for this case are:

    import scipy.stats
    df.groupby(['client_id', 'date']).agg(lambda x: scipy.stats.mode(x)[0])

And the solution given to me by cs95 in the comments:

    def foo(x):
        m = pd.Series.mode(x)
        return m.values[0] if not m.empty else np.nan

    df.groupby(['client_id', 'date']).agg(foo)

However, all of these are slow and not suited for large datasets. A solution I ended up using, which a) can deal with these cases and b) is much, much faster, is a lightly modified version of abw33's answer (which should be higher):

    def get_mode_per_column(dataframe, group_cols, col):
        return (dataframe.fillna(-1)  # NaN placeholder to keep group
                .groupby(group_cols + [col])
                .size()
                .to_frame('count')
                .reset_index()
                .sort_values('count', ascending=False)
                .drop_duplicates(subset=group_cols)
                .drop(columns=['count'])
                .sort_values(group_cols)
                .replace(-1, np.nan))  # restore NaNs

    group_cols = ['client_id', 'date']
    non_grp_cols = list(set(df).difference(group_cols))
    output_df = get_mode_per_column(df, group_cols, non_grp_cols[0]).set_index(group_cols)
    for col in non_grp_cols[1:]:
        output_df[col] = get_mode_per_column(df, group_cols, col)[col].values

Essentially, the method works on one column at a time and outputs a df, so instead of concat, which is intensive, you treat the first result as a df and then iteratively add each output array (values.flatten()) as a column in the df.
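On pandas versions that support the `dropna` argument of groupby (1.1 and later), the -1 placeholder can be avoided. The following is a minimal sketch under that assumption, not the author's code: `mode_keep_nan_groups` is a hypothetical name, and as with the placeholder approach, NaN is counted like any other value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'client_id': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'date': ['2019-01-01'] * 8,
    'location': ['NY', 'NY', 'LA', 'LA', 'DC', 'DC', 'LA', np.nan],
})


def mode_keep_nan_groups(dataframe, group_cols, col):
    """Most common value of `col` per group; all-NaN groups stay as NaN."""
    return (
        dataframe.groupby(group_cols + [col], dropna=False)  # keep NaN keys
        .size()
        .reset_index(name='count')
        .sort_values('count', ascending=False, kind='stable')
        .drop_duplicates(subset=group_cols)  # first row = highest count
        .drop(columns='count')
        .sort_values(group_cols)
        .reset_index(drop=True)
    )


result = mode_keep_nan_groups(df, ['client_id', 'date'], 'location')
print(result)
```

Group A has two tied modes (NY and LA), so either may be returned. Group C keeps its row with a NaN location instead of being dropped.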

User · Answer

If you want another approach that does not depend on value_counts or scipy.stats, you can use the Counter collection:

    from collections import Counter
    get_most_common = lambda values: max(Counter(values).items(), key=lambda x: x[1])[0]

Which can be applied to the above example like this:

    src = pd.DataFrame({'Country': ['USA', 'USA', 'Russia', 'USA'],
                        'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                        'Short name': ['NY', 'New', 'Spb', 'NY']})

    src.groupby(['Country', 'City']).agg(get_most_common)

User · Answer

You can use value_counts() to get a count series, and get the first row:

    import pandas as pd

    source = pd.DataFrame({'Country': ['USA', 'USA', 'Russia', 'USA'],
                           'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                           'Short name': ['NY', 'New', 'Spb', 'NY']})

    source.groupby(['Country', 'City']).agg(lambda x: x.value_counts().index[0])

In case you are wondering about performing other agg functions in .agg(), try this:

    # Let's add a new col, account
    source['account'] = [1, 2, 3, 3]

    source.groupby(['Country', 'City']).agg(
        mod=('Short name', lambda x: x.value_counts().index[0]),
        avg=('account', 'mean'))

User · Answer

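For agg, the lambda function gets a Series, which does not have a 'Short name' attribute.

stats.mode returns a tuple of two arrays, so you have to take the first element of the first array in this tuple.

With these two simple changes:

    source.groupby(['Country', 'City']).agg(lambda x: stats.mode(x)[0][0])

returns

                             Short name
    Country City
    Russia  Sankt-Petersburg        Spb
    USA     New-York                 NY

Note that this applies to older SciPy releases: from SciPy 1.11 on, stats.mode no longer accepts non-numeric data and returns scalars rather than arrays for 1-D input.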
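What agg hands to the function can be inspected directly. The following is a minimal sketch that records the type of each argument. `most_common` and `seen_types` are illustrative names, and value_counts is used instead of stats.mode so that the example does not depend on the SciPy version.

```python
import pandas as pd

source = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA'],
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY'],
})

seen_types = []


def most_common(column):
    # Record what agg actually passes in: one column of one group.
    seen_types.append(type(column).__name__)
    return column.value_counts().index[0]


result = source.groupby(['Country', 'City']).agg(most_common)
print(set(seen_types))
print(result)
```

Because each call receives a single column as a Series, indexing it with the column label (as in the question) fails with a KeyError.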

User · Answer

If your data has no NaN values, using Counter is much, much faster than pd.Series.mode or pd.Series.value_counts()[0]:

    from collections import Counter

    def get_most_common(srs):
        x = list(srs)
        my_counter = Counter(x)
        return my_counter.most_common(1)[0][0]

    df.groupby(col).agg(get_most_common)

should work. This will fail when you have NaN values, as each NaN can be counted separately.
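If NaN values are present, one option is to drop them before counting. The following is a minimal sketch of that idea, not part of the answer above: `get_most_common_skipna` is a hypothetical name and the small frame is made-up illustration data.

```python
from collections import Counter

import numpy as np
import pandas as pd


def get_most_common_skipna(srs):
    """Most common non-null value in `srs`, or NaN if there is none."""
    counter = Counter(srs.dropna())
    if not counter:
        # The group contained only NaN values.
        return np.nan
    return counter.most_common(1)[0][0]


# Illustrative data: group 'c' contains nothing but NaN.
df = pd.DataFrame({
    'key': ['a', 'a', 'a', 'b', 'b', 'c'],
    'value': ['x', 'x', np.nan, np.nan, 'y', np.nan],
})

result = df.groupby('key')['value'].agg(get_most_common_skipna)
print(result)
```

The extra dropna call costs some of the speed advantage, so this is worth measuring on real data before relying on it.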

User · Answer

A slightly clumsier but faster approach for larger datasets involves getting the counts for a column of interest, sorting the counts from highest to lowest, and then de-duplicating on a subset to retain only the largest cases. The code example follows:

    >>> import pandas as pd
    >>> source = pd.DataFrame(
    ...     {
    ...         'Country': ['USA', 'USA', 'Russia', 'USA'],
    ...         'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
    ...         'Short name': ['NY', 'New', 'Spb', 'NY']
    ...     }
    ... )
    >>> grouped_df = (source
    ...     .groupby(['Country', 'City', 'Short name'])[['Short name']]
    ...     .count()
    ...     .rename(columns={'Short name': 'count'})
    ...     .reset_index()
    ...     .sort_values('count', ascending=False)
    ...     .drop_duplicates(subset=['Country', 'City'])
    ...     .drop('count', axis=1))
    >>> print(grouped_df)
      Country              City Short name
    1     USA          New-York         NY
    0  Russia  Sankt-Petersburg        Spb

User · Answer

A little late to the game here, but I was running into some performance issues with HYRY's solution, so I had to come up with another one.

It works by finding the frequency of each key-value pair, and then, for each key, only keeping the value that appears with it most often.

There's also an additional solution that supports multiple modes.

On a scale test that's representative of the data I'm working with, this reduced runtime from 37.4s to 0.5s!

Here's the code for the solution, some example usage, and the scale test:

    import numpy as np
    import pandas as pd
    import random
    import time

    test_input = pd.DataFrame(columns=['key', 'value'],
                              data=[[1, 'A'],
                                    [1, 'B'],
                                    [1, 'B'],
                                    [1, np.nan],
                                    [2, np.nan],
                                    [3, 'C'],
                                    [3, 'C'],
                                    [3, 'D'],
                                    [3, 'D']])

    def mode(df, key_cols, value_col, count_col):
        '''
        Pandas does not provide a `mode` aggregation function
        for its `GroupBy` objects. This function is meant to fill
        that gap, though the semantics are not exactly the same.

        The input is a DataFrame with the columns `key_cols`
        that you would like to group on, and the column
        `value_col` for which you would like to obtain the mode.

        The output is a DataFrame with a record per group that has at least one mode
        (null values are not counted). The `key_cols` are included as columns, `value_col`
        contains a mode (ties are broken arbitrarily and deterministically) for each
        group, and `count_col` indicates how many times each mode appeared in its group.
        '''
        return (df.groupby(key_cols + [value_col]).size()
                  .to_frame(count_col).reset_index()
                  .sort_values(count_col, ascending=False)
                  .drop_duplicates(subset=key_cols))

    def modes(df, key_cols, value_col, count_col):
        '''
        Pandas does not provide a `mode` aggregation function
        for its `GroupBy` objects. This function is meant to fill
        that gap, though the semantics are not exactly the same.

        The input is a DataFrame with the columns `key_cols`
        that you would like to group on, and the column
        `value_col` for which you would like to obtain the modes.

        The output is a DataFrame with a record per group that has at least
        one mode (null values are not counted). The `key_cols` are included as
        columns, `value_col` contains lists indicating the modes for each group,
        and `count_col` indicates how many times each mode appeared in its group.
        '''
        return (df.groupby(key_cols + [value_col]).size()
                  .to_frame(count_col).reset_index()
                  .groupby(key_cols + [count_col])[value_col].unique()
                  .to_frame().reset_index()
                  .sort_values(count_col, ascending=False)
                  .drop_duplicates(subset=key_cols))

    print(test_input)
    print(mode(test_input, ['key'], 'value', 'count'))
    print(modes(test_input, ['key'], 'value', 'count'))

    scale_test_data = [[random.randint(1, 100000),
                        str(random.randint(123456789001, 123456789100))]
                       for i in range(1000000)]
    scale_test_input = pd.DataFrame(columns=['key', 'value'],
                                    data=scale_test_data)

    start = time.time()
    mode(scale_test_input, ['key'], 'value', 'count')
    print(time.time() - start)

    start = time.time()
    modes(scale_test_input, ['key'], 'value', 'count')
    print(time.time() - start)

    start = time.time()
    scale_test_input.groupby(['key']).agg(lambda x: x.value_counts().index[0])
    print(time.time() - start)

Running this code will print something like:

       key value
    0    1     A
    1    1     B
    2    1     B
    3    1   NaN
    4    2   NaN
    5    3     C
    6    3     C
    7    3     D
    8    3     D
       key value  count
    1    1     B      2
    2    3     C      2
       key  count   value
    1    1      2     [B]
    2    3      2  [C, D]
    0.489614009857
    9.19386196136
    37.4375009537

Hope this helps!
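Since ties are broken arbitrarily, the fast `mode` function cannot be compared value by value with other approaches, but the counts it reports can be checked. The following is a minimal sketch of such a sanity check on small random data. The seed, data sizes and the names `fast`, `reference` and `mismatches` are choices made for this illustration, and `mode` is repeated without its docstring so that the block runs on its own.

```python
import random

import pandas as pd


def mode(df, key_cols, value_col, count_col):
    # Same logic as the `mode` function above.
    return (df.groupby(key_cols + [value_col]).size()
              .to_frame(count_col).reset_index()
              .sort_values(count_col, ascending=False)
              .drop_duplicates(subset=key_cols))


random.seed(0)
data = [[random.randint(1, 20), str(random.randint(1, 5))] for _ in range(500)]
frame = pd.DataFrame(data, columns=['key', 'value'])

# Count reported by the fast approach for the chosen value of each key.
fast = mode(frame, ['key'], 'value', 'count').set_index('key')['count'].sort_index()

# Reference: the highest frequency of any value within each key.
reference = frame.groupby(['key', 'value']).size().groupby(level='key').max().sort_index()

mismatches = int((fast != reference).sum())
print(mismatches)
```

A result of zero mismatches means that for every key the returned value really is one of the most frequent ones.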
