Comparing two dataframes and getting the differences

Question

I have two dataframes. Examples:

df1:

Date       Fruit   Num   Color
2013-11-24 Banana  22.1  Yellow
2013-11-24 Orange   8.6  Orange
2013-11-24 Apple    7.6  Green
2013-11-24 Celery  10.2  Green

df2:

Date       Fruit   Num   Color
2013-11-24 Banana  22.1  Yellow
2013-11-24 Orange   8.6  Orange
2013-11-24 Apple    7.6  Green
2013-11-24 Celery  10.2  Green
2013-11-25 Apple   22.1  Red
2013-11-25 Orange   8.6  Orange

Each dataframe has the Date as an index. Both dataframes have the same structure.

What I want to do is compare these two dataframes and find which rows are in df2 that aren't in df1. I want to compare the date (index) and the first column (Banana, Apple, etc.) to see if they exist in df2 vs df1.

I have tried the following:

Outputting difference in two Pandas dataframes side by side - highlighting the difference
Comparing two pandas dataframes for differences

For the first approach I get this error: "Exception: Can only compare identically-labeled DataFrame objects". I have tried removing the Date as index but get the same error.

On the third approach, I get the assert to return False but cannot figure out how to actually see the different rows.

Any pointers would be welcome.

User · Answer

One important detail is that your data has duplicate index values, so before any straightforward comparison we need to make the index unique with df.reset_index(); after that we can select rows based on conditions. Since the index is already defined in your case, I assume you would like to keep it, so here is a one-line solution:

df2.reset_index()[~df2.reset_index().isin(df1.reset_index())].dropna().set_index('Date')

Since readability is a goal from a Pythonic perspective, we can break this up a little:

# keep the index name; if the index has no name, fall back to the default name 'index'
index_name = df1.index.name if df1.index.name else 'index'

# setting the index to become unique
df1 = df1.reset_index()
df2 = df2.reset_index()

# getting the differences to a Dataframe
df_diff = df2[~df2.isin(df1)].dropna().set_index(index_name)
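
One caveat may be worth checking before relying on this. As far as I understand it, DataFrame.isin with a DataFrame argument compares cell by cell at matching index and column labels, so after reset_index() the comparison is positional, and dropna() then removes every row in which at least one cell matched. The sketch below uses the question's data; the helper name positional_diff and the "prepended" variant of df2 are my own assumptions for illustration. It suggests the approach works when the new rows are appended, but not when they come first.

```python
import pandas as pd

cols = ["Date", "Fruit", "Num", "Color"]
old_rows = [
    ("2013-11-24", "Banana", 22.1, "Yellow"),
    ("2013-11-24", "Orange", 8.6, "Orange"),
    ("2013-11-24", "Apple", 7.6, "Green"),
    ("2013-11-24", "Celery", 10.2, "Green"),
]
new_rows = [
    ("2013-11-25", "Apple", 22.1, "Red"),
    ("2013-11-25", "Orange", 8.6, "Orange"),
]


def positional_diff(left, right):
    """Rows of `right` in which no cell equals the same-position cell of `left`."""
    left, right = left.reset_index(), right.reset_index()
    return right[~right.isin(left)].dropna().set_index("Date")


df1 = pd.DataFrame(old_rows, columns=cols).set_index("Date")

# New rows appended at the end: the two new rows are reported.
df2_appended = pd.DataFrame(old_rows + new_rows, columns=cols).set_index("Date")
appended_diff = positional_diff(df1, df2_appended)

# Same content with the new rows first: positions no longer line up,
# so rows that already exist in df1 are reported instead.
df2_prepended = pd.DataFrame(new_rows + old_rows, columns=cols).set_index("Date")
prepended_diff = positional_diff(df1, df2_prepended)

print(appended_diff)
print(prepended_diff)
```

If the row order of the two frames is not guaranteed to agree, a key-based comparison (for example on Date and Fruit) is probably the safer choice.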

User · Answer

Passing the dataframes to concat in a dictionary results in a multi-index dataframe from which you can easily delete the duplicates, which results in a multi-index dataframe with the differences between the dataframes:

import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
import pandas as pd

DF1 = StringIO("""Date       Fruit  Num  Color
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange  8.6 Orange
2013-11-24 Apple   7.6 Green
2013-11-24 Celery 10.2 Green
""")
DF2 = StringIO("""Date       Fruit  Num  Color
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange  8.6 Orange
2013-11-24 Apple   7.6 Green
2013-11-24 Celery 10.2 Green
2013-11-25 Apple  22.1 Red
2013-11-25 Orange  8.6 Orange
""")

df1 = pd.read_table(DF1, sep=r'\s+')
df2 = pd.read_table(DF2, sep=r'\s+')

dfs_dictionary = {'DF1': df1, 'DF2': df2}
df = pd.concat(dfs_dictionary)
df.drop_duplicates(keep=False)

Result:

             Date   Fruit   Num   Color
DF2 4  2013-11-25   Apple  22.1     Red
    5  2013-11-25  Orange   8.6  Orange

User · Answer

Given:

df1 = pd.DataFrame({
    'Date': ['2013-11-24', '2013-11-24', '2013-11-24', '2013-11-24'],
    'Fruit': ['Banana', 'Orange', 'Apple', 'Celery'],
    'Num': [22.1, 8.6, 7.6, 10.2],
    'Color': ['Yellow', 'Orange', 'Green', 'Green'],
})
df2 = pd.DataFrame({
    'Date': ['2013-11-24', '2013-11-24', '2013-11-24', '2013-11-24', '2013-11-25', '2013-11-25'],
    'Fruit': ['Banana', 'Orange', 'Apple', 'Celery', 'Apple', 'Orange'],
    'Num': [22.1, 8.6, 7.6, 1000, 22.1, 8.6],
    'Color': ['Yellow', 'Orange', 'Green', 'Green', 'Red', 'Orange'],
})

Find which rows are in df2 that aren't in df1 by Date and Fruit:

df_2notin1 = df2[~(df2['Date'].isin(df1['Date']) & df2['Fruit'].isin(df1['Fruit']))].dropna().reset_index(drop=True)

Output:

print('df_2notin1\n', df_2notin1)

    Color        Date   Fruit   Num
0     Red  2013-11-25   Apple  22.1
1  Orange  2013-11-25  Orange   8.6
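
Note that the mask above tests Date and Fruit independently, so a row whose date and fruit both occur somewhere in df1, but never together, would not be reported. A minimal sketch of a pair-wise check follows, assuming Date and Fruit are ordinary columns; the tiny frames and the names keys1, keys2 and pair_wise are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": ["2013-11-24", "2013-11-25"],
    "Fruit": ["Banana", "Apple"],
})
df2 = pd.DataFrame({
    "Date": ["2013-11-24", "2013-11-25", "2013-11-25"],
    "Fruit": ["Banana", "Apple", "Banana"],
})

# Column-by-column membership: (2013-11-25, Banana) slips through, because
# both the date and the fruit appear somewhere in df1.
column_wise = df2[~(df2["Date"].isin(df1["Date"]) & df2["Fruit"].isin(df1["Fruit"]))]

# Pair-wise membership: treat (Date, Fruit) as a single key.
keys1 = pd.MultiIndex.from_frame(df1[["Date", "Fruit"]])
keys2 = pd.MultiIndex.from_frame(df2[["Date", "Fruit"]])
pair_wise = df2[~keys2.isin(keys1)]

print(column_wise)
print(pair_wise)
```

With the question's data both variants return the same two rows, since the new rows carry a date that does not occur in df1 at all.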

User · Answer

Since pandas >= 1.1.0 we have DataFrame.compare and Series.compare.

Note: the method can only compare identically-labeled DataFrame objects; this means DataFrames with identical row and column labels.

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3],
                    'B': [4, 5, 6],
                    'C': [7, np.nan, 9]})

df2 = pd.DataFrame({'A': [1, 99, 3],
                    'B': [4, 5, 81],
                    'C': [7, 8, 9]})

df1:

   A  B    C
0  1  4  7.0
1  2  5  NaN
2  3  6  9.0

df2:

    A   B  C
0   1   4  7
1  99   5  8
2   3  81  9

df1.compare(df2)

     A          B          C
  self other self other self other
1  2.0  99.0  NaN   NaN  NaN   8.0
2  NaN   NaN  6.0  81.0  NaN   NaN
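
For frames of different length, as in the question, compare() raises instead of returning a result. One possible workaround, sketched here under the assumption that the row labels are meaningful keys shared by both frames, is to reindex the shorter frame to the longer one's index first; the two small frames are invented for the example.

```python
import pandas as pd

df1 = pd.DataFrame({"Fruit": ["Banana", "Orange"], "Num": [22.1, 8.6]})
df2 = pd.DataFrame({"Fruit": ["Banana", "Orange", "Apple"], "Num": [22.1, 8.6, 22.1]})

# Different row labels: compare() refuses to run.
try:
    df1.compare(df2)
    raised = False
except ValueError:
    raised = True

# Give df1 the same row labels as df2; rows missing from df1 become NaN.
aligned = df1.reindex(df2.index)
diff = aligned.compare(df2)

print(raised)
print(diff)
```

The rows that exist only in df2 then show up with NaN in the "self" columns. This is still a label-by-label comparison, so it does not help when the same record can sit under different labels in the two frames.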

User · Answer

I got this solution. Does this help you?

text = '''df1:
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green

df2:
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green
2013-11-25 Apple 22.1 Red
2013-11-25 Orange 8.6 Orange

argetz45
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 118.6 Orange
2013-11-24 Apple 74.6 Green
2013-11-24 Celery 10.2 Green
2013-11-25     Nuts    45.8 Brown
2013-11-25 Apple 22.1 Red
2013-11-25 Orange 8.6 Orange
2013-11-26   Pear 102.54    Pale
'''

from collections import OrderedDict
import re

# r matches one block: a name line (group 1) followed by its dated rows (group 2)
r = re.compile(r'([a-zA-Z\d]+).*\n'
               r'(20\d\d-[01]\d-[0123]\d.+\n?'
               r'(.+\n?)*)'
               r'(?=[ \n]*\Z'
               r'|'
               r'\n+[a-zA-Z\d]+.*\n'
               r'20\d\d-[01]\d-[0123]\d)')

# r2 matches one row: the whole line, then the date, then the fruit
r2 = re.compile(r'((20\d\d-[01]\d-[0123]\d) +(\S+).*)')

d = OrderedDict()
bef = []  # (date, fruit) pairs already seen in earlier blocks

for m in r.finditer(text):
    li = []
    for x in r2.findall(m.group(2)):
        if not any(x[1:3] == elbef for elbef in bef):
            bef.append(x[1:3])
            li.append(x[0])
    d[m.group(1)] = li

for name, lu in d.items():
    print('%s\n%s\n' % (name, '\n'.join(lu)))

Result:

df1
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green

df2
2013-11-25 Apple 22.1 Red
2013-11-25 Orange 8.6 Orange

argetz45
2013-11-25     Nuts    45.8 Brown
2013-11-26   Pear 102.54    Pale

User · Answer

Found a simple solution here: https://stackoverflow.com/a/47132808/9656339

pd.concat([df1, df2]).loc[df1.index.symmetric_difference(df2.index)]
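
A small runnable sketch of this one-liner, with Date as the index as in the question. The extra Pear row is a hypothetical addition of mine, included to illustrate that, as far as I can tell, only the index labels are compared here: a new fruit on a date that already exists in df1 is not reported.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"Fruit": ["Banana", "Orange"], "Num": [22.1, 8.6]},
    index=pd.Index(["2013-11-24", "2013-11-24"], name="Date"),
)
df2 = pd.DataFrame(
    {
        "Fruit": ["Banana", "Orange", "Pear", "Apple", "Orange"],
        "Num": [22.1, 8.6, 3.3, 22.1, 8.6],
    },
    index=pd.Index(["2013-11-24"] * 3 + ["2013-11-25"] * 2, name="Date"),
)

# Index labels that occur in exactly one of the two frames.
only_in_one = df1.index.symmetric_difference(df2.index)

# All rows carrying one of those labels.
result = pd.concat([df1, df2]).loc[only_in_one]

print(result)
```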

User · Answer

This works for me:

# Get all different values
df3 = pd.merge(df1, df2, how='outer', indicator='Exist')
df3 = df3.loc[df3['Exist'] != 'both']

# If you like to filter by a common ID
df3 = pd.merge(df1, df2, on="Fruit", how='outer', indicator='Exist')
df3 = df3.loc[df3['Exist'] != 'both']
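
The same indicator idea can be narrowed to what the question asks for, namely rows of df2 whose (Date, Fruit) pair is absent from df1. This is a sketch rather than the answer's own code; it assumes Date is an ordinary column (call reset_index() first if it is the index), and the names keys, marked and only_in_df2 are mine.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4,
    "Fruit": ["Banana", "Orange", "Apple", "Celery"],
    "Num": [22.1, 8.6, 7.6, 10.2],
    "Color": ["Yellow", "Orange", "Green", "Green"],
})
df2 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4 + ["2013-11-25"] * 2,
    "Fruit": ["Banana", "Orange", "Apple", "Celery", "Apple", "Orange"],
    "Num": [22.1, 8.6, 7.6, 10.2, 22.1, 8.6],
    "Color": ["Yellow", "Orange", "Green", "Green", "Red", "Orange"],
})

keys = ["Date", "Fruit"]

# Left-join df2 against df1's distinct key pairs; the indicator column
# records whether each df2 row found a partner in df1.
marked = df2.merge(df1[keys].drop_duplicates(), on=keys, how="left", indicator="Exist")

# "left_only" means the key pair exists in df2 but not in df1.
only_in_df2 = marked.loc[marked["Exist"] == "left_only"].drop(columns="Exist")

print(only_in_df2)
```

Joining only on the key columns keeps df2's other columns untouched, so no _x/_y suffixes appear in the result.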

User · Answer

Building on alko's answer that almost worked for me, except for the filtering step (where I get: ValueError: cannot reindex from a duplicate axis), here is the final solution I used:

# join the dataframes
united_data = pd.concat([data1, data2, data3, ...])
# group the data by the whole row to find duplicates
united_data_grouped = united_data.groupby(list(united_data.columns))
# detect the row indices of unique rows
uniq_data_idx = [x[0] for x in united_data_grouped.indices.values() if len(x) == 1]
# extract those unique values
uniq_data = united_data.iloc[uniq_data_idx]

User · Answer

There is a simpler solution that is faster and better, and if the numbers are different it can even give you the differences in quantity:

df1_i = df1.set_index(['Date', 'Fruit', 'Color'])
df2_i = df2.set_index(['Date', 'Fruit', 'Color'])
df_diff = df1_i.join(df2_i, how='outer', rsuffix='_').fillna(0)
df_diff = (df_diff['Num'] - df_diff['Num_'])

Here df_diff is a synopsis of the differences. You can even use it to find the differences in quantities.

Explanation: Similarly to comparing two lists, to do it efficiently we should first order them and then compare them (converting the lists to sets/hashing would also be fast; both are an incredible improvement over the simple O(N^2) double comparison loop).

Note: the following code produces the tables:

df1 = pd.DataFrame({
    'Date': ['2013-11-24', '2013-11-24', '2013-11-24', '2013-11-24'],
    'Fruit': ['Banana', 'Orange', 'Apple', 'Celery'],
    'Num': [22.1, 8.6, 7.6, 10.2],
    'Color': ['Yellow', 'Orange', 'Green', 'Green'],
})
df2 = pd.DataFrame({
    'Date': ['2013-11-24', '2013-11-24', '2013-11-24', '2013-11-24', '2013-11-25', '2013-11-25'],
    'Fruit': ['Banana', 'Orange', 'Apple', 'Celery', 'Apple', 'Orange'],
    'Num': [22.1, 8.6, 7.6, 10.2, 22.1, 8.6],
    'Color': ['Yellow', 'Orange', 'Green', 'Green', 'Red', 'Orange'],
})
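
Put together as one runnable sketch with the question's data (the names keys, joined and changed are my own): rows present in both frames come out as 0, and rows present only in df2 come out negative, because the value missing on the df1 side is filled with 0 before subtracting.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4,
    "Fruit": ["Banana", "Orange", "Apple", "Celery"],
    "Num": [22.1, 8.6, 7.6, 10.2],
    "Color": ["Yellow", "Orange", "Green", "Green"],
})
df2 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4 + ["2013-11-25"] * 2,
    "Fruit": ["Banana", "Orange", "Apple", "Celery", "Apple", "Orange"],
    "Num": [22.1, 8.6, 7.6, 10.2, 22.1, 8.6],
    "Color": ["Yellow", "Orange", "Green", "Green", "Red", "Orange"],
})

keys = ["Date", "Fruit", "Color"]

# Outer join on the key columns; df2's Num column is renamed to Num_.
joined = df1.set_index(keys).join(df2.set_index(keys), how="outer", rsuffix="_").fillna(0)

# df1 quantity minus df2 quantity for every key.
df_diff = joined["Num"] - joined["Num_"]

# Keep only the keys whose quantities differ.
changed = df_diff[df_diff != 0]

print(changed)
```

One thing to keep in mind: fillna(0) makes a missing row indistinguishable from a row whose quantity really is 0, so this reading only holds when 0 is not a legitimate value of Num.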

User · Answer

Hope this is useful to you.

df1 = pd.DataFrame({'date': ['0207', '0207'], 'col1': [1, 2]})
df2 = pd.DataFrame({'date': ['0207', '0207', '0208', '0208'], 'col1': [1, 2, 3, 4]})
print(f"df1(Before):\n{df1}\ndf2:\n{df2}")
"""
df1(Before):
   date  col1
0  0207     1
1  0207     2

df2:
   date  col1
0  0207     1
1  0207     2
2  0208     3
3  0208     4
"""

old_set = set(df1.index.values)
new_set = set(df2.index.values)
new_data_index = new_set - old_set  # index labels present only in df2
new_data_list = []
for idx in new_data_index:
    new_data_list.append(df2.loc[idx])

if len(new_data_list) > 0:
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df1 = pd.concat([df1, pd.DataFrame(new_data_list)])
print(f"df1(After):\n{df1}")
"""
df1(After):
   date  col1
0  0207     1
1  0207     2
2  0208     3
3  0208     4
"""

User · Answer

I tried this method, and it worked. I hope it can help too:

"""Identify differences between two pandas DataFrames"""
df1.sort_index(inplace=True)
df2.sort_index(inplace=True)
df_all = pd.concat([df1, df2], axis='columns', keys=['First', 'Second'])
df_final = df_all.swaplevel(axis='columns')[df1.columns[1:]]
df_final[df_final['change this to one of the columns']['First'] != df_final['change this to one of the columns']['Second']]

User · Answer

Updating and placing, somewhere it will be easier for others to find, ling's comment upon jur's response above:

df_diff = pd.concat([df1, df2]).drop_duplicates(keep=False)

Testing with these DataFrames:

# with import pandas as pd
df1 = pd.DataFrame({
    'Date': ['2013-11-24', '2013-11-24', '2013-11-24', '2013-11-24'],
    'Fruit': ['Banana', 'Orange', 'Apple', 'Celery'],
    'Num': [22.1, 8.6, 7.6, 10.2],
    'Color': ['Yellow', 'Orange', 'Green', 'Green'],
})

df2 = pd.DataFrame({
    'Date': ['2013-11-24', '2013-11-24', '2013-11-24', '2013-11-24', '2013-11-25', '2013-11-25'],
    'Fruit': ['Banana', 'Orange', 'Apple', 'Celery', 'Apple', 'Orange'],
    'Num': [22.1, 8.6, 7.6, 10.2, 22.1, 8.6],
    'Color': ['Yellow', 'Orange', 'Green', 'Green', 'Red', 'Orange'],
})

Results in this:

# for df1
         Date   Fruit   Num   Color
0  2013-11-24  Banana  22.1  Yellow
1  2013-11-24  Orange   8.6  Orange
2  2013-11-24   Apple   7.6   Green
3  2013-11-24  Celery  10.2   Green

# for df2
         Date   Fruit   Num   Color
0  2013-11-24  Banana  22.1  Yellow
1  2013-11-24  Orange   8.6  Orange
2  2013-11-24   Apple   7.6   Green
3  2013-11-24  Celery  10.2   Green
4  2013-11-25   Apple  22.1     Red
5  2013-11-25  Orange   8.6  Orange

# for df_diff
         Date   Fruit   Num   Color
4  2013-11-25   Apple  22.1     Red
5  2013-11-25  Orange   8.6  Orange
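
One edge case seems worth knowing about: drop_duplicates(keep=False) cannot tell a row shared by both frames from a row repeated inside one frame, so a new row that appears twice in df2 disappears as well. A minimal sketch with made-up data, where dropping duplicates within each frame first is one possible guard:

```python
import pandas as pd

df1 = pd.DataFrame({"Fruit": ["Banana"], "Num": [22.1]})
# "Apple" is new, but it is listed twice inside df2.
df2 = pd.DataFrame({"Fruit": ["Banana", "Apple", "Apple"], "Num": [22.1, 7.6, 7.6]})

# The repeated Apple rows cancel each other out, so nothing is reported.
naive = pd.concat([df1, df2]).drop_duplicates(keep=False)

# De-duplicate each frame on its own before comparing.
deduped = pd.concat([df1.drop_duplicates(), df2.drop_duplicates()]).drop_duplicates(keep=False)

print(naive)
print(deduped)
```

The result is also symmetric: rows that exist only in df1 are returned too, which may or may not be what you want.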

User · Answer

This approach, df1 != df2, works only for dataframes with identical rows and columns. In fact, all dataframe axes are compared with the _indexed_same method, and an exception is raised if differences are found, even in the order of columns/indices.

If I got you right, you want to find not changes, but the symmetric difference. For that, one approach might be to concatenate the dataframes:

>>> df = pd.concat([df1, df2])
>>> df = df.reset_index(drop=True)

group by

>>> df_gpby = df.groupby(list(df.columns))

get index of unique records

>>> idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]

filter

>>> df.reindex(idx)
         Date   Fruit   Num   Color
9  2013-11-25  Orange   8.6  Orange
8  2013-11-25   Apple  22.1     Red
