List Highest Correlation Pairs from a Large Correlation Matrix in Pandas

Question

How do you find the top correlations in a correlation matrix with Pandas? There are many answers on how to do this with R (Show correlations as an ordered list, not as a large matrix or Efficient way to get highly correlated pairs from large data set in Python or R), but I am wondering how to do it with pandas. In my case the matrix is 4460x4460, so I can't do it visually.

User · Answer

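This is improved code from @MiFi. It orders the pairs by absolute value but keeps the sign of negative correlations:

    import numpy as np
    import pandas as pd

    def top_correlation(df, n):
        corr_matrix = df.corr()
        # keep the strict upper triangle (k=1) so each pair appears once;
        # np.bool was removed from NumPy, so the builtin bool is used here
        correlation = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
                       .stack()
                       .sort_values(ascending=False))
        correlation = pd.DataFrame(correlation).reset_index()
        correlation.columns = ["Variable_1", "Variable_2", "Correlacion"]
        # reorder by absolute value, keeping the signed value in the column
        correlation = correlation.reindex(correlation.Correlacion.abs().sort_values(ascending=False).index).reset_index().drop(["index"], axis=1)
        return correlation.head(n)

    top_correlation(ANYDATA, 10)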

User · Answer

A solution in a few lines, without redundant pairs of variables:

    corr_matrix = df.corr().abs()

    # the matrix is symmetric, so we need to extract the upper triangle without the diagonal (k = 1)
    sol = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
                      .stack()
                      .sort_values(ascending=False))

    # the first element of the sol Series is the pair with the biggest correlation

Then you can iterate through the names of the variable pairs (which are pandas Series multi-indexes) and their values like this:

    for index, value in sol.items():
        pass  # do some stuff

User · Answer

I was trying some of the solutions here, but then I came up with my own. I hope it is useful for the next person, so I am sharing it here:

    def sort_correlation_matrix(correlation_matrix):
        cor = correlation_matrix.abs()
        top_col = cor[cor.columns[0]][1:]
        top_col = top_col.sort_values(ascending=False)
        ordered_columns = [cor.columns[0]] + top_col.index.tolist()
        return correlation_matrix[ordered_columns].reindex(ordered_columns)

User · Answer

Combining most of the answers above into a short snippet:

    def top_entries(df):
        mat = df.corr().abs()

        # Remove duplicate and identity entries
        mat.loc[:, :] = np.tril(mat.values, k=-1)
        mat = mat[mat > 0]

        # Unstack, sort descending, and reset the index, so features are in columns
        # instead of indexes (allowing e.g. a pretty print in Jupyter).
        # Also rename the columns for good measure.
        return (mat.unstack()
                 .sort_values(ascending=False)
                 .reset_index()
                 .rename(columns={
                     "level_0": "feature_a",
                     "level_1": "feature_b",
                     0: "correlation"
                 }))

User · Answer

I liked Addison Klinke's post the most, as being the simplest, but used Wojciech Moszczynsk's suggestion for filtering and charting, and extended the filter to avoid absolute values. So, given a large correlation matrix: filter it, chart it, and then flatten it.

Created, Filtered and Charted

    import matplotlib.pyplot as plt
    import seaborn as sn

    dfCorr = df.corr()
    filteredDf = dfCorr[((dfCorr >= .5) | (dfCorr <= -.5)) & (dfCorr != 1.000)]
    plt.figure(figsize=(30, 10))
    sn.heatmap(filteredDf, annot=True, cmap="Reds")
    plt.show()

Function

In the end, I created a small function to create the correlation matrix, filter it, and then flatten it. As an idea, it could easily be extended, e.g., asymmetric upper and lower bounds, etc.

    def corrFilter(x: pd.DataFrame, bound: float):
        xCorr = x.corr()
        xFiltered = xCorr[((xCorr >= bound) | (xCorr <= -bound)) & (xCorr != 1.000)]
        xFlattened = xFiltered.unstack().sort_values().drop_duplicates()
        return xFlattened

    corrFilter(df, .7)
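The asymmetric-bounds idea mentioned above could look roughly like the sketch below. The function name `corr_filter_bounds` and its `lower`/`upper` parameters are hypothetical, not part of the original answer. It also swaps `drop_duplicates()` for an upper-triangle mask, which is an assumption on my part about how mirrored pairs are best removed. The sample data is the small `x1`..`x4` frame used elsewhere in this thread.

```python
import numpy as np
import pandas as pd


def corr_filter_bounds(x, lower=-0.7, upper=0.7):
    """Flatten correlations that are <= lower or >= upper, one row per pair.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    corr = x.corr()
    # The strict upper triangle removes self-correlations and mirrored pairs.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Independent thresholds for negative and positive correlations.
    keep = (pairs <= lower) | (pairs >= upper)
    return pairs[keep].sort_values()


df = pd.DataFrame({
    "x1": [1, 4, 4, 5, 6],
    "x2": [0, 0, 8, 2, 4],
    "x3": [2, 8, 8, 10, 12],
    "x4": [-1, -4, -4, -4, -5],
})
result = corr_filter_bounds(df, lower=-0.4, upper=0.9)
print(result)
```

Because the mask is positional, a pair is never lost just because its value happens to equal another pair's value.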

User · Answer

@HYRY's answer is perfect. I am just building on that answer by adding a bit more logic to avoid duplicate and self correlations, and to sort properly:

    import pandas as pd
    d = {'x1': [1, 4, 4, 5, 6],
         'x2': [0, 0, 8, 2, 4],
         'x3': [2, 8, 8, 10, 12],
         'x4': [-1, -4, -4, -4, -5]}
    df = pd.DataFrame(data=d)
    print("Data Frame")
    print(df)
    print()
    print("Correlation Matrix")
    print(df.corr())
    print()

    def get_redundant_pairs(df):
        '''Get diagonal and lower triangular pairs of correlation matrix'''
        pairs_to_drop = set()
        cols = df.columns
        for i in range(0, df.shape[1]):
            for j in range(0, i+1):
                pairs_to_drop.add((cols[i], cols[j]))
        return pairs_to_drop

    def get_top_abs_correlations(df, n=5):
        au_corr = df.corr().abs().unstack()
        labels_to_drop = get_redundant_pairs(df)
        au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
        return au_corr[0:n]

    print("Top Absolute Correlations")
    print(get_top_abs_correlations(df, 3))

That gives the following output:

    Data Frame
       x1  x2  x3  x4
    0   1   0   2  -1
    1   4   0   8  -4
    2   4   8   8  -4
    3   5   2  10  -4
    4   6   4  12  -5

    Correlation Matrix
              x1        x2        x3        x4
    x1  1.000000  0.399298  1.000000 -0.969248
    x2  0.399298  1.000000  0.399298 -0.472866
    x3  1.000000  0.399298  1.000000 -0.969248
    x4 -0.969248 -0.472866 -0.969248  1.000000

    Top Absolute Correlations
    x1  x3    1.000000
    x3  x4    0.969248
    x1  x4    0.969248
    dtype: float64

User · Answer

Use the code below to view the correlations in descending order.

    # See the correlations in descending order

    corr = df.corr()  # df is the pandas dataframe
    c1 = corr.abs().unstack()
    c1.sort_values(ascending=False)

User · Answer

Use itertools.combinations to get all unique correlations from pandas' own correlation matrix .corr(), generate a list of lists, and feed it back into a DataFrame in order to use .sort_values. Set ascending = True to display the lowest correlations on top.

corrank takes a DataFrame as its argument because it requires .corr().

    import itertools
    import pandas as pd

    def corrank(X: pd.DataFrame):
        corr = X.corr()  # compute the matrix once instead of once per pair
        df = pd.DataFrame([[(i, j), corr.loc[i, j]] for i, j in itertools.combinations(corr, 2)], columns=['pairs', 'corr'])
        print(df.sort_values(by='corr', ascending=False))

    corrank(X)  # prints a descending list of correlation pairs (max on top)

User · Answer

You can use DataFrame.values to get a NumPy array of the data and then use NumPy functions such as argsort() to get the most correlated pairs.

But if you want to do this in pandas, you can unstack and sort the DataFrame:

    import pandas as pd
    import numpy as np

    shape = (50, 4460)

    data = np.random.normal(size=shape)

    data[:, 1000] += data[:, 2000]

    df = pd.DataFrame(data)

    c = df.corr().abs()

    s = c.unstack()
    so = s.sort_values(kind="quicksort")

    print(so[-4470:-4460])

Here is the output:

    2192  1522    0.636198
    1522  2192    0.636198
    3677  2027    0.641817
    2027  3677    0.641817
    242   130     0.646760
    130   242     0.646760
    1171  2733    0.670048
    2733  1171    0.670048
    1000  2000    0.742340
    2000  1000    0.742340
    dtype: float64
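The NumPy route mentioned at the start of this answer is not shown in the thread, so here is a minimal sketch of one way it might look. The helper name `top_pairs_numpy` is hypothetical, and using `np.triu_indices` to visit each pair once is my assumption rather than something the answer specifies. The sample data is the small `x1`..`x4` frame used elsewhere in this thread.

```python
import numpy as np
import pandas as pd


def top_pairs_numpy(df, n=5):
    """Return the n column pairs with the largest absolute correlation.

    Hypothetical helper: works on the raw NumPy array instead of unstacking.
    """
    # rowvar=False treats each column as one variable.
    corr = np.abs(np.corrcoef(df.to_numpy(), rowvar=False))
    # Indices of the strict upper triangle: each unordered pair appears once.
    rows, cols = np.triu_indices(corr.shape[0], k=1)
    values = corr[rows, cols]
    # argsort is ascending, so reverse it and keep the first n positions.
    order = np.argsort(values)[::-1][:n]
    return [(df.columns[rows[i]], df.columns[cols[i]], values[i]) for i in order]


df = pd.DataFrame({
    "x1": [1, 4, 4, 5, 6],
    "x2": [0, 0, 8, 2, 4],
    "x3": [2, 8, 8, 10, 12],
    "x4": [-1, -4, -4, -4, -5],
})
top = top_pairs_numpy(df, n=3)
for a, b, value in top:
    print(a, b, round(value, 6))
```

This avoids building the doubled, mirrored Series that `unstack()` produces, which may matter for a matrix as large as 4460x4460.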

User · Answer

I didn't want to unstack or over-complicate this issue, since I just wanted to drop some highly correlated features as part of a feature selection phase.

So I ended up with the following simplified solution:

    # map features to their absolute correlation values
    corr = features.corr().abs()

    # set equality (self correlation) as zero
    corr[corr == 1] = 0

    # of each feature, find the max correlation
    # and sort the resulting array in descending order
    corr_cols = corr.max().sort_values(ascending=False)

    # display the highly correlated features
    display(corr_cols[corr_cols > 0.8])

In this case, if you want to drop correlated features, you may map through the filtered corr_cols array and remove the odd-indexed (or even-indexed) ones.
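For the dropping step, a different technique from the odd/even indexing described above is to scan the upper triangle and drop every column that correlates strongly with an earlier one. The sketch below is an assumption about how that could be done, not the answer author's code. The name `drop_highly_correlated` and the `threshold` parameter are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(features, threshold=0.8):
    """Drop one column from every pair whose |correlation| exceeds threshold.

    Hypothetical helper: keeps the earlier column of each correlated pair.
    """
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # A column is dropped if it correlates strongly with any earlier column.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop), to_drop


features = pd.DataFrame({
    "x1": [1, 4, 4, 5, 6],
    "x2": [0, 0, 8, 2, 4],
    "x3": [2, 8, 8, 10, 12],
    "x4": [-1, -4, -4, -4, -5],
})
reduced, dropped = drop_highly_correlated(features, threshold=0.8)
print(dropped)
print(list(reduced.columns))
```

Which member of a pair gets dropped depends only on column order here, so reorder the columns first if some features should be preferred.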

User · Answer

Combining some features of @HYRY's and @arun's answers, you can print the top correlations for dataframe df in a single line using:

    df.corr().unstack().sort_values().drop_duplicates()

Note: the one downside is that if you have 1.0 correlations that are not one variable to itself, the drop_duplicates() addition would remove them.
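To make that downside concrete, here is a small sketch that sets the one-liner beside a mask-based alternative. The data is made up so that `x3` is exactly twice `x1`. Whether `drop_duplicates()` actually discards the `(x1, x3)` pair depends on the floating-point values comparing exactly equal to the diagonal, so treat the first printout as illustrative. The mask approach is an assumption borrowed from other answers in this thread.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x1": [1, 4, 4, 5, 6],
    "x2": [0, 0, 8, 2, 4],
    "x3": [2, 8, 8, 10, 12],  # exactly 2 * x1, so corr(x1, x3) is 1 up to rounding
})

# One-liner: deduplicates by value, so equal correlations collapse into one row.
one_liner = df.corr().unstack().sort_values().drop_duplicates()

# Mask-based alternative: deduplicates by position (strict upper triangle).
corr = df.corr()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
masked = corr.where(mask).stack().dropna().sort_values(ascending=False)

print(one_liner)
print(masked)
```

The masked version always returns one row per unordered pair, regardless of how many pairs share the same value.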

User · Answer

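Lots of good answers here. The easiest way I found was a combination of some of the answers above.

    # corr is the correlation matrix, e.g. corr = df.corr()
    corr = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    # unstack() returns a Series, and Series.sort_values takes no `by` argument
    corr = corr.unstack() \
        .sort_values(ascending=False) \
        .dropna()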

User · Answer

The following function should do the trick. This implementation:

Removes self correlations
Removes duplicates
Enables the selection of the top N highest correlated features

It is also configurable, so that you can keep both the self correlations and the duplicates. You can also report as many feature pairs as you wish.

    def get_feature_correlation(df, top_n=None, corr_method='spearman',
                                remove_duplicates=True, remove_self_correlations=True):
        """
        Compute the feature correlation and sort feature pairs based on their correlation

        :param df: The dataframe with the predictor variables
        :type df: pandas.core.frame.DataFrame
        :param top_n: Top N feature pairs to be reported (if None, all of the pairs will be returned)
        :param corr_method: Correlation computation method
        :type corr_method: str
        :param remove_duplicates: Indicates whether duplicate features must be removed
        :type remove_duplicates: bool
        :param remove_self_correlations: Indicates whether self correlations will be removed
        :type remove_self_correlations: bool

        :return: pandas.core.frame.DataFrame
        """
        corr_matrix_abs = df.corr(method=corr_method).abs()
        corr_matrix_abs_us = corr_matrix_abs.unstack()
        sorted_correlated_features = corr_matrix_abs_us \
            .sort_values(kind="quicksort", ascending=False) \
            .reset_index()

        # Remove comparisons of the same feature
        if remove_self_correlations:
            sorted_correlated_features = sorted_correlated_features[
                (sorted_correlated_features.level_0 != sorted_correlated_features.level_1)
            ]

        # Remove duplicates
        if remove_duplicates:
            sorted_correlated_features = sorted_correlated_features.iloc[:-2:2]

        # Create meaningful names for the columns
        sorted_correlated_features.columns = ['Feature 1', 'Feature 2', 'Correlation (abs)']

        if top_n:
            return sorted_correlated_features[:top_n]

        return sorted_correlated_features

User · Answer

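You can do this graphically with this simple code, substituting your own data.

    import matplotlib.pyplot as plt
    import seaborn as sns

    corr = df.corr()

    # keep only correlations of at least 0.9; everything else becomes NaN
    kot = corr[corr >= .9]
    plt.figure(figsize=(12, 8))
    sns.heatmap(kot, cmap="Greens")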
