Use .corr to get the correlation between two columns

Question

I have a pandas dataframe, Top15. I create a column that estimates the number of citable documents per person:

    Top15['PopEst'] = Top15['Energy Supply'] / Top15['Energy Supply per Capita']
    Top15['Citable docs per Capita'] = Top15['Citable documents'] / Top15['PopEst']

I want to know the correlation between the number of citable documents per capita and the energy supply per capita. So I use the .corr() method (Pearson's correlation):

    data = Top15[['Citable docs per Capita', 'Energy Supply per Capita']]
    correlation = data.corr(method='pearson')

I want to return a single number, but the result is a table instead.

User · Answer

I solved this problem by changing the data type. If you look, 'Energy Supply per Capita' is a numerical type while 'Citable docs per Capita' is an object type. I converted the column to float using astype. I had the same problem with some np functions: count_nonzero and sum worked, while mean and std didn't.
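The dtype diagnosis above can be checked directly by printing `dtypes` before and after the conversion. The sketch below uses a small made-up frame; the column names `energy` and `docs` and all values are illustrative assumptions, not the asker's data.

```python
import pandas as pd

# Hypothetical data: "docs" is stored as object dtype even though it holds numbers.
df = pd.DataFrame({
    "energy": [1.0, 2.0, 3.0, 4.0],
    "docs": pd.Series([2.0, 4.1, 5.9, 8.0], dtype=object),
})
print(df.dtypes)  # "docs" is reported as object here

# Convert the object column to float, as the answer describes.
df["docs"] = df["docs"].astype(float)
print(df.dtypes)  # both columns are float64 now

# Series.corr returns a single number for the two columns.
r = df["docs"].corr(df["energy"])
print(r)
```

How `DataFrame.corr` treats an object column differs between pandas versions, so converting explicitly avoids depending on that behavior.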

User · Answer

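If you want the correlations between all pairs of columns, you could do something like this:

    import pandas as pd
    import numpy as np

    def get_corrs(df):
        col_correlations = df.corr()
        # keep only the entries below the diagonal; everything else becomes 0
        col_correlations.loc[:, :] = np.tril(col_correlations, k=-1)
        cor_pairs = col_correlations.stack()
        return cor_pairs.to_dict()

    my_corrs = get_corrs(df)

and the following line to retrieve the single correlation:

    print(my_corrs[('Citable docs per Capita', 'Energy Supply per Capita')])

Because only the lower triangle is kept, the key must be ordered (row, column) with the column that appears later in the dataframe first; the reversed key maps to 0.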
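A different way to build the same kind of pair lookup, without masking a matrix, is to loop over column pairs with `itertools.combinations` and call `Series.corr` on each. This is a swapped-in technique, not the answerer's code, and the frame and column names `A`, `B`, `C` are made up for illustration.

```python
from itertools import combinations

import pandas as pd

# Hypothetical numeric data.
df = pd.DataFrame({
    "A": [0, 1, 2, 3],
    "B": [0.0, 2.0, 4.5, 6.0],
    "C": [3, 2, 1, 0],
})

# One entry per unordered pair of columns, keyed in column order.
pairs = {(a, b): df[a].corr(df[b]) for a, b in combinations(df.columns, 2)}

for key, value in pairs.items():
    print(key, round(value, 6))
```

Each pair appears exactly once, so there are no zero-filled duplicate keys to trip over.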

User · Answer

When you call this:

    data = Top15[['Citable docs per Capita', 'Energy Supply per Capita']]
    correlation = data.corr(method='pearson')

Since the DataFrame.corr() function performs pair-wise correlations, you have four pairs from two variables. So basically you are getting the diagonal values as auto-correlations (correlation with itself; two values, since you have two variables), and the other two values as cross-correlations of one vs. the other and vice versa.

Either perform the correlation between two series to get a single value:

    from scipy.stats import pearsonr

    docs_col = Top15['Citable docs per Capita'].values
    energy_col = Top15['Energy Supply per Capita'].values
    corr, _ = pearsonr(docs_col, energy_col)

or, if you want a single value from the same function (DataFrame's corr), pick the off-diagonal entry by position:

    single_value = correlation.iloc[0, 1]

Hope this helps.
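To make the "four pairs from two variables" point concrete, here is a minimal sketch on made-up data (columns `A` and `B` are illustrative). It pulls the off-diagonal entry out of the 2x2 matrix by position and by label and compares it with the value from `Series.corr`.

```python
import pandas as pd

# Hypothetical two-column frame.
df = pd.DataFrame({"A": [0, 1, 2, 3], "B": [0.0, 2.0, 4.5, 6.0]})

matrix = df.corr(method="pearson")  # 2x2 matrix with 1s on the diagonal
print(matrix)

by_position = matrix.iloc[0, 1]     # first row, second column
by_label = matrix.loc["A", "B"]     # same cell, addressed by name
direct = df["A"].corr(df["B"])      # single number without building the matrix

print(by_position, by_label, direct)
```

Addressing the cell by label is less fragile than by position if the column order might change.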

User · Answer

I ran into the same issue. It appeared Citable Documents per Person was a float, and Python skips it somehow by default. All the other columns of my dataframe were in numpy formats, so I solved it by converting the column to np.float64:

    Top15['Citable Documents per Person'] = np.float64(Top15['Citable Documents per Person'])

Remember, it's exactly the column you calculated yourself.

User · Answer

Without actual data it is hard to answer the question, but I guess you are looking for something like this:

    Top15['Citable docs per Capita'].corr(Top15['Energy Supply per Capita'])

That calculates the correlation between your two columns 'Citable docs per Capita' and 'Energy Supply per Capita'.

To give an example:

    import pandas as pd

    df = pd.DataFrame({'A': range(4), 'B': [2*i for i in range(4)]})

       A  B
    0  0  0
    1  1  2
    2  2  4
    3  3  6

Then

    df['A'].corr(df['B'])

gives 1 as expected.

Now, if you change a value, e.g.

    df.loc[2, 'B'] = 4.5

       A    B
    0  0  0.0
    1  1  2.0
    2  2  4.5
    3  3  6.0

the command

    df['A'].corr(df['B'])

returns

    0.99586

which is still close to 1, as expected.

If you apply .corr() directly to your dataframe, it will return all pairwise correlations between your columns; that's why you then observe 1s on the diagonal of your matrix (each column is perfectly correlated with itself).

    df.corr()

will therefore return

              A         B
    A  1.000000  0.995862
    B  0.995862  1.000000

In the graphic you show, only the upper left corner of the correlation matrix is represented (I assume).

There can be cases where you get NaNs in your solution; check this post for an example.

If you want to filter entries above/below a certain threshold, you can check this question. If you want to plot a heatmap of the correlation coefficients, you can check this answer, and if you then run into the issue with overlapping axis labels, check the following post.
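One case where a NaN shows up is a column with no variation: Pearson's coefficient divides by the standard deviation, which is zero for a constant column. The linked post is not available here, so this sketch is only an assumed example of such a case, on made-up data with a hypothetical column named `flat`.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical frame: "flat" never changes, so its standard deviation is zero.
df = pd.DataFrame({"A": [0, 1, 2, 3], "flat": [5.0, 5.0, 5.0, 5.0]})

# Silence NumPy's floating-point warning for the 0/0 division.
with np.errstate(invalid="ignore", divide="ignore"):
    r = df["A"].corr(df["flat"])

print(math.isnan(r))
```

Checking `df.nunique()` or `df.std()` beforehand is a simple way to spot such columns.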

User · Answer

My solution, after converting the data to a numerical type, would be:

    Top15[['Citable docs per Capita', 'Energy Supply per Capita']].corr()

User · Answer

Changing 'Citable docs per Capita' to numeric before computing the correlation will solve the problem:

    Top15['Citable docs per Capita'] = pd.to_numeric(Top15['Citable docs per Capita'])
    data = Top15[['Citable docs per Capita', 'Energy Supply per Capita']]
    correlation = data.corr(method='pearson')
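If the column holds strings, some of which are not numbers at all, `pd.to_numeric` with `errors="coerce"` turns the unparseable entries into NaN instead of raising. The sketch below assumes such a column; the data and the names `docs` and `energy` are made up for illustration.

```python
import pandas as pd

# Hypothetical frame: "docs" arrived as text, with one unparseable entry.
df = pd.DataFrame({
    "docs": ["0.10", "0.25", "n/a", "0.40", "0.55"],
    "energy": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Parse the strings; anything that is not a number becomes NaN.
df["docs"] = pd.to_numeric(df["docs"], errors="coerce")
print(df.dtypes)

# Series.corr leaves out rows where either value is missing.
r = df["docs"].corr(df["energy"])
print(r)
```

Counting `df["docs"].isna().sum()` after the conversion shows how many rows were dropped from the calculation.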

User · Answer

It works like this:

    Top15['Citable docs per Capita'] = np.float64(Top15['Citable docs per Capita'])
    Top15['Energy Supply per Capita'] = np.float64(Top15['Energy Supply per Capita'])
    Top15['Energy Supply per Capita'].corr(Top15['Citable docs per Capita'])
