Calculating Pearson correlation and significance in Python

Question

I am looking for a function that takes two lists as input and returns the Pearson correlation and the significance of the correlation.

User · Answer

You may wonder how to interpret your probability in the context of looking for a correlation in a particular direction (negative or positive correlation.) Here is a function I wrote to help with that. It might even be right!

It's based on info I gleaned from http://www.vassarstats.net/rsig.html and http://en.wikipedia.org/wiki/Student%27s_t_distribution, thanks to other answers posted here.

from scipy import stats

# Given (possibly random) variables, X and Y, and a correlation direction,
# returns:
#  (r, p),
# where r is the Pearson correlation coefficient, and p is the probability
# that there is no correlation in the given direction.
#
# direction:
#  if positive, p is the probability that there is no positive correlation in
#    the population sampled by X and Y
#  if negative, p is the probability that there is no negative correlation
#  if 0, p is the probability that there is no correlation in either direction
def probabilityNotCorrelated(X, Y, direction=0):
    x = len(X)
    if x != len(Y):
        raise ValueError("variables not same len: " + str(x) + ", and " + \
                         str(len(Y)))
    if x < 6:
        raise ValueError("must have at least 6 samples, but have " + str(x))
    (corr, prb_2_tail) = stats.pearsonr(X, Y)

    if not direction:
        return (corr, prb_2_tail)

    prb_1_tail = prb_2_tail / 2
    if corr * direction > 0:
        return (corr, prb_1_tail)

    return (corr, 1 - prb_1_tail)
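
For reference, here is the usual textbook relation behind the halving step, written out as a sketch. It assumes the standard t-test for a correlation coefficient (n samples, null hypothesis of zero population correlation, roughly bivariate-normal data); the symbols p_two and p_one are labels introduced here for the two-tailed and one-tailed p-values, not names from any library.

```latex
% Test statistic for H0: rho = 0, with n - 2 degrees of freedom
t = r \sqrt{\frac{n - 2}{1 - r^{2}}}, \qquad \mathrm{df} = n - 2

% One-tailed p-value from the two-tailed one
p_{\text{one}} =
\begin{cases}
  p_{\text{two}} / 2     & \text{if } r \text{ has the sign of the tested direction} \\
  1 - p_{\text{two}} / 2 & \text{otherwise}
\end{cases}
```

This is the same case split the function makes with `corr * direction > 0`.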

User · Answer

This is an implementation of the Pearson correlation function using numpy:

def corr(data1, data2):
    "data1 & data2 should be numpy arrays."
    mean1 = data1.mean()
    mean2 = data2.mean()
    std1 = data1.std()
    std2 = data2.std()

    # Equivalent form:
    # corr = ((data1-mean1)*(data2-mean2)).mean()/(std1*std2)
    corr = ((data1*data2).mean()-mean1*mean2)/(std1*std2)
    return corr

User · Answer

Pearson coefficient calculation using pandas in Python:

I would suggest trying this approach since your data contains lists. It will be easy to interact with your data and manipulate it from the console, since you can visualise your data structure and update it as you wish. You can also export the data set, save it, and add new data outside the Python console for later analysis. This code is simpler and contains fewer lines of code. I am assuming you need a few quick lines of code to screen your data for further analysis.

Example:

data = {'list 1': [2, 4, 6, 8], 'list 2': [4, 16, 36, 64]}

import pandas as pd  # To convert your lists to pandas data frames

df = pd.DataFrame(data, columns=['list 1', 'list 2'])

from scipy import stats  # For the built-in method to get the PCC

# Define the columns to perform calculations on
pearson_coef, p_value = stats.pearsonr(df["list 1"], df["list 2"])
print("Pearson Correlation Coefficient: ", pearson_coef, "and a P-value of:", p_value)  # Results

However, you did not post your data for me to see the size of the data set or the transformations that might be needed before the analysis.

User · Answer

Rather than rely on numpy/scipy, I think my answer should be the easiest to code, and it makes the steps in calculating the Pearson Correlation Coefficient (PCC) easy to follow.

import math

# calculates the mean
def mean(x):
    total = 0.0
    for i in x:
        total += i
    return total / len(x)

# calculates the sample standard deviation
def sampleStandardDeviation(x):
    m = mean(x)
    sumv = 0.0
    for i in x:
        sumv += (i - m)**2
    return math.sqrt(sumv/(len(x)-1))

# calculates the PCC using both the 2 functions above
def pearson(x, y):
    mean_x, sd_x = mean(x), sampleStandardDeviation(x)
    mean_y, sd_y = mean(y), sampleStandardDeviation(y)
    scorex = [(i - mean_x)/sd_x for i in x]
    scorey = [(j - mean_y)/sd_y for j in y]

    # multiplies both lists together into 1 list (hence zip) and sums the whole list
    return sum(i*j for i, j in zip(scorex, scorey))/(len(x)-1)

The PCC shows how strongly the two variables/lists are linearly correlated. It is important to note that the PCC value ranges from -1 to 1. A value between 0 and 1 denotes a positive correlation. A value of 0 means no linear correlation at all. A value between -1 and 0 denotes a negative correlation.

User · Answer

Here's a variant on mkh's answer that runs much faster than it, and than scipy.stats.pearsonr, using numba.

import numba

@numba.jit
def corr(data1, data2):
    M = data1.size

    sum1 = 0.
    sum2 = 0.
    for i in range(M):
        sum1 += data1[i]
        sum2 += data2[i]
    mean1 = sum1 / M
    mean2 = sum2 / M

    var_sum1 = 0.
    var_sum2 = 0.
    cross_sum = 0.
    for i in range(M):
        var_sum1 += (data1[i] - mean1) ** 2
        var_sum2 += (data2[i] - mean2) ** 2
        cross_sum += (data1[i] * data2[i])

    std1 = (var_sum1 / M) ** .5
    std2 = (var_sum2 / M) ** .5
    cross_mean = cross_sum / M

    return (cross_mean - mean1 * mean2) / (std1 * std2)

User · Answer

You can have a look at scipy.stats:

from pydoc import help
from scipy.stats import pearsonr
help(pearsonr)

>>>
Help on function pearsonr in module scipy.stats.stats:

pearsonr(x, y)
 Calculates a Pearson correlation coefficient and the p-value for testing
 non-correlation.

 The Pearson correlation coefficient measures the linear relationship
 between two datasets. Strictly speaking, Pearson's correlation requires
 that each dataset be normally distributed. Like other correlation
 coefficients, this one varies between -1 and +1 with 0 implying no
 correlation. Correlations of -1 or +1 imply an exact linear
 relationship. Positive correlations imply that as x increases, so does
 y. Negative correlations imply that as x increases, y decreases.

 The p-value roughly indicates the probability of an uncorrelated system
 producing datasets that have a Pearson correlation at least as extreme
 as the one computed from these datasets. The p-values are not entirely
 reliable but are probably reasonable for datasets larger than 500 or so.

 Parameters
 ----------
 x : 1D array
 y : 1D array the same length as x

 Returns
 -------
 (Pearson's correlation coefficient,
  2-tailed p-value)

 References
 ----------
 http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation
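
The p-value description in that docstring can be made concrete with a permutation test. The following is only a sketch using the standard library, not what scipy does internally: `pearson_r` and `permutation_p_value` are hypothetical helper names, and the number of shuffles and the seed are arbitrary choices.

```python
import random


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (two-pass formula)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]
    sxy = sum(a * b for a, b in zip(dx, dy))
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value by shuffling y to break any association."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        # Count shuffles whose correlation is at least as extreme as observed.
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2 * v + 1 for v in x]  # exactly linear in x
r = pearson_r(x, y)
p = permutation_p_value(x, y)
print(r, p)
```

Shuffling makes no normality assumption, at the cost of run time; the smallest p-value it can report is 1 / (n_permutations + 1).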

User · Answer

The Pearson correlation can be calculated with numpy's corrcoef.

import numpy
numpy.corrcoef(list1, list2)[0, 1]

User · Answer

If you don't feel like installing scipy, I've used this quick hack, slightly modified from Programming Collective Intelligence:

def pearsonr(x, y):
  # Assume len(x) == len(y)
  n = len(x)
  sum_x = float(sum(x))
  sum_y = float(sum(y))
  sum_x_sq = sum(xi*xi for xi in x)
  sum_y_sq = sum(yi*yi for yi in y)
  psum = sum(xi*yi for xi, yi in zip(x, y))
  num = psum - (sum_x * sum_y/n)
  den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
  if den == 0: return 0
  return num / den
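
One caveat with sum-of-squares formulas like this: when the values are large compared with their spread, subtracting two nearly equal sums can lose precision. Here is a hedged sketch of the two-pass alternative, which subtracts the means first. The name `pearson_two_pass` is made up for illustration, and returning 0.0 for a zero denominator just mirrors the hack above.

```python
import math


def pearson_two_pass(x, y):
    """Pearson r computed from deviations about the means."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = math.fsum(x) / n
    mean_y = math.fsum(y) / n
    sxy = math.fsum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = math.fsum((xi - mean_x) ** 2 for xi in x)
    syy = math.fsum((yi - mean_y) ** 2 for yi in y)
    den = math.sqrt(sxx * syy)
    # Same convention as the quick hack: 0 when either input is constant.
    return sxy / den if den else 0.0


# Values with a large common offset: the spread is tiny relative to the magnitude.
x = [1e9 + 1, 1e9 + 2, 1e9 + 3]
y = [2 * xi for xi in x]
print(pearson_two_pass(x, y))
```

The trade-off is a second pass over the data, which rarely matters for lists that fit in memory.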

User · Answer

I have a very simple and easy to understand solution for this. For two arrays of equal length, the Pearson coefficient can be easily computed as follows:

import numpy as np

def manual_pearson(a, b):
    """
    Accepts two arrays of equal length, and computes correlation coefficient.
    Numerator is the sum of product of (a - a_avg) and (b - b_avg),
    while denominator is the product of a_std and b_std multiplied by
    length of array.
    """
    a_avg, b_avg = np.average(a), np.average(b)
    a_stdev, b_stdev = np.std(a), np.std(b)
    n = len(a)
    denominator = a_stdev * b_stdev * n
    numerator = np.sum(np.multiply(a-a_avg, b-b_avg))
    p_coef = numerator/denominator
    return p_coef

User · Answer

You can take a look at this article. It is a well-documented example of calculating correlation from historical forex currency pair data spread across multiple files using the pandas library (for Python), and then generating a heatmap plot using the seaborn library.

http://www.tradinggeeks.net/2015/08/calculating-correlation-in-python

User · Answer

The following code is a straight-up interpretation of the definition:

import math

def average(x):
    assert len(x) > 0
    return float(sum(x)) / len(x)

def pearson_def(x, y):
    assert len(x) == len(y)
    n = len(x)
    assert n > 0
    avg_x = average(x)
    avg_y = average(y)
    diffprod = 0
    xdiff2 = 0
    ydiff2 = 0
    for idx in range(n):
        xdiff = x[idx] - avg_x
        ydiff = y[idx] - avg_y
        diffprod += xdiff * ydiff
        xdiff2 += xdiff * xdiff
        ydiff2 += ydiff * ydiff

    return diffprod / math.sqrt(xdiff2 * ydiff2)

Test:

print(pearson_def([1,2,3], [1,5,7]))

returns

0.981980506062

(as printed by Python 2; Python 3 prints more digits).

This agrees with Excel, this calculator, SciPy (also NumPy), which return 0.981980506 and 0.9819805060619657, and 0.98198050606196574, respectively.

R:

> cor( c(1,2,3), c(1,5,7))
[1] 0.9819805

EDIT: Fixed a bug pointed out by a commenter.
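
For anyone who wants the definition written out, this is the formula the loop accumulates, followed by the test case worked by hand. The notation (bars for means, n for the sample size) is the standard one and is not taken from the answer itself.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^{2} \, \sum_{i=1}^{n} (y_i - \bar{y})^{2}}}

% Worked example: x = (1, 2, 3), y = (1, 5, 7), so \bar{x} = 2 and \bar{y} = 13/3
\sum (x_i - \bar{x})(y_i - \bar{y}) = 6, \qquad
\sum (x_i - \bar{x})^{2} = 2, \qquad
\sum (y_i - \bar{y})^{2} = \tfrac{56}{3}

r = \frac{6}{\sqrt{2 \cdot 56/3}} = \frac{6}{\sqrt{112/3}} \approx 0.98198
```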

User · Answer

An alternative can be the native scipy function linregress, which calculates:

slope : slope of the regression line
intercept : intercept of the regression line
r-value : correlation coefficient
p-value : two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero
stderr : standard error of the estimate

And here is an example:

a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
from scipy.stats import linregress
linregress(a, b)

will return:

LinregressResult(slope=0.20833333333333337, intercept=13.375, rvalue=0.14499815458068521, pvalue=0.68940144811669501, stderr=0.50261704627083648)
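
To see where those numbers come from, the slope, intercept and r-value can be reproduced with the standard library alone. This is just a sketch of the textbook least-squares formulas applied to the same `a` and `b`; it does not compute the p-value or stderr, and it is not how scipy is implemented.

```python
import math

a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]

n = len(a)
mean_a = sum(a) / n
mean_b = sum(b) / n

# Sums of squares and cross-products about the means.
s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
s_aa = sum((x - mean_a) ** 2 for x in a)
s_bb = sum((y - mean_b) ** 2 for y in b)

slope = s_ab / s_aa                  # least-squares slope of b on a
intercept = mean_b - slope * mean_a  # line passes through the two means
r = s_ab / math.sqrt(s_aa * s_bb)    # Pearson correlation coefficient

print(slope, intercept, r)
```

The slope and r share the same numerator, which is why testing "slope is zero" and testing "correlation is zero" give the same p-value in simple linear regression.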

User · Answer

Hmm, many of these responses have long and hard to read code...

I'd suggest using numpy with its nifty features when working with arrays:

import numpy as np

def pcc(X, Y):
    ''' Compute Pearson Correlation Coefficient. '''
    # Normalise X and Y (on copies, so the caller's arrays are left unchanged)
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    # Standardise X and Y
    X = X / X.std(0)
    Y = Y / Y.std(0)
    # Compute mean product
    return np.mean(X*Y)

# Using it on a random example
from random import random
X = np.array([random() for x in range(100)])
Y = np.array([random() for x in range(100)])
pcc(X, Y)

User · Answer

You can do this with pandas.DataFrame.corr, too:

import pandas as pd
a = [[1, 2, 3],
     [5, 6, 9],
     [5, 6, 11],
     [5, 6, 13],
     [5, 3, 13]]
df = pd.DataFrame(data=a)
df.corr()

This gives

          0         1         2
0  1.000000  0.745601  0.916579
1  0.745601  1.000000  0.544248
2  0.916579  0.544248  1.000000
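
Each entry in that matrix is the Pearson correlation between a pair of columns. As a rough cross-check without pandas, the same numbers can be rebuilt column by column; `pearson_r` below is a hypothetical helper written for this sketch, not a pandas function.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


rows = [[1, 2, 3],
        [5, 6, 9],
        [5, 6, 11],
        [5, 6, 13],
        [5, 3, 13]]

columns = list(zip(*rows))  # transpose rows into columns
matrix = [[pearson_r(ci, cj) for cj in columns] for ci in columns]

for row in matrix:
    print(["%.6f" % value for value in row])
```

The matrix is symmetric with ones on the diagonal, so only the three off-diagonal pairs carry information here.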

User · Answer

Here is an implementation of Pearson correlation based on sparse vectors. The vectors here are expressed as lists of tuples of the form (index, value). The two sparse vectors can be of different lengths, but the overall vector size has to be the same. This is useful for text mining applications, where the vector size is extremely large because most features are bag of words, so calculations are usually performed on sparse vectors.

from math import sqrt

def get_pearson_corelation(self, first_feature_vector=[], second_feature_vector=[], length_of_featureset=0):
    indexed_feature_dict = {}
    if first_feature_vector == [] or second_feature_vector == [] or length_of_featureset == 0:
        raise ValueError("Empty feature vectors or zero length of featureset in get_pearson_corelation")

    sum_a = sum(value for index, value in first_feature_vector)
    sum_b = sum(value for index, value in second_feature_vector)

    avg_a = float(sum_a) / length_of_featureset
    avg_b = float(sum_b) / length_of_featureset

    mean_sq_error_a = sqrt((sum((value - avg_a) ** 2 for index, value in first_feature_vector)) + ((
        length_of_featureset - len(first_feature_vector)) * ((0 - avg_a) ** 2)))
    mean_sq_error_b = sqrt((sum((value - avg_b) ** 2 for index, value in second_feature_vector)) + ((
        length_of_featureset - len(second_feature_vector)) * ((0 - avg_b) ** 2)))

    covariance_a_b = 0

    # calculate covariance for the sparse vectors
    for tuple in first_feature_vector:
        if len(tuple) != 2:
            raise ValueError("Invalid feature frequency tuple in featureVector: %s" % (tuple,))
        indexed_feature_dict[tuple[0]] = tuple[1]
    count_of_features = 0
    for tuple in second_feature_vector:
        count_of_features += 1
        if len(tuple) != 2:
            raise ValueError("Invalid feature frequency tuple in featureVector: %s" % (tuple,))
        if tuple[0] in indexed_feature_dict:
            covariance_a_b += ((indexed_feature_dict[tuple[0]] - avg_a) * (tuple[1] - avg_b))
            del (indexed_feature_dict[tuple[0]])
        else:
            covariance_a_b += (0 - avg_a) * (tuple[1] - avg_b)

    for index in indexed_feature_dict:
        count_of_features += 1
        covariance_a_b += (indexed_feature_dict[index] - avg_a) * (0 - avg_b)

    # adjust covariance with rest of vector with 0 value
    covariance_a_b += (length_of_featureset - count_of_features) * -avg_a * -avg_b

    if mean_sq_error_a == 0 or mean_sq_error_b == 0:
        return -1
    else:
        return float(covariance_a_b) / (mean_sq_error_a * mean_sq_error_b)

Unit tests:

def test_get_get_pearson_corelation(self):
    vector_a = [(1, 1), (2, 2), (3, 3)]
    vector_b = [(1, 1), (2, 5), (3, 7)]
    self.assertAlmostEqual(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 3), 0.981980506062, 3, None, None)

    vector_a = [(1, 1), (2, 2), (3, 3)]
    vector_b = [(1, 1), (2, 5), (3, 7), (4, 14)]
    self.assertAlmostEqual(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 5), -0.0137089240555, 3, None, None)
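
The same idea can be sketched more compactly as a standalone function over dicts. This is an illustrative variant under stated assumptions, not the code above: `sparse_pearson` is a made-up name, the vectors are passed as {index: value} dicts rather than lists of tuples, and it raises on a constant vector instead of returning -1.

```python
from math import sqrt


def sparse_pearson(vec_a, vec_b, size):
    """Pearson r for two sparse vectors given as {index: value} dicts.

    Indices missing from a dict are treated as 0; `size` is the full length.
    """
    if size <= 0 or not vec_a or not vec_b:
        raise ValueError("empty vector or non-positive size")
    mean_a = sum(vec_a.values()) / size
    mean_b = sum(vec_b.values()) / size

    # Sum of squared deviations: stored entries plus the implicit zeros.
    ss_a = (sum((v - mean_a) ** 2 for v in vec_a.values())
            + (size - len(vec_a)) * mean_a ** 2)
    ss_b = (sum((v - mean_b) ** 2 for v in vec_b.values())
            + (size - len(vec_b)) * mean_b ** 2)

    # Cross-products over every index stored in either vector.
    touched = set(vec_a) | set(vec_b)
    cross = sum((vec_a.get(i, 0) - mean_a) * (vec_b.get(i, 0) - mean_b)
                for i in touched)
    # Indices absent from both contribute (0 - mean_a) * (0 - mean_b) each.
    cross += (size - len(touched)) * mean_a * mean_b

    den = sqrt(ss_a * ss_b)
    if den == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cross / den


print(sparse_pearson({1: 1, 2: 2, 3: 3}, {1: 1, 2: 5, 3: 7}, 3))
print(sparse_pearson({1: 1, 2: 2, 3: 3}, {1: 1, 2: 5, 3: 7, 4: 14}, 5))
```

Raising on a constant vector is a design choice for this sketch: -1 is also a valid correlation value, so using it as an error code can hide the problem.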

User · Answer

def pearson(x, y):
  n = len(x)
  vals = range(n)

  sumx = sum([float(x[i]) for i in vals])
  sumy = sum([float(y[i]) for i in vals])

  sumxSq = sum([x[i]**2.0 for i in vals])
  sumySq = sum([y[i]**2.0 for i in vals])

  pSum = sum([x[i]*y[i] for i in vals])
  # Calculating Pearson correlation
  num = pSum - (sumx*sumy/n)
  den = ((sumxSq - pow(sumx, 2)/n) * (sumySq - pow(sumy, 2)/n))**.5
  if den == 0: return 0
  r = num/den
  return r
