Find p-value significance in scikit-learn LinearRegression

Question

How can I find the p-value  significance  of each coefficient   lm   sklearn linear model LinearRegression   lm fit x y

User · Answer

EDIT  Probably not the right way to do it  see comments  You could use sklearn feature selection f regression   click here for the scikit-learn page

User · Answer

For a one-liner you can use the pingouin linear regression function  disclaimer  I am the creator of Pingouin   which works with uni multi-variate regression using NumPy arrays or Pandas DataFrame  e g   import pingouin as pg   Using a Pandas DataFrame  df   lm   pg linear regression df   x    z     df  y      Using a NumPy array  lm   pg linear regression X  y    The output is a dataframe with the beta coefficients  standard errors  T-values  p-values and confidence intervals for each predictor  as well as the R 2 and adjusted R 2 of the fit

User · Answer

p value is among f statistics  if you want to get the value  simply use this few lines of code   import statsmodels api as sm from scipy import stats  diabetes   datasets load diabetes   X   diabetes data y   diabetes target  X2   sm add constant X  est   sm OLS y  X2  print est fit   f pvalue

User · Answer

There could be a mistake in  JARH s answer in the case of a multivariable regression    I do not have enough reputation to comment    In the following line   p values   2  1-stats t cdf np abs i   len newX -1    for i in ts b      the t-values follows a chi-squared distribution of degree len newX -1 instead of following a chi-squared distribution of degree len newX -len newX columns -1   So this should be    p values   2  1-stats t cdf np abs i   len newX -len newX columns -1    for i in ts b    See t-values for OLS regression for more details

User · Answer

You can use scipy for p-value  This code is from scipy documentation      gt  gt  gt  from scipy import stats  gt  gt  gt  import numpy as np  gt  gt  gt  x   np random random 10   gt  gt  gt  y   np random random 10   gt  gt  gt  slope  intercept  r value  p value  std err   stats linregress x y

User · Answer

scikit-learn s LinearRegression doesn t calculate this information but you can easily extend the class to do it   from sklearn import linear model from scipy import stats import numpy as np   class LinearRegression linear model LinearRegression               LinearRegression class after sklearn s  but calculate t-statistics     and p-values for model coefficients  betas       Additional attributes available after  fit       are  t  and  p  which are of the shape  y shape 1   X shape 1       which is  n features  n coefs      This class sets the intercept to 0 by default  since usually we include it     in X               def   init   self   args    kwargs           if not  fit intercept  in kwargs              kwargs  fit intercept     False         super LinearRegression  self                      init    args    kwargs       def fit self  X  y  n jobs 1           self   super LinearRegression  self  fit X  y  n jobs           sse   np sum  self predict X  - y     2  axis 0    float X shape 0  - X shape 1           se   np array               np sqrt np diagonal sse i    np linalg inv np dot X T  X                                                         for i in range sse shape 0                                   self t   self coef    se         self p   2    1 - stats t cdf np abs self t   y shape 0  - X shape 1            return self   Stolen from here    You should take a look at statsmodels for this kind of statistical analysis in Python

User · Answer

The code in elyase s answer https   stackoverflow com a 27928411 4240413 does not actually work   Notice that sse is a scalar  and then it tries to iterate through it   The following code is a modified version   Not amazingly clean  but I think it works more or less   class LinearRegression linear model LinearRegression        def   init   self  args   kwargs              args is the list of arguments that might go into the LinearRegression object           that we don t know about and don t want to have to deal with  Similarly    kwargs           is a dictionary of key words and values that might also need to go into the orginal           LinearRegression object  We put  args and   kwargs so that we don t have to look           these up and write them down explicitly here  Nice and easy           if not  fit intercept  in kwargs              kwargs  fit intercept     False          super LinearRegression self    init    args   kwargs         Adding in t-statistics for the coefficients      def fit self x y             This takes in numpy arrays  not matrices   Also assumes you are leaving out the column           of constants             Not totally sure what  super  does here and why you redefine self            self   super LinearRegression  self  fit x y          n  k   x shape         yHat   np matrix self predict x   T            Change X and Y into numpy matricies  x also has a column of ones added to it          x   np hstack  np ones  n 1   np matrix x            y   np matrix y  T            Degrees of freedom          df   float n-k-1             Sample variance               sse   np sum np square yHat - y  axis 0          self sampleVariance   sse df            Sample variance for x          self sampleVarianceX   x T x            Covariance Matrix     s 2  X X  -1  0 5   sqrtm   matrix square root   ugly          self covarianceMatrix   sc linalg sqrtm self sampleVariance 0 0  self sampleVarianceX I             Standard erros for the difference coefficients  the diagonal elements of the covariance matrix          self se   self covarianceMatrix diagonal   1              T statistic for each beta          self betasTStat   np zeros len self se           for i in xrange len self se                self betasTStat i    self coef  0 i  self se i             P-value for each beta  This is a two sided t-test  since the betas can be            positive or negative          self betasPValue   1 - t cdf abs self betasTStat  df

User · Answer

This is kind of overkill but let s give it a go   First lets use statsmodel to find out what the p-values should be import pandas as pd import numpy as np from sklearn import datasets  linear model from sklearn linear model import LinearRegression import statsmodels api as sm from scipy import stats  diabetes   datasets load diabetes   X   diabetes data y   diabetes target  X2   sm add constant X  est   sm OLS y  X2  est2   est fit   print est2 summary     and we get                          OLS Regression Results                                                                                                            Dep  Variable                       y   R-squared                        0 518 Model                             OLS   Adj  R-squared                   0 507 Method                  Least Squares   F-statistic                      46 27 Date                 Wed  08 Mar 2017   Prob  F-statistic             3 83e-62 Time                         10 08 24   Log-Likelihood                 -2386 0 No  Observations                  442   AIC                              4794  Df Residuals                      431   BIC                              4839  Df Model                           10                                          Covariance Type             nonrobust                                                                                                                                          coef    std err          t      P gt  t        0 025      0 975  ------------------------------------------------------------------------------ const        152 1335      2 576     59 061      0 000     147 071     157 196 x1           -10 0122     59 749     -0 168      0 867    -127 448     107 424 x2          -239 8191     61 222     -3 917      0 000    -360 151    -119 488 x3           519 8398     66 534      7 813      0 000     389 069     650 610 x4           324 3904     65 422      4 958      0 000     195 805     452 976 x5          -792 1842    416 684     -1 901      0 058   -1611 169      26 801 x6           476 7458    339 035      1 406      0 160    -189 621    1143 113 x7           101 0446    212 533      0 475      0 635    -316 685     518 774 x8           177 0642    161 476      1 097      0 273    -140 313     494 442 x9           751 2793    171 902      4 370      0 000     413 409    1089 150 x10           67 6254     65 984      1 025      0 306     -62 065     197 316                                                                                Omnibus                         1 506   Durbin-Watson                    2 029 Prob Omnibus                    0 471   Jarque-Bera  JB                  1 404 Skew                            0 017   Prob JB                          0 496 Kurtosis                        2 726   Cond  No                          227                                                                                  Ok  let s reproduce this   It is kind of overkill as we are almost reproducing a linear regression analysis using Matrix Algebra   But what the heck  lm   LinearRegression   lm fit X y  params   np append lm intercept  lm coef   predictions   lm predict X   newX   pd DataFrame   quot Constant quot  np ones len X     join pd DataFrame X   MSE    sum  y-predictions   2    len newX -len newX columns      Note if you don t want to use a DataFrame replace the two lines above with   newX   np append np ones  len X  1    X  axis 1    MSE    sum  y-predictions   2    len newX -len newX 0     var b   MSE  np linalg inv np dot newX T newX   diagonal    sd b   np sqrt var b  ts b   params  sd b  p values   2  1-stats t cdf np abs i   len newX -len newX 0      for i in ts b   sd b   np round sd b 3  ts b   np round ts b 3  p values   np round p values 3  params   np round params 4   myDF3   pd DataFrame   myDF3  quot Coefficients quot   myDF3  quot Standard Errors quot   myDF3  quot t values quot   myDF3  quot Probabilities quot      params sd b ts b p values  print myDF3   And this gives us      Coefficients  Standard Errors  t values  Probabilities 0       152 1335            2 576    59 061         0 000 1       -10 0122           59 749    -0 168         0 867 2      -239 8191           61 222    -3 917         0 000 3       519 8398           66 534     7 813         0 000 4       324 3904           65 422     4 958         0 000 5      -792 1842          416 684    -1 901         0 058 6       476 7458          339 035     1 406         0 160 7       101 0446          212 533     0 475         0 635 8       177 0642          161 476     1 097         0 273 9       751 2793          171 902     4 370         0 000 10       67 6254           65 984     1 025         0 306  So we can reproduce the values from statsmodel

User · Answer

An easy way to pull of the p-values is to use statsmodels regression   import statsmodels api as sm mod   sm OLS Y X  fii   mod fit   p values   fii summary2   tables 1   P gt  t      You get a series of p-values that you can manipulate  for example choose the order you want to keep by evaluating each p-value

[python] Find p-value (significance) in scikit-learn LinearRegression

Examples related to python

Examples related to numpy

Examples related to statistics

Examples related to scikit-learn

Examples related to regression