[python] Run an OLS regression with Pandas Data Frame

I have a pandas data frame and I would like to be able to predict the values of column A from the values in columns B and C. Here is a toy example:

import pandas as pd
df = pd.DataFrame({"A": [10,20,30,40,50], 
                   "B": [20, 30, 10, 40, 50], 
                   "C": [32, 234, 23, 23, 42523]})

Ideally, I would have something like ols(A ~ B + C, data = df), but when I look at the examples from algorithm libraries like scikit-learn, it appears the data is fed to the model as a list of rows instead of columns. This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place. What is the most pythonic way to run an OLS regression (or any machine learning algorithm more generally) on data in a pandas data frame?

Tags: python, pandas, scikit-learn, regression, statsmodels

Answers:


Note: pandas.stats was removed in pandas 0.20.0


In older pandas versions (before 0.20.0), it was possible to do this with pandas.stats.ols:

>>> from pandas.stats.api import ols
>>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
>>> res = ols(y=df['A'], x=df[['B','C']])
>>> res
-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <B> + <C> + <intercept>

Number of Observations:         5
Number of Degrees of Freedom:   3

R-squared:         0.5789
Adj R-squared:     0.1577

Rmse:             14.5108

F-stat (2, 2):     1.3746, p-value:     0.4211

Degrees of Freedom: model 2, resid 2

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             B     0.4012     0.6497       0.62     0.5999    -0.8723     1.6746
             C     0.0004     0.0005       0.65     0.5826    -0.0007     0.0014
     intercept    14.9525    17.7643       0.84     0.4886   -19.8655    49.7705
---------------------------------End of Summary---------------------------------

Note that you need to have the statsmodels package installed; it is used internally by the pandas.stats.ols function.
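
pandas.stats is gone from modern pandas, so the same fit nowadays goes through statsmodels directly. A minimal sketch with the formula API, which reproduces the regression above (the intercept is added automatically):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

# R-style formula: regress A on B and C
res = smf.ols("A ~ B + C", data=df).fit()
print(res.summary())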


B is not statistically significant here, and this tiny data set cannot support real inferences. Even so, the fitted model shows that the value of C shifts the predictions of A, as the example below demonstrates (after capping the outlier in C):

 import numpy as np
 import pandas as pd
 import statsmodels.formula.api as smf
 import matplotlib.pyplot as plt

 df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})

 # Cap the outlier in C: replace above-mean values with the mean of the
 # below-mean values.
 avg_c = df['C'].mean()
 sumC = df['C'].apply(lambda x: x if x < avg_c else 0).sum()
 countC = df['C'].apply(lambda x: 1 if x < avg_c else None).count()
 avg_c2 = sumC / countC
 df['C'] = df['C'].apply(lambda x: avg_c2 if x > avg_c else x)

 print(df)

 # Fit OLS with the formula interface
 model_ols = smf.ols("A ~ B + C", data=df).fit()

 print(model_ols.summary())

 df[['B','C']].plot()
 plt.show()


 # Predict A over a range of B, holding C fixed at two different values
 df2 = pd.DataFrame()
 df2['B'] = np.linspace(10, 50, 10)
 df2['C'] = 30

 df3 = pd.DataFrame()
 df3['B'] = np.linspace(10, 50, 10)
 df3['C'] = 100

 predB = model_ols.predict(df2)
 predC = model_ols.predict(df3)
 plt.plot(df2['B'], predB, label='predicted A, C=30')
 plt.plot(df3['B'], predC, label='predicted A, C=100')
 plt.legend()
 plt.show()

 print("A change in C shifts the predicted values of A")

 intercept = model_ols.params.loc['Intercept']
 B_slope = model_ols.params.loc['B']
 C_slope = model_ols.params.loc['C']
 # Intercept    11.874252
 # B             0.760859
 # C            -0.060257

 print("Intercept {}\nB slope {}\nC slope {}\n".format(intercept, B_slope, C_slope))


 # Confidence intervals come straight from the fit; no exp() needed for OLS
 # (exponentiating is only for odds ratios in logistic regression):
 # conf = model_ols.conf_int()
 # print(conf)

 model_cov = model_ols.cov_params()
 std_errorB = np.sqrt(model_cov.loc['B', 'B'])
 std_errorC = np.sqrt(model_cov.loc['C', 'C'])
 print('SE: ', round(std_errorB, 4), round(std_errorC, 4))
 # Coefficient divided by standard error is the t-statistic
 print("B t value {} C t value {}".format((B_slope / std_errorB), (C_slope / std_errorC)))
 print("B feature is more statistically significant than C")

 Output:

 A change in C shifts the predicted values of A
 Intercept 11.874251554067563
 B slope 0.7608594144571961
 C slope -0.060256845997223814

 SE:  0.4519 0.0793
 B t value 1.683510336937001 C t value -0.7601036314930376
 B feature is more statistically significant than C

 |t| > 2 is a rough rule of thumb for statistical significance
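
Rather than computing the t-statistics by hand from the covariance matrix, the fitted statsmodels result exposes them directly. A short sketch using the same model_ols as above:

 # idiomatic significance checks on a fitted OLS result
 print(model_ols.tvalues)     # t-statistics for each coefficient
 print(model_ols.pvalues)     # two-sided p-values
 print(model_ols.conf_int())  # 95% confidence intervals as a DataFrame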

Statsmodels can build an OLS model with column references directly to a pandas dataframe.

Short and sweet:

model = sm.OLS(df[y], df[x]).fit()

(Note that this fits without an intercept; adding one with sm.add_constant is shown below.)


Code details and regression summary:

# imports
import pandas as pd
import statsmodels.api as sm
import numpy as np

# data
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=list('ABC'))

# assign dependent and independent / explanatory variables
variables = list(df.columns)
y = 'A'
x = [var for var in variables if var != y]  # everything except the target

# Ordinary least squares regression, without an intercept term
model_Simple = sm.OLS(df[y], df[x]).fit()

# Add a constant term (intercept) like so:
model = sm.OLS(df[y], sm.add_constant(df[x])).fit()

model.summary()

Output:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      A   R-squared:                       0.019
Model:                            OLS   Adj. R-squared:                 -0.001
Method:                 Least Squares   F-statistic:                    0.9409
Date:                Thu, 14 Feb 2019   Prob (F-statistic):              0.394
Time:                        08:35:04   Log-Likelihood:                -484.49
No. Observations:                 100   AIC:                             975.0
Df Residuals:                      97   BIC:                             982.8
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         43.4801      8.809      4.936      0.000      25.996      60.964
B              0.1241      0.105      1.188      0.238      -0.083       0.332
C             -0.0752      0.110     -0.681      0.497      -0.294       0.144
==============================================================================
Omnibus:                       50.990   Durbin-Watson:                   2.013
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                6.905
Skew:                           0.032   Prob(JB):                       0.0317
Kurtosis:                       1.714   Cond. No.                         231.
==============================================================================

How to directly get R-squared, Coefficients and p-value:

# commands:
model.params
model.pvalues
model.rsquared

# demo:
In[1]:
model.params
Out[1]:
const    43.480106
B         0.124130
C        -0.075156
dtype: float64

In[2]:
model.pvalues
Out[2]:
const    0.000003
B        0.237924
C        0.497400
dtype: float64

In[3]:
model.rsquared
Out[3]:
0.0190
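
To get predictions from the fitted model, pass a design matrix with the same columns, constant included. A minimal sketch using the model fitted above (the new B and C values are made up for illustration):

# predictions on the training data; the design matrix must include the constant
fitted = model.predict(sm.add_constant(df[x]))

# predictions for new rows with the same explanatory columns
new = pd.DataFrame({'B': [25, 60], 'C': [30, 100]})
pred = model.predict(sm.add_constant(new))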

I don't know if this is new in sklearn or pandas, but I'm able to pass the data frame directly to sklearn without converting it to a numpy array or any other data type.

from sklearn import linear_model

reg = linear_model.LinearRegression()
reg.fit(df[['B', 'C']], df['A'])

>>> reg.coef_
array([  4.01182386e-01,   3.51587361e-04])
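
The intercept and predictions are available the same way; a short sketch with the reg fitted above:

print(reg.intercept_)                # 14.952479503953672, matching statsmodels
preds = reg.predict(df[['B', 'C']])  # predict accepts a DataFrame directly, too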

I think you can almost do exactly what you thought would be ideal, using the statsmodels package, which was one of pandas' optional dependencies before pandas version 0.20.0 (it was used for a few things in pandas.stats).

>>> import pandas as pd
>>> import statsmodels.formula.api as sm
>>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
>>> result = sm.ols(formula="A ~ B + C", data=df).fit()
>>> print(result.params)
Intercept    14.952480
B             0.401182
C             0.000352
dtype: float64
>>> print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      A   R-squared:                       0.579
Model:                            OLS   Adj. R-squared:                  0.158
Method:                 Least Squares   F-statistic:                     1.375
Date:                Thu, 14 Nov 2013   Prob (F-statistic):              0.421
Time:                        20:04:30   Log-Likelihood:                -18.178
No. Observations:                   5   AIC:                             42.36
Df Residuals:                       2   BIC:                             41.19
Df Model:                           2                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     14.9525     17.764      0.842      0.489       -61.481    91.386
B              0.4012      0.650      0.617      0.600        -2.394     3.197
C              0.0004      0.001      0.650      0.583        -0.002     0.003
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.061
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.498
Skew:                          -0.123   Prob(JB):                        0.780
Kurtosis:                       1.474   Cond. No.                     5.21e+04
==============================================================================

Warnings:
[1] The condition number is large, 5.21e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
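
A nice property of the formula interface is that the fitted result can predict straight from a data frame with the same column names, with no manual design-matrix construction. A minimal sketch (the new B and C values are made up for illustration):

new_data = pd.DataFrame({"B": [25, 60], "C": [30, 100]})
predictions = result.predict(new_data)  # intercept is handled by the formula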

"This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place."

No, it doesn't; just convert to a NumPy array:

>>> data = np.asarray(df)

This takes constant time because it just creates a view on your data. Then feed it to scikit-learn:

>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()
>>> X, y = data[:, 1:], data[:, 0]
>>> lr.fit(X, y)
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
>>> lr.coef_
array([  4.01182386e-01,   3.51587361e-04])
>>> lr.intercept_
14.952479503953672
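
If you'd rather avoid positional slicing, selecting the columns by name before converting gives the same result; a small sketch (assuming a pandas version recent enough for .to_numpy()):

X = df[['B', 'C']].to_numpy()
y = df['A'].to_numpy()
lr = LinearRegression().fit(X, y)  # same coefficients and intercept as above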
