[python] Multiple linear regression in Python

I can't seem to find any Python libraries that do multiple regression; everything I find only does simple regression. I need to regress my dependent variable (y) against several independent variables (x1, x2, x3, etc.).

For example, with this data:

print 'y        x1      x2       x3       x4      x5     x6       x7'
for t in texts:
    print "{:>7.1f}{:>10.2f}{:>9.2f}{:>9.2f}{:>10.2f}{:>7.2f}{:>7.2f}{:>9.2f}" \
        .format(t.y, t.x1, t.x2, t.x3, t.x4, t.x5, t.x6, t.x7)

(output for above:)

      y        x1       x2       x3        x4     x5     x6       x7
   -6.0     -4.95    -5.87    -0.76     14.73   4.02   0.20     0.45
   -5.0     -4.55    -4.52    -0.71     13.74   4.47   0.16     0.50
  -10.0    -10.96   -11.64    -0.98     15.49   4.18   0.19     0.53
   -5.0     -1.08    -3.36     0.75     24.72   4.96   0.16     0.60
   -8.0     -6.52    -7.45    -0.86     16.59   4.29   0.10     0.48
   -3.0     -0.81    -2.36    -0.50     22.44   4.81   0.15     0.53
   -6.0     -7.01    -7.33    -0.33     13.93   4.32   0.21     0.50
   -8.0     -4.46    -7.65    -0.94     11.40   4.43   0.16     0.49
   -8.0    -11.54   -10.03    -1.03     18.18   4.28   0.21     0.55

How would I regress these in python, to get the linear regression formula:

Y = a1x1 + a2x2 + a3x3 + a4x4 + a5x5 + a6x6 + a7x7 + c

Tags: python, numpy, statistics, scipy, linear-regression

Answers:


Here is a little workaround that I created. I checked it against R and it gives the correct results.

import numpy as np
import statsmodels.api as sm

y = [1,2,3,4,3,4,5,4,5,5,4,5,4,5,4,5,6,5,4,5,4,3,4]

x = [
     [4,2,3,4,5,4,5,6,7,4,8,9,8,8,6,6,5,5,5,5,5,5,5],
     [4,1,2,3,4,5,6,7,5,8,7,8,7,8,7,8,7,7,7,7,7,6,5],
     [4,1,2,5,6,7,8,9,7,8,7,8,7,7,7,7,7,7,6,6,4,4,4]
     ]

def reg_m(y, x):
    # Build the design matrix: stack each predictor list as a column and add a constant.
    # Note that each new list is prepended, so the x1, x2, ... labels in the summary
    # end up in the reverse order of the lists passed in x.
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    for ele in x[1:]:
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results

To print the result:

print reg_m(y, x).summary()

Output:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.535
Model:                            OLS   Adj. R-squared:                  0.461
Method:                 Least Squares   F-statistic:                     7.281
Date:                Tue, 19 Feb 2013   Prob (F-statistic):            0.00191
Time:                        21:51:28   Log-Likelihood:                -26.025
No. Observations:                  23   AIC:                             60.05
Df Residuals:                      19   BIC:                             64.59
Df Model:                           3                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.2424      0.139      1.739      0.098        -0.049     0.534
x2             0.2360      0.149      1.587      0.129        -0.075     0.547
x3            -0.0618      0.145     -0.427      0.674        -0.365     0.241
const          1.5704      0.633      2.481      0.023         0.245     2.895

==============================================================================
Omnibus:                        6.904   Durbin-Watson:                   1.905
Prob(Omnibus):                  0.032   Jarque-Bera (JB):                4.708
Skew:                          -0.849   Prob(JB):                       0.0950
Kurtosis:                       4.426   Cond. No.                         38.6
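A simpler sketch that fits the same model but stacks the predictor lists in one step and keeps them in the order passed in (so x1 in the summary corresponds to the first list):

def reg_m_ordered(y, x):
    X = sm.add_constant(np.column_stack(x))  # one column per predictor list, in order, plus a constant
    return sm.OLS(y, X).fit()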

pandas provides a convenient way to run OLS, as described in this answer:

Run an OLS regression with Pandas Data Frame
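pandas itself no longer ships an OLS routine (the old pandas.stats.ols interface was removed), so the usual pattern is to hand DataFrame columns to statsmodels. A minimal sketch, assuming the question's data is already in a DataFrame df with columns y and x1 through x7:

import pandas as pd
import statsmodels.api as sm

X = sm.add_constant(df[['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7']])  # df is assumed to exist
model = sm.OLS(df['y'], X).fit()
print(model.params)  # the intercept ('const') plus one coefficient per column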


You can use numpy.linalg.lstsq:

import numpy as np

y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])
X = np.array(
    [
        [-4.95, -4.55, -10.96, -1.08, -6.52, -0.81, -7.01, -4.46, -11.54],
        [-5.87, -4.52, -11.64, -3.36, -7.45, -2.36, -7.33, -7.65, -10.03],
        [-0.76, -0.71, -0.98, 0.75, -0.86, -0.50, -0.33, -0.94, -1.03],
        [14.73, 13.74, 15.49, 24.72, 16.59, 22.44, 13.93, 11.40, 18.18],
        [4.02, 4.47, 4.18, 4.96, 4.29, 4.81, 4.32, 4.43, 4.28],
        [0.20, 0.16, 0.19, 0.16, 0.10, 0.15, 0.21, 0.16, 0.21],
        [0.45, 0.50, 0.53, 0.60, 0.48, 0.53, 0.50, 0.49, 0.55],
    ]
)
X = X.T  # transpose so input vectors are along the rows
X = np.c_[X, np.ones(X.shape[0])]  # add bias term
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)

Result:

[ -0.49104607   0.83271938   0.0860167    0.1326091    6.85681762  22.98163883 -41.08437805 -19.08085066]

You can see the estimated output with:

print(np.dot(X,beta_hat))

Result:

[ -5.97751163,  -5.06465759, -10.16873217,  -4.96959788,  -7.96356915,  -3.06176313,  -6.01818435,  -7.90878145,  -7.86720264]

Once you convert your data to a pandas DataFrame (df), you can fit the model with the statsmodels formula interface:

import statsmodels.formula.api as smf
lm = smf.ols(formula='y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7', data=df).fit()
print(lm.params)

The intercept term is included by default.

See this notebook for more examples.
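A short follow-up on that fit: the fitted-model object exposes the usual statsmodels results, for example

print(lm.summary())   # full coefficient table, R-squared, p-values
print(lm.predict(df)) # fitted values for the rows in df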


Just to clarify, the example you gave is multiple linear regression, not multivariate linear regression. The difference:

The very simplest case of a single scalar predictor variable x and a single scalar response variable y is known as simple linear regression. The extension to multiple and/or vector-valued predictor variables (denoted with a capital X) is known as multiple linear regression, also known as multivariable linear regression. Nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple regression model. Note, however, that in these cases the response variable y is still a scalar. Another term multivariate linear regression refers to cases where y is a vector, i.e., the same as general linear regression. The difference between multivariate linear regression and multivariable linear regression should be emphasized as it causes much confusion and misunderstanding in the literature.

In short:

  • multiple linear regression: the response y is a scalar.
  • multivariate linear regression: the response y is a vector.

(Another source.)
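To make the distinction concrete, here is a small illustrative sketch with made-up numbers (using scikit-learn's LinearRegression, which accepts both a scalar and a vector-valued response):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])          # two predictors
y_scalar = np.array([3.1, 2.9, 7.2, 6.8])                               # multiple regression: y is a scalar per row
Y_vector = np.array([[3.1, 0.5], [2.9, 0.4], [7.2, 1.1], [6.8, 1.0]])   # multivariate regression: y is a vector per row

print(LinearRegression().fit(X, y_scalar).coef_.shape)  # (2,)   one coefficient per predictor
print(LinearRegression().fit(X, Y_vector).coef_.shape)  # (2, 2) one coefficient vector per response component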


Here is an alternative and basic method:

from patsy import dmatrices
import statsmodels.api as sm

y,x = dmatrices("y_data ~ x_1 + x_2 ", data = my_data)
### y_data is the name of the dependent variable in your data ### 
model_fit = sm.OLS(y,x)
results = model_fit.fit()
print(results.summary())

Instead of sm.OLS you can also use sm.Logit, sm.Probit, etc.
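For instance, a minimal logistic-regression sketch with the same patsy design matrices (assuming the y_data column in my_data is binary):

y, x = dmatrices("y_data ~ x_1 + x_2", data=my_data)
logit_results = sm.Logit(y, x).fit()
print(logit_results.summary())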


Fitting a linear model such as this one can be handled with OpenTURNS.

In OpenTURNS this is done with the LinearModelAlgorithm class, which creates a linear model from numerical samples. More specifically, it builds the following linear model:

Y = a0 + a1*X1 + ... + an*Xn + epsilon,

where the error epsilon is Gaussian with zero mean and unit variance. Assuming your data is in a CSV file, here is a simple script to get the regression coefficients ai:

from __future__ import print_function
import pandas as pd
import openturns as ot

# Assuming the data is a csv file with the given structure                          
# Y X1 X2 .. X7
df = pd.read_csv("./data.csv", sep=r"\s+")

# Build a sample from the pandas dataframe
sample = ot.Sample(df.values)

# The observation points are in the first column (dimension 1)
Y = sample[:, 0]

# The input vector (X1,..,X7) of dimension 7
X = sample[:, 1::]

# Build a Linear model approximation
result = ot.LinearModelAlgorithm(X, Y).getResult()

# Get the coefficients ai
print("coefficients of the linear regression model = ", result.getCoefficients())

You can then easily get the confidence intervals with the following call:

# Get the confidence intervals at 90% of the ai coefficients
print(
    "confidence intervals of the coefficients = ",
    ot.LinearModelAnalysis(result).getCoefficientsConfidenceInterval(0.9),
)

You may find a more detailed example in the OpenTURNS examples.


I think this may be the easiest way to do this:

from random import random
from pandas import DataFrame
from statsmodels.api import OLS
lr = lambda : [random() for i in range(100)]
x = DataFrame({'x1': lr(), 'x2':lr(), 'x3':lr()})
x['b'] = 1
y = x.x1 + x.x2 * 2 + x.x3 * 3 + 4

print x.head()

         x1        x2        x3  b
0  0.433681  0.946723  0.103422  1
1  0.400423  0.527179  0.131674  1
2  0.992441  0.900678  0.360140  1
3  0.413757  0.099319  0.825181  1
4  0.796491  0.862593  0.193554  1

print y.head()

0    6.637392
1    5.849802
2    7.874218
3    7.087938
4    7.102337
dtype: float64

model = OLS(y, x)
result = model.fit()
print result.summary()

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 5.859e+30
Date:                Wed, 09 Dec 2015   Prob (F-statistic):               0.00
Time:                        15:17:32   Log-Likelihood:                 3224.9
No. Observations:                 100   AIC:                            -6442.
Df Residuals:                      96   BIC:                            -6431.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             1.0000   8.98e-16   1.11e+15      0.000         1.000     1.000
x2             2.0000   8.28e-16   2.41e+15      0.000         2.000     2.000
x3             3.0000   8.34e-16    3.6e+15      0.000         3.000     3.000
b              4.0000   8.51e-16    4.7e+15      0.000         4.000     4.000
==============================================================================
Omnibus:                        7.675   Durbin-Watson:                   1.614
Prob(Omnibus):                  0.022   Jarque-Bera (JB):                3.118
Skew:                           0.045   Prob(JB):                        0.210
Kurtosis:                       2.140   Cond. No.                         6.89
==============================================================================

Try a generalized linear model with a Gaussian family:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import glm

y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])
X = np.array([
    [-4.95, -4.55, -10.96, -1.08, -6.52, -0.81, -7.01, -4.46, -11.54],
    [-5.87, -4.52, -11.64, -3.36, -7.45, -2.36, -7.33, -7.65, -10.03],
    [-0.76, -0.71, -0.98, 0.75, -0.86, -0.50, -0.33, -0.94, -1.03],
    [14.73, 13.74, 15.49, 24.72, 16.59, 22.44, 13.93, 11.40, 18.18],
    [4.02, 4.47, 4.18, 4.96, 4.29, 4.81, 4.32, 4.43, 4.28],
    [0.20, 0.16, 0.19, 0.16, 0.10, 0.15, 0.21, 0.16, 0.21],
    [0.45, 0.50, 0.53, 0.60, 0.48, 0.53, 0.50, 0.49, 0.55],
])
X = list(zip(*reversed(X)))  # one tuple of predictors per observation; the columns come out in reverse order (X0 is the original x7)

df = pd.DataFrame({'X': X, 'y': y})
columns=7
for i in range(0,columns):
    df['X'+str(i)]=df.apply(lambda row: row['X'][i],axis=1)

df=df.drop('X',axis=1)
print(df)


#model_formula='y ~ X0+X1+X2+X3+X4+X5+X6'
model_formula='y ~ X0'

model_family = sm.families.Gaussian()
model_fit = glm(formula = model_formula, 
             data = df, 
             family = model_family).fit()

print(model_fit.summary())

# Extract coefficients from the fitted model wells_fit
#print(model_fit.params)
intercept, slope = model_fit.params

# Print coefficients
print('Intercept =', intercept)
print('Slope =', slope)

# Extract and print confidence intervals
print(model_fit.conf_int())

df2=pd.DataFrame()
df2['X0']=np.linspace(0.50,0.70,50)

df3=pd.DataFrame()
df3['X1']=np.linspace(0.20,0.60,50)

prediction0=model_fit.predict(df2)
#prediction1=model_fit.predict(df3)

plt.plot(df2['X0'],prediction0,label='X0')
plt.ylabel("y")
plt.xlabel("X0")
plt.show()
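The fit above uses only X0; the commented-out formula fits all seven predictors at once. A minimal sketch of that variant, reusing the df built above:

model_fit_all = glm(formula='y ~ X0+X1+X2+X3+X4+X5+X6',
                    data=df,
                    family=sm.families.Gaussian()).fit()
print(model_fit_all.params)  # intercept plus one coefficient per X column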

Multiple linear regression can be handled using the sklearn library. I'm using the Anaconda install of Python 3.6.

Create your model as follows:

from sklearn.linear_model import LinearRegression

# X: 2-D array (or DataFrame) of predictors, y: 1-D array of responses
regressor = LinearRegression()
regressor.fit(X, y)

# display coefficients
print(regressor.coef_)
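The intercept is stored separately from the slope coefficients. A short follow-up, assuming the regressor fitted above:

print(regressor.intercept_)  # the constant term c
print(regressor.predict(X))  # fitted values for the training rows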

Use scipy.optimize.curve_fit. It works for more than just linear fits.

from scipy.optimize import curve_fit
import numpy as np

def fn(x, a, b, c):
    return a + b*x[0] + c*x[1]

# y(x0,x1) data:
#    x0=0 1 2
# ___________
# x1=0 |0 1 2
# x1=1 |1 2 3
# x1=2 |2 3 4

x = np.array([[0, 1, 2, 0, 1, 2, 0, 1, 2], [0, 0, 0, 1, 1, 1, 2, 2, 2]])
y = np.array([0, 1, 2, 1, 2, 3, 2, 3, 4])
popt, pcov = curve_fit(fn, x, y)
print(popt)
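The same idea extends to the question's seven predictors. A hedged sketch, assuming x is a (7, n) array of predictor rows and y the matching responses:

def fn7(x, c, a1, a2, a3, a4, a5, a6, a7):
    return c + a1*x[0] + a2*x[1] + a3*x[2] + a4*x[3] + a5*x[4] + a6*x[5] + a7*x[6]

popt, pcov = curve_fit(fn7, x, y)  # popt holds [c, a1, ..., a7]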

You can use the function below and pass it a DataFrame:

def linear(x, y=None, show=True):
    """
    @param x: pd.DataFrame
    @param y: pd.DataFrame or pd.Series or None
              if None, then use the last column of x as y
    @param show: whether to print the regression summary
    """
    import pandas as pd
    import statsmodels.api as sm

    xy = sm.add_constant(x if y is None else pd.concat([x, y], axis=1))
    res = sm.OLS(xy.iloc[:, -1], xy.iloc[:, :-1], missing='drop').fit()

    if show:
        print(res.summary())
    return res

You can use numpy.linalg.lstsq


Scikit-learn is a machine learning library for Python which can do this job for you. Just import the sklearn.linear_model module into your script.

Here is a code template for multiple linear regression using sklearn in Python:

import numpy as np
import matplotlib.pyplot as plt #to plot visualizations
import pandas as pd

# Importing the dataset
df = pd.read_csv(<Your-dataset-path>)
# Assigning feature and target variables
X = df.iloc[:,:-1]
y = df.iloc[:,-1]

# Use label encoders, if you have any categorical variable
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
X['<column-name>'] = labelencoder.fit_transform(X['<column-name>'])

from sklearn.preprocessing import OneHotEncoder
# Note: the categorical_features argument was removed in newer scikit-learn versions;
# there you would use sklearn.compose.ColumnTransformer with an OneHotEncoder instead.
onehotencoder = OneHotEncoder(categorical_features = [<index-value>])
X = onehotencoder.fit_transform(X).toarray()

# Avoiding the dummy variable trap
X = X[:,1:] # Drop one dummy column (newer scikit-learn can do this via OneHotEncoder(drop='first'))

# Splitting the data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 0, test_size = 0.2)

# Fitting the model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the test set results
y_pred = regressor.predict(X_test)

That's it. You can use this code as a template for implementing multiple linear regression on any dataset. For a better understanding with an example, visit: Linear Regression with an example
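To check how well the fitted template generalizes, a small follow-up on the held-out split created above:

from sklearn.metrics import r2_score, mean_squared_error

print(r2_score(y_test, y_pred))            # coefficient of determination on the test set
print(mean_squared_error(y_test, y_pred))  # mean squared prediction error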

