I can't seem to find any python libraries that do multiple regression. The only things I find only do simple regression. I need to regress my dependent variable (y) against several independent variables (x1, x2, x3, etc.).
For example, with this data:
print 'y x1 x2 x3 x4 x5 x6 x7'
for t in texts:
print "{:>7.1f}{:>10.2f}{:>9.2f}{:>9.2f}{:>10.2f}{:>7.2f}{:>7.2f}{:>9.2f}" /
.format(t.y,t.x1,t.x2,t.x3,t.x4,t.x5,t.x6,t.x7)
(output for above:)
y x1 x2 x3 x4 x5 x6 x7
-6.0 -4.95 -5.87 -0.76 14.73 4.02 0.20 0.45
-5.0 -4.55 -4.52 -0.71 13.74 4.47 0.16 0.50
-10.0 -10.96 -11.64 -0.98 15.49 4.18 0.19 0.53
-5.0 -1.08 -3.36 0.75 24.72 4.96 0.16 0.60
-8.0 -6.52 -7.45 -0.86 16.59 4.29 0.10 0.48
-3.0 -0.81 -2.36 -0.50 22.44 4.81 0.15 0.53
-6.0 -7.01 -7.33 -0.33 13.93 4.32 0.21 0.50
-8.0 -4.46 -7.65 -0.94 11.40 4.43 0.16 0.49
-8.0 -11.54 -10.03 -1.03 18.18 4.28 0.21 0.55
How would I regress these in python, to get the linear regression formula:
Y = a1x1 + a2x2 + a3x3 + a4x4 + a5x5 + a6x6 + +a7x7 + c
This question is related to
python
numpy
statistics
scipy
linear-regression
Here is a little work around that I created. I checked it with R and it works correct.
import numpy as np
import statsmodels.api as sm
y = [1,2,3,4,3,4,5,4,5,5,4,5,4,5,4,5,6,5,4,5,4,3,4]
x = [
[4,2,3,4,5,4,5,6,7,4,8,9,8,8,6,6,5,5,5,5,5,5,5],
[4,1,2,3,4,5,6,7,5,8,7,8,7,8,7,8,7,7,7,7,7,6,5],
[4,1,2,5,6,7,8,9,7,8,7,8,7,7,7,7,7,7,6,6,4,4,4]
]
def reg_m(y, x):
ones = np.ones(len(x[0]))
X = sm.add_constant(np.column_stack((x[0], ones)))
for ele in x[1:]:
X = sm.add_constant(np.column_stack((ele, X)))
results = sm.OLS(y, X).fit()
return results
Result:
print reg_m(y, x).summary()
Output:
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.535
Model: OLS Adj. R-squared: 0.461
Method: Least Squares F-statistic: 7.281
Date: Tue, 19 Feb 2013 Prob (F-statistic): 0.00191
Time: 21:51:28 Log-Likelihood: -26.025
No. Observations: 23 AIC: 60.05
Df Residuals: 19 BIC: 64.59
Df Model: 3
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1 0.2424 0.139 1.739 0.098 -0.049 0.534
x2 0.2360 0.149 1.587 0.129 -0.075 0.547
x3 -0.0618 0.145 -0.427 0.674 -0.365 0.241
const 1.5704 0.633 2.481 0.023 0.245 2.895
==============================================================================
Omnibus: 6.904 Durbin-Watson: 1.905
Prob(Omnibus): 0.032 Jarque-Bera (JB): 4.708
Skew: -0.849 Prob(JB): 0.0950
Kurtosis: 4.426 Cond. No. 38.6
pandas
provides a convenient way to run OLS as given in this answer:
You can use numpy.linalg.lstsq:
import numpy as np
y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])
X = np.array(
[
[-4.95, -4.55, -10.96, -1.08, -6.52, -0.81, -7.01, -4.46, -11.54],
[-5.87, -4.52, -11.64, -3.36, -7.45, -2.36, -7.33, -7.65, -10.03],
[-0.76, -0.71, -0.98, 0.75, -0.86, -0.50, -0.33, -0.94, -1.03],
[14.73, 13.74, 15.49, 24.72, 16.59, 22.44, 13.93, 11.40, 18.18],
[4.02, 4.47, 4.18, 4.96, 4.29, 4.81, 4.32, 4.43, 4.28],
[0.20, 0.16, 0.19, 0.16, 0.10, 0.15, 0.21, 0.16, 0.21],
[0.45, 0.50, 0.53, 0.60, 0.48, 0.53, 0.50, 0.49, 0.55],
]
)
X = X.T # transpose so input vectors are along the rows
X = np.c_[X, np.ones(X.shape[0])] # add bias term
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)
Result:
[ -0.49104607 0.83271938 0.0860167 0.1326091 6.85681762 22.98163883 -41.08437805 -19.08085066]
You can see the estimated output with:
print(np.dot(X,beta_hat))
Result:
[ -5.97751163, -5.06465759, -10.16873217, -4.96959788, -7.96356915, -3.06176313, -6.01818435, -7.90878145, -7.86720264]
Once you convert your data to a pandas dataframe (df
),
import statsmodels.formula.api as smf
lm = smf.ols(formula='y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7', data=df).fit()
print(lm.params)
The intercept term is included by default.
See this notebook for more examples.
Just to clarify, the example you gave is multiple linear regression, not multivariate linear regression refer. Difference:
The very simplest case of a single scalar predictor variable x and a single scalar response variable y is known as simple linear regression. The extension to multiple and/or vector-valued predictor variables (denoted with a capital X) is known as multiple linear regression, also known as multivariable linear regression. Nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple regression model. Note, however, that in these cases the response variable y is still a scalar. Another term multivariate linear regression refers to cases where y is a vector, i.e., the same as general linear regression. The difference between multivariate linear regression and multivariable linear regression should be emphasized as it causes much confusion and misunderstanding in the literature.
In short:
(Another source.)
Here is an alternative and basic method:
from patsy import dmatrices
import statsmodels.api as sm
y,x = dmatrices("y_data ~ x_1 + x_2 ", data = my_data)
### y_data is the name of the dependent variable in your data ###
model_fit = sm.OLS(y,x)
results = model_fit.fit()
print(results.summary())
Instead of sm.OLS
you can also use sm.Logit
or sm.Probit
and etc.
Finding a linear model such as this one can be handled with OpenTURNS.
In OpenTURNS this is done with the LinearModelAlgorithm
class which creates a linear model from numerical samples. To be more specific, it builds the following linear model :
Y = a0 + a1.X1 + ... + an.Xn + epsilon,
where the error epsilon is gaussian with zero mean and unit variance. Assuming your data is in a csv file, here is a simple script to get the regression coefficients ai :
from __future__ import print_function
import pandas as pd
import openturns as ot
# Assuming the data is a csv file with the given structure
# Y X1 X2 .. X7
df = pd.read_csv("./data.csv", sep="\s+")
# Build a sample from the pandas dataframe
sample = ot.Sample(df.values)
# The observation points are in the first column (dimension 1)
Y = sample[:, 0]
# The input vector (X1,..,X7) of dimension 7
X = sample[:, 1::]
# Build a Linear model approximation
result = ot.LinearModelAlgorithm(X, Y).getResult()
# Get the coefficients ai
print("coefficients of the linear regression model = ", result.getCoefficients())
You can then easily get the confidence intervals with the following call :
# Get the confidence intervals at 90% of the ai coefficients
print(
"confidence intervals of the coefficients = ",
ot.LinearModelAnalysis(result).getCoefficientsConfidenceInterval(0.9),
)
You may find a more detailed example in the OpenTURNS examples.
I think this may the most easy way to finish this work:
from random import random
from pandas import DataFrame
from statsmodels.api import OLS
lr = lambda : [random() for i in range(100)]
x = DataFrame({'x1': lr(), 'x2':lr(), 'x3':lr()})
x['b'] = 1
y = x.x1 + x.x2 * 2 + x.x3 * 3 + 4
print x.head()
x1 x2 x3 b
0 0.433681 0.946723 0.103422 1
1 0.400423 0.527179 0.131674 1
2 0.992441 0.900678 0.360140 1
3 0.413757 0.099319 0.825181 1
4 0.796491 0.862593 0.193554 1
print y.head()
0 6.637392
1 5.849802
2 7.874218
3 7.087938
4 7.102337
dtype: float64
model = OLS(y, x)
result = model.fit()
print result.summary()
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 5.859e+30
Date: Wed, 09 Dec 2015 Prob (F-statistic): 0.00
Time: 15:17:32 Log-Likelihood: 3224.9
No. Observations: 100 AIC: -6442.
Df Residuals: 96 BIC: -6431.
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1 1.0000 8.98e-16 1.11e+15 0.000 1.000 1.000
x2 2.0000 8.28e-16 2.41e+15 0.000 2.000 2.000
x3 3.0000 8.34e-16 3.6e+15 0.000 3.000 3.000
b 4.0000 8.51e-16 4.7e+15 0.000 4.000 4.000
==============================================================================
Omnibus: 7.675 Durbin-Watson: 1.614
Prob(Omnibus): 0.022 Jarque-Bera (JB): 3.118
Skew: 0.045 Prob(JB): 0.210
Kurtosis: 2.140 Cond. No. 6.89
==============================================================================
try a generalized linear model with a gaussian family
y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])
X = np.array([
[-4.95, -4.55, -10.96, -1.08, -6.52, -0.81, -7.01, -4.46, -11.54],
[-5.87, -4.52, -11.64, -3.36, -7.45, -2.36, -7.33, -7.65, -10.03],
[-0.76, -0.71, -0.98, 0.75, -0.86, -0.50, -0.33, -0.94, -1.03],
[14.73, 13.74, 15.49, 24.72, 16.59, 22.44, 13.93, 11.40, 18.18],
[4.02, 4.47, 4.18, 4.96, 4.29, 4.81, 4.32, 4.43, 4.28],
[0.20, 0.16, 0.19, 0.16, 0.10, 0.15, 0.21, 0.16, 0.21],
[0.45, 0.50, 0.53, 0.60, 0.48, 0.53, 0.50, 0.49, 0.55],
])
X=zip(*reversed(X))
df=pd.DataFrame({'X':X,'y':y})
columns=7
for i in range(0,columns):
df['X'+str(i)]=df.apply(lambda row: row['X'][i],axis=1)
df=df.drop('X',axis=1)
print(df)
#model_formula='y ~ X0+X1+X2+X3+X4+X5+X6'
model_formula='y ~ X0'
model_family = sm.families.Gaussian()
model_fit = glm(formula = model_formula,
data = df,
family = model_family).fit()
print(model_fit.summary())
# Extract coefficients from the fitted model wells_fit
#print(model_fit.params)
intercept, slope = model_fit.params
# Print coefficients
print('Intercept =', intercept)
print('Slope =', slope)
# Extract and print confidence intervals
print(model_fit.conf_int())
df2=pd.DataFrame()
df2['X0']=np.linspace(0.50,0.70,50)
df3=pd.DataFrame()
df3['X1']=np.linspace(0.20,0.60,50)
prediction0=model_fit.predict(df2)
#prediction1=model_fit.predict(df3)
plt.plot(df2['X0'],prediction0,label='X0')
plt.ylabel("y")
plt.xlabel("X0")
plt.show()
Multiple Linear Regression can be handled using the sklearn library as referenced above. I'm using the Anaconda install of Python 3.6.
Create your model as follows:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X, y)
# display coefficients
print(regressor.coef_)
Use scipy.optimize.curve_fit
. And not only for linear fit.
from scipy.optimize import curve_fit
import scipy
def fn(x, a, b, c):
return a + b*x[0] + c*x[1]
# y(x0,x1) data:
# x0=0 1 2
# ___________
# x1=0 |0 1 2
# x1=1 |1 2 3
# x1=2 |2 3 4
x = scipy.array([[0,1,2,0,1,2,0,1,2,],[0,0,0,1,1,1,2,2,2]])
y = scipy.array([0,1,2,1,2,3,2,3,4])
popt, pcov = curve_fit(fn, x, y)
print popt
You can use the function below and pass it a DataFrame:
def linear(x, y=None, show=True):
"""
@param x: pd.DataFrame
@param y: pd.DataFrame or pd.Series or None
if None, then use last column of x as y
@param show: if show regression summary
"""
import statsmodels.api as sm
xy = sm.add_constant(x if y is None else pd.concat([x, y], axis=1))
res = sm.OLS(xy.ix[:, -1], xy.ix[:, :-1], missing='drop').fit()
if show: print res.summary()
return res
You can use numpy.linalg.lstsq
Scikit-learn is a machine learning library for Python which can do this job for you. Just import sklearn.linear_model module into your script.
Find the code template for Multiple Linear Regression using sklearn in Python:
import numpy as np
import matplotlib.pyplot as plt #to plot visualizations
import pandas as pd
# Importing the dataset
df = pd.read_csv(<Your-dataset-path>)
# Assigning feature and target variables
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
# Use label encoders, if you have any categorical variable
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
X['<column-name>'] = labelencoder.fit_transform(X['<column-name>'])
from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder(categorical_features = ['<index-value>'])
X = onehotencoder.fit_transform(X).toarray()
# Avoiding the dummy variable trap
X = X[:,1:] # Usually done by the algorithm itself
#Spliting the data into test and train set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 0, test_size = 0.2)
# Fitting the model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Predicting the test set results
y_pred = regressor.predict(X_test)
That's it. You can use this code as a template for implementing Multiple Linear Regression in any dataset. For a better understanding with an example, Visit: Linear Regression with an example
Source: Stackoverflow.com