[python] How to overplot a line on a scatter plot in python?

I have two vectors of data and I've put them into matplotlib.scatter(). Now I'd like to over plot a linear fit to these data. How would I do this? I've tried using scikitlearn and np.scatter.

This question is related to python numpy matplotlib

The answer is


I'm partial to scikits.statsmodels. Here an example:

import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt

X = np.random.rand(100)
Y = X + np.random.rand(100)*0.1

results = sm.OLS(Y,sm.add_constant(X)).fit()

print results.summary()

plt.scatter(X,Y)

X_plot = np.linspace(0,1,100)
plt.plot(X_plot, X_plot*results.params[0] + results.params[1])

plt.show()

The only tricky part is sm.add_constant(X) which adds a columns of ones to X in order to get an intercept term.

     Summary of Regression Results
=======================================
| Dependent Variable:            ['y']|
| Model:                           OLS|
| Method:                Least Squares|
| Date:               Sat, 28 Sep 2013|
| Time:                       09:22:59|
| # obs:                         100.0|
| Df residuals:                   98.0|
| Df model:                        1.0|
==============================================================================
|                   coefficient     std. error    t-statistic          prob. |
------------------------------------------------------------------------------
| x1                      1.007       0.008466       118.9032         0.0000 |
| const                 0.05165       0.005138        10.0515         0.0000 |
==============================================================================
|                          Models stats                      Residual stats  |
------------------------------------------------------------------------------
| R-squared:                     0.9931   Durbin-Watson:              1.484  |
| Adjusted R-squared:            0.9930   Omnibus:                    12.16  |
| F-statistic:                1.414e+04   Prob(Omnibus):           0.002294  |
| Prob (F-statistic):        9.137e-108   JB:                        0.6818  |
| Log likelihood:                 223.8   Prob(JB):                  0.7111  |
| AIC criterion:                 -443.7   Skew:                     -0.2064  |
| BIC criterion:                 -438.5   Kurtosis:                   2.048  |
------------------------------------------------------------------------------

example plot


You can use this tutorial by Adarsh Menon https://towardsdatascience.com/linear-regression-in-6-lines-of-python-5e1d0cd05b8d

This way is the easiest I found and it basically looks like:

import numpy as np
import matplotlib.pyplot as plt  # To visualize
import pandas as pd  # To read data
from sklearn.linear_model import LinearRegression
data = pd.read_csv('data.csv')  # load data set
X = data.iloc[:, 0].values.reshape(-1, 1)  # values converts it into a numpy array
Y = data.iloc[:, 1].values.reshape(-1, 1)  # -1 means that calculate the dimension of rows, but have 1 column
linear_regressor = LinearRegression()  # create object for the class
linear_regressor.fit(X, Y)  # perform linear regression
Y_pred = linear_regressor.predict(X)  # make predictions
plt.scatter(X, Y)
plt.plot(X, Y_pred, color='red')
plt.show()

plt.plot(X_plot, X_plot*results.params[0] + results.params[1])

versus

plt.plot(X_plot, X_plot*results.params[1] + results.params[0])

A one-line version of this excellent answer to plot the line of best fit is:

plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))

Using np.unique(x) instead of x handles the case where x isn't sorted or has duplicate values.

The call to poly1d is an alternative to writing out m*x + b like in this other excellent answer.


I like Seaborn's regplot or lmplot for this:

enter image description here


Another way to do it, using axes.get_xlim():

import matplotlib.pyplot as plt
import numpy as np

def scatter_plot_with_correlation_line(x, y, graph_filepath):
    '''
    http://stackoverflow.com/a/34571821/395857
    x does not have to be ordered.
    '''
    # Create scatter plot
    plt.scatter(x, y)

    # Add correlation line
    axes = plt.gca()
    m, b = np.polyfit(x, y, 1)
    X_plot = np.linspace(axes.get_xlim()[0],axes.get_xlim()[1],100)
    plt.plot(X_plot, m*X_plot + b, '-')

    # Save figure
    plt.savefig(graph_filepath, dpi=300, format='png', bbox_inches='tight')

def main():
    # Data
    x = np.random.rand(100)
    y = x + np.random.rand(100)*0.1

    # Plot
    scatter_plot_with_correlation_line(x, y, 'scatter_plot.png')

if __name__ == "__main__":
    main()
    #cProfile.run('main()') # if you want to do some profiling

enter image description here


Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to numpy

Unable to allocate array with shape and data type How to fix 'Object arrays cannot be loaded when allow_pickle=False' for imdb.load_data() function? Numpy, multiply array with scalar TypeError: only integer scalar arrays can be converted to a scalar index with 1D numpy indices array Could not install packages due to a "Environment error :[error 13]: permission denied : 'usr/local/bin/f2py'" Pytorch tensor to numpy array Numpy Resize/Rescale Image what does numpy ndarray shape do? How to round a numpy array? numpy array TypeError: only integer scalar arrays can be converted to a scalar index

Examples related to matplotlib

"UserWarning: Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure." when plotting figure with pyplot on Pycharm How to increase image size of pandas.DataFrame.plot in jupyter notebook? How to create a stacked bar chart for my DataFrame using seaborn? How to display multiple images in one figure correctly? Edit seaborn legend How to hide axes and gridlines in Matplotlib (python) How to set x axis values in matplotlib python? How to specify legend position in matplotlib in graph coordinates Python "TypeError: unhashable type: 'slice'" for encoding categorical data Seaborn Barplot - Displaying Values