[r] How to fit a smooth curve to my data in R?

The other answers are all good approaches. However, there are a few other options in R that haven't been mentioned, including lowess and approx, which may give better fits or faster performance.

The advantages are more easily demonstrated with an alternate dataset:

sigmoid <- function(x)
{
  y<-1/(1+exp(-.15*(x-100)))
  return(y)
}

dat<-data.frame(x=rnorm(5000)*30+100)
dat$y<-as.numeric(as.logical(round(sigmoid(dat$x)+rnorm(5000)*.3,0)))

Here is the data overlaid with the sigmoid curve that generated it:

Data

This sort of data is common when looking at a binary behavior among a population. For example, this might be a plot of whether or not a customer purchased something (a binary 1/0 on the y-axis) versus the amount of time they spent on the site (x-axis).

A large number of points are used to better demonstrate the performance differences of these functions.

Smooth, spline, and smooth.spline all produce gibberish on a dataset like this with any set of parameters I have tried, perhaps due to their tendency to map to every point, which does not work for noisy data.

The loess, lowess, and approx functions all produce usable results, although just barely for approx. This is the code for each using lightly optimized parameters:

loessFit <- loess(y~x, dat, span = 0.6)
loessFit <- data.frame(x=loessFit$x,y=loessFit$fitted)
loessFit <- loessFit[order(loessFit$x),]

approxFit <- approx(dat,n = 15)

lowessFit <-data.frame(lowess(dat,f = .6,iter=1))

And the results:

plot(dat,col='gray')
curve(sigmoid,0,200,add=TRUE,col='blue',)
lines(lowessFit,col='red')
lines(loessFit,col='green')
lines(approxFit,col='purple')
legend(150,.6,
       legend=c("Sigmoid","Loess","Lowess",'Approx'),
       lty=c(1,1),
       lwd=c(2.5,2.5),col=c("blue","green","red","purple"))

Fits

As you can see, lowess produces a near perfect fit to the original generating curve. Loess is close, but experiences a strange deviation at both tails.

Although your dataset will be very different, I have found that other datasets perform similarly, with both loess and lowess capable of producing good results. The differences become more significant when you look at benchmarks:

> microbenchmark::microbenchmark(loess(y~x, dat, span = 0.6),approx(dat,n = 20),lowess(dat,f = .6,iter=1),times=20)
Unit: milliseconds
                           expr        min         lq       mean     median        uq        max neval cld
  loess(y ~ x, dat, span = 0.6) 153.034810 154.450750 156.794257 156.004357 159.23183 163.117746    20   c
            approx(dat, n = 20)   1.297685   1.346773   1.689133   1.441823   1.86018   4.281735    20 a  
 lowess(dat, f = 0.6, iter = 1)   9.637583  10.085613  11.270911  11.350722  12.33046  12.495343    20  b 

Loess is extremely slow, taking 100x as long as approx. Lowess produces better results than approx, while still running fairly quickly (15x faster than loess).

Loess also becomes increasingly bogged down as the number of points increases, becoming unusable around 50,000.

EDIT: Additional research shows that loess gives better fits for certain datasets. If you are dealing with a small dataset or performance is not a consideration, try both functions and compare the results.

Examples related to r

How to get AIC from Conway–Maxwell-Poisson regression via COM-poisson package in R? R : how to simply repeat a command? session not created: This version of ChromeDriver only supports Chrome version 74 error with ChromeDriver Chrome using Selenium How to show code but hide output in RMarkdown? remove kernel on jupyter notebook Function to calculate R2 (R-squared) in R Center Plot title in ggplot2 R ggplot2: stat_count() must not be used with a y aesthetic error in Bar graph R multiple conditions in if statement What does "The following object is masked from 'package:xxx'" mean?

Examples related to plot

Fine control over the font size in Seaborn plots for academic papers Why do many examples use `fig, ax = plt.subplots()` in Matplotlib/pyplot/python Modify the legend of pandas bar plot Format y axis as percent Simple line plots using seaborn Plot bar graph from Pandas DataFrame Plotting multiple lines, in different colors, with pandas dataframe Plotting in a non-blocking way with Matplotlib What does the error "arguments imply differing number of rows: x, y" mean? matplotlib get ylim values

Examples related to curve-fitting

python numpy/scipy curve fitting fitting data with numpy Fitting a histogram with python Linear regression with matplotlib / numpy Fitting polynomial model to data in R How to fit a smooth curve to my data in R? How to do exponential and logarithmic curve fitting in Python? I found only polynomial fitting Fitting a density curve to a histogram in R How do I calculate r-squared using Python and Numpy?