The other answers are all good approaches. However, there are a few other options in R that haven't been mentioned, including lowess
and approx
, which may give better fits or faster performance.
The advantages are more easily demonstrated with an alternate dataset:
sigmoid <- function(x)
{
y<-1/(1+exp(-.15*(x-100)))
return(y)
}
dat<-data.frame(x=rnorm(5000)*30+100)
dat$y<-as.numeric(as.logical(round(sigmoid(dat$x)+rnorm(5000)*.3,0)))
Here is the data overlaid with the sigmoid curve that generated it:
This sort of data is common when looking at a binary behavior among a population. For example, this might be a plot of whether or not a customer purchased something (a binary 1/0 on the y-axis) versus the amount of time they spent on the site (x-axis).
A large number of points are used to better demonstrate the performance differences of these functions.
Smooth
, spline
, and smooth.spline
all produce gibberish on a dataset like this with any set of parameters I have tried, perhaps due to their tendency to map to every point, which does not work for noisy data.
The loess
, lowess
, and approx
functions all produce usable results, although just barely for approx
. This is the code for each using lightly optimized parameters:
loessFit <- loess(y~x, dat, span = 0.6)
loessFit <- data.frame(x=loessFit$x,y=loessFit$fitted)
loessFit <- loessFit[order(loessFit$x),]
approxFit <- approx(dat,n = 15)
lowessFit <-data.frame(lowess(dat,f = .6,iter=1))
And the results:
plot(dat,col='gray')
curve(sigmoid,0,200,add=TRUE,col='blue',)
lines(lowessFit,col='red')
lines(loessFit,col='green')
lines(approxFit,col='purple')
legend(150,.6,
legend=c("Sigmoid","Loess","Lowess",'Approx'),
lty=c(1,1),
lwd=c(2.5,2.5),col=c("blue","green","red","purple"))
As you can see, lowess
produces a near perfect fit to the original generating curve. Loess
is close, but experiences a strange deviation at both tails.
Although your dataset will be very different, I have found that other datasets perform similarly, with both loess
and lowess
capable of producing good results. The differences become more significant when you look at benchmarks:
> microbenchmark::microbenchmark(loess(y~x, dat, span = 0.6),approx(dat,n = 20),lowess(dat,f = .6,iter=1),times=20)
Unit: milliseconds
expr min lq mean median uq max neval cld
loess(y ~ x, dat, span = 0.6) 153.034810 154.450750 156.794257 156.004357 159.23183 163.117746 20 c
approx(dat, n = 20) 1.297685 1.346773 1.689133 1.441823 1.86018 4.281735 20 a
lowess(dat, f = 0.6, iter = 1) 9.637583 10.085613 11.270911 11.350722 12.33046 12.495343 20 b
Loess
is extremely slow, taking 100x as long as approx
. Lowess
produces better results than approx
, while still running fairly quickly (15x faster than loess).
Loess
also becomes increasingly bogged down as the number of points increases, becoming unusable around 50,000.
EDIT: Additional research shows that loess
gives better fits for certain datasets. If you are dealing with a small dataset or performance is not a consideration, try both functions and compare the results.