How to efficiently calculate a running standard deviation

Question

I have an array of lists of numbers, e.g.:

    [0] (0.01, 0.01, 0.02, 0.04, 0.03)
    [1] (0.00, 0.02, 0.02, 0.03, 0.02)
    [2] (0.01, 0.02, 0.02, 0.03, 0.02)
        ...
    [n] (0.01, 0.00, 0.01, 0.05, 0.03)

What I would like to do is efficiently calculate the mean and standard deviation at each index of a list, across all array elements.

To do the mean, I have been looping through the array and summing the value at a given index of a list. At the end, I divide each value in my "averages list" by n (I am working with a population, not a sample from the population).

To do the standard deviation, I loop through again, now that I have the mean calculated.

I would like to avoid going through the array twice, once for the mean and then once for the SD (after I have a mean).

Is there an efficient method for calculating both values, only going through the array once? Any code in an interpreted language (e.g. Perl or Python) or pseudocode is fine.

User · Answer

As the following answer describes: "Does pandas/scipy/numpy provide a cumulative standard deviation function?" The Python Pandas module contains a method to calculate the running or cumulative standard deviation. For that you'll have to convert your data into a pandas dataframe (or a series if it is 1D), but there are functions for that.

User · Answer

I think this issue will help you: Standard deviation

User · Answer

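Here's a "one-liner", spread over multiple lines, in functional programming style:

    from functools import reduce

    def variance(data, opt=0):
        # opt=0 divides by n - 1 (sample variance); opt=1 divides by n (population variance)
        # acc is the tuple (m2, i, avg): sum of squared deviations, count, running mean
        return (lambda acc: acc[0] / (opt + acc[1] - 1))(
            reduce(
                lambda acc, x: (
                    acc[0] + (x - acc[2]) ** 2 * acc[1] / (acc[1] + 1),
                    acc[1] + 1,
                    acc[2] + (x - acc[2]) / (acc[1] + 1)
                ),
                data,
                (0, 0, 0)))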

User · Answer

The basic answer is to accumulate the sum of both x (call it 'sum_x1') and x^2 (call it 'sum_x2') as you go. The value of the standard deviation is then:

    stdev = sqrt((sum_x2 / n) - (mean * mean))

where

    mean = sum_x1 / n

This is the population standard deviation, since the divisor is 'n'. To get the sample standard deviation, multiply the variance by n / (n - 1) before taking the square root, which is equivalent to using 'n - 1' as the divisor of the summed squared deviations.

You may need to worry about the numerical stability of taking the difference between two large numbers if you are dealing with large samples. Go to the external references in other answers (Wikipedia, etc.) for more information.
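For the layout in the question (a list of equal-length lists), the accumulation described above could be sketched in a single pass roughly as follows. The function name `column_mean_std` is made up for illustration, and the code assumes every row has the same length and that the population form (divisor n) is wanted.

```python
import math

def column_mean_std(rows):
    """One-pass population mean and standard deviation at each index.

    Hypothetical helper for illustration; assumes all rows have equal length.
    """
    sum_x1 = None   # running sum of x at each index
    sum_x2 = None   # running sum of x**2 at each index
    n = 0
    for row in rows:
        if sum_x1 is None:
            sum_x1 = [0.0] * len(row)
            sum_x2 = [0.0] * len(row)
        for i, x in enumerate(row):
            sum_x1[i] += x
            sum_x2[i] += x * x
        n += 1
    if n == 0:
        return [], []
    means = [s / n for s in sum_x1]
    # max(..., 0.0) guards against a tiny negative value caused by rounding
    stdevs = [math.sqrt(max(s2 / n - m * m, 0.0))
              for s2, m in zip(sum_x2, means)]
    return means, stdevs

data = [
    (0.01, 0.01, 0.02, 0.04, 0.03),
    (0.00, 0.02, 0.02, 0.03, 0.02),
    (0.01, 0.02, 0.02, 0.03, 0.02),
    (0.01, 0.00, 0.01, 0.05, 0.03),
]
means, stdevs = column_mean_std(data)
print(means)
print(stdevs)
```

The `max(..., 0.0)` clamp is a small symptom of the stability concern mentioned above: when the mean is large relative to the spread, the subtraction loses precision and can even dip below zero.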

User · Answer

Perhaps not what you were asking, but ... If you use a numpy array, it will do the work for you, efficiently:

    from numpy import array

    nums = array(((0.01, 0.01, 0.02, 0.04, 0.03),
                  (0.00, 0.02, 0.02, 0.03, 0.02),
                  (0.01, 0.02, 0.02, 0.03, 0.02),
                  (0.01, 0.00, 0.01, 0.05, 0.03)))

    # axis=1 reduces along each row (one value per list);
    # use axis=0 for one value per index across all lists
    print(nums.std(axis=1))
    # [ 0.0116619   0.00979796  0.00632456  0.01788854]

    print(nums.mean(axis=1))
    # [ 0.022  0.018  0.02   0.02 ]

By the way, there's some interesting discussion in this blog post and comments on one-pass methods for computing means and variances:

http://lingpipe-blog.com/2009/03/19/computing-sample-mean-variance-online-one-pass

User · Answer

You could look at the Wikipedia article on Standard Deviation, in particular the section about Rapid calculation methods.

There's also an article I found that uses Python; you should be able to use the code in it without much change: Subliminal Messages - Running Standard Deviations.

User · Answer

Statistics::Descriptive is a very decent Perl module for these types of calculations:

    #!/usr/bin/perl

    use strict; use warnings;

    use Statistics::Descriptive qw( :all );

    my @data = (
        [ 0.01, 0.01, 0.02, 0.04, 0.03 ],
        [ 0.00, 0.02, 0.02, 0.03, 0.02 ],
        [ 0.01, 0.02, 0.02, 0.03, 0.02 ],
        [ 0.01, 0.00, 0.01, 0.05, 0.03 ],
    );

    my $stat = Statistics::Descriptive::Full->new;
    # You also have the option of using sparse data structures

    for my $ref ( @data ) {
        $stat->add_data( @$ref );
        printf "Running mean: %f\n", $stat->mean;
        printf "Running stdev: %f\n", $stat->standard_deviation;
    }
    __END__

Output:

    C:\Temp> g
    Running mean: 0.022000
    Running stdev: 0.013038
    Running mean: 0.020000
    Running stdev: 0.011547
    Running mean: 0.020000
    Running stdev: 0.010000
    Running mean: 0.020000
    Running stdev: 0.012566

User · Answer

Here is a literal pure Python translation of the Welford's algorithm implementation from http://www.johndcook.com/standard_deviation.html:

https://github.com/liyanage/python-modules/blob/master/running_stats.py

    import math

    class RunningStats:

        def __init__(self):
            self.n = 0
            self.old_m = 0
            self.new_m = 0
            self.old_s = 0
            self.new_s = 0

        def clear(self):
            self.n = 0

        def push(self, x):
            self.n += 1

            if self.n == 1:
                self.old_m = self.new_m = x
                self.old_s = 0
            else:
                self.new_m = self.old_m + (x - self.old_m) / self.n
                self.new_s = self.old_s + (x - self.old_m) * (x - self.new_m)

                self.old_m = self.new_m
                self.old_s = self.new_s

        def mean(self):
            return self.new_m if self.n else 0.0

        def variance(self):
            # divides by n - 1, i.e. the sample variance
            return self.new_s / (self.n - 1) if self.n > 1 else 0.0

        def standard_deviation(self):
            return math.sqrt(self.variance())

Usage:

    rs = RunningStats()
    rs.push(17.0)
    rs.push(19.0)
    rs.push(24.0)

    mean = rs.mean()
    variance = rs.variance()
    stdev = rs.standard_deviation()

    print(f'Mean: {mean}, Variance: {variance}, Std. Dev.: {stdev}')

User · Answer

I like to express the update this way:

    def running_update(x, N, mu, var):
        '''
            @arg x: the current data sample
            @arg N : the number of previous samples
            @arg mu: the mean of the previous samples
            @arg var : the variance over the previous samples
            @retval (N+1, mu', var') -- updated mean, variance and count
        '''
        N = N + 1
        rho = 1.0/N
        d = x - mu
        mu += rho*d
        var += rho*((1-rho)*d**2 - var)
        return (N, mu, var)

so that a one-pass function would look like this:

    def one_pass(data):
        N = 0
        mu = 0.0
        var = 0.0
        for x in data:
            N = N + 1
            rho = 1.0/N
            d = x - mu
            mu += rho*d
            var += rho*((1-rho)*d**2 - var)
            # could yield here if you want partial results
        return (N, mu, var)

Note that this calculates the variance of the samples seen so far (1/N normalization), not the unbiased estimate of the population variance (which uses a 1/(N-1) normalization factor).

Unlike the other answers, the variable, var, that is tracking the running variance does not grow in proportion to the number of samples. At all times it is just the variance of the set of samples seen so far (there is no final "dividing by n" in getting the variance).

In a class it would look like this:

    class RunningMeanVar(object):
        def __init__(self):
            self.N = 0
            self.mu = 0.0
            self.var = 0.0
        def push(self, x):
            self.N = self.N + 1
            rho = 1.0/self.N
            d = x - self.mu
            self.mu += rho*d
            self.var += rho*((1-rho)*d**2 - self.var)
        # reset, accessors etc. can be set up as you see fit

This also works for weighted samples:

    def running_update(w, x, N, mu, var):
        '''
            @arg w: the weight of the current sample
            @arg x: the current data sample
            @arg mu: the mean of the previous N samples
            @arg var : the variance over the previous N samples
            @arg N : the sum of the weights of the previous samples
            @retval (N+w, mu', var') -- updated mean, variance and count
        '''
        N = N + w
        rho = w/N
        d = x - mu
        mu += rho*d
        var += rho*((1-rho)*d**2 - var)
        return (N, mu, var)
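As a quick sanity check of the weighted form, here is a sketch that restates the update so the snippet is self-contained. The idea being tested is that an integer weight w should behave like w repeated copies of the same sample. The name `weighted_update` and the sample values are made up for illustration.

```python
import statistics

def weighted_update(w, x, N, mu, var):
    """Restatement of the weighted running update, for illustration only."""
    N = N + w
    rho = w / N
    d = x - mu
    mu += rho * d
    var += rho * ((1 - rho) * d ** 2 - var)
    return N, mu, var

# (weight, value) pairs; a weight of 2 stands for two identical observations
weighted = [(2, 1.0), (1, 4.0), (3, 2.5)]
# The same data with every observation written out explicitly
expanded = [1.0, 1.0, 4.0, 2.5, 2.5, 2.5]

N, mu, var = 0, 0.0, 0.0
for w, x in weighted:
    N, mu, var = weighted_update(w, x, N, mu, var)

# Both lines should report the same mean and population variance
print(N, mu, var)
print(len(expanded), statistics.mean(expanded), statistics.pvariance(expanded))
```

The comparison uses `statistics.pvariance` (divisor n) because the running update tracks the 1/N variance, not the 1/(N-1) estimate.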

User · Answer

Here is a practical example of how you could implement a running standard deviation with python and numpy:

    import numpy as np

    a = np.arange(1, 10)
    s = 0
    s2 = 0
    for i in range(0, len(a)):
        s += a[i]
        s2 += a[i] ** 2
        n = (i + 1)
        m = s / n
        std = np.sqrt((s2 / n) - (m * m))
        print(std, np.std(a[:i + 1]))

This will print out the calculated standard deviation and a check standard deviation calculated with numpy:

    0.0 0.0
    0.5 0.5
    0.8164965809277263 0.816496580927726
    1.118033988749895 1.118033988749895
    1.4142135623730951 1.4142135623730951
    1.707825127659933 1.707825127659933
    2.0 2.0
    2.29128784747792 2.29128784747792
    2.5819888974716116 2.581988897471611

I am just using the formula described in this thread:

    stdev = sqrt((sum_x2 / n) - (mean * mean))

User · Answer

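    n = int(input("Enter no. of terms: "))

    L = []
    for i in range(n):
        x = float(input("Enter term: "))
        L.append(x)

    # First pass: the mean
    total = 0.0
    for i in range(n):
        total = total + L[i]
    avg = total / n

    # Second pass: sum of squared deviations from the mean
    sumdev = 0.0
    for j in range(n):
        sumdev = sumdev + (L[j] - avg) ** 2

    # Dividing by n gives the population standard deviation
    dev = (sumdev / n) ** 0.5

    print("Standard deviation is", dev)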

User · Answer

The Python runstats Module is for just this sort of thing. Install runstats from PyPI:

    pip install runstats

Runstats summaries can produce the mean, variance, standard deviation, skewness, and kurtosis in a single pass of data. We can use this to create your "running" version.

    from runstats import Statistics

    # one Statistics summary per index; `data` is the array of lists from the question
    stats = [Statistics() for num in range(len(data[0]))]

    for row in data:

        for index, val in enumerate(row):
            stats[index].push(val)

        for index, stat in enumerate(stats):
            print('Index', index, 'mean:', stat.mean())
            print('Index', index, 'standard deviation:', stat.stddev())

Statistics summaries are based on the Knuth and Welford method for computing standard deviation in one pass as described in The Art of Computer Programming, Vol 2, p. 232, 3rd edition. The benefit of this is numerically stable and accurate results.

Disclaimer: I am the author of the Python runstats module.

User · Answer

How big is your array? Unless it is zillions of elements long, don't worry about looping through it twice. The code is simple and easily tested.

My preference would be to use the numpy array maths extension to convert your array of arrays into a numpy 2D array and get the standard deviation directly:

    >>> x = [ [ 1, 2, 4, 3, 4, 5 ], [ 3, 4, 5, 6, 7, 8 ] ] * 10
    >>> import numpy
    >>> a = numpy.array(x)
    >>> a.std(axis=0)
    array([ 1. ,  1. ,  0.5,  1.5,  1.5,  1.5])
    >>> a.mean(axis=0)
    array([ 2. ,  3. ,  4.5,  4.5,  5.5,  6.5])

If that's not an option and you need a pure Python solution, keep reading...

If your array is

    x = [
          [ 1, 2, 4, 3, 4, 5 ],
          [ 3, 4, 5, 6, 7, 8 ],
          ....
    ]

Then the standard deviation is:

    from math import sqrt

    d = len(x[0])
    n = len(x)
    sum_x = [ sum(v[i] for v in x) for i in range(d) ]
    sum_x2 = [ sum(v[i]**2 for v in x) for i in range(d) ]
    # population standard deviation: sqrt(E[x^2] - E[x]^2)
    std_dev = [ sqrt(sx2 / n - (sx / n) ** 2) for sx, sx2 in zip(sum_x, sum_x2) ]

If you are determined to loop through your array only once, the running sums can be combined:

    sum_x  = [0] * d
    sum_x2 = [0] * d
    for v in x:
        for i, t in enumerate(v):
            sum_x[i] += t
            sum_x2[i] += t**2

This isn't nearly as elegant as the list comprehension solution above.
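If two passes are acceptable and a pure Python solution is wanted, one further option is to transpose the rows with `zip` and let the standard `statistics` module do the arithmetic. This is only a sketch and assumes Python 3.4 or later, where `statistics` is available. `pstdev` divides by n, which matches numpy's default `std`.

```python
import statistics

x = [[1, 2, 4, 3, 4, 5], [3, 4, 5, 6, 7, 8]] * 10

# zip(*x) yields one tuple per index, i.e. the columns of x
columns = list(zip(*x))

means = [statistics.mean(col) for col in columns]
# pstdev is the population standard deviation (divisor n)
std_devs = [statistics.pstdev(col) for col in columns]

print(means)
print(std_devs)
```

Use `statistics.stdev` instead of `pstdev` if the n - 1 (sample) form is needed.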

User · Answer

The answer is to use Welford's algorithm, which is very clearly defined after the "naive methods" in:

- Wikipedia: Algorithms for calculating variance

It's more numerically stable than either the two-pass or online simple sum of squares collectors suggested in other responses. The stability only really matters when you have lots of values that are close to each other, as they lead to what is known as "catastrophic cancellation" in the floating point literature.

You might also want to brush up on the difference between dividing by the number of samples (N) and N-1 in the variance calculation (squared deviation). Dividing by N-1 leads to an unbiased estimate of variance from the sample, whereas dividing by N on average underestimates variance (because it doesn't take into account the variance between the sample mean and the true mean).

I wrote two blog entries on the topic which go into more details, including how to delete previous values online:

- Computing Sample Mean and Variance Online in One Pass
- Deleting Values in Welford's Algorithm for Online Mean and Variance

You can also take a look at my Java implementation; the javadoc, source, and unit tests are all online:

- Javadoc: stats.OnlineNormalEstimator
- Source: stats/OnlineNormalEstimator.java
- JUnit Source: test/unit/stats/OnlineNormalEstimatorTest.java
- LingPipe Home Page
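For the per-index layout in the question, Welford's recurrence could be sketched in Python roughly as follows. The class name `ColumnWelford` and its methods are hypothetical and are not taken from LingPipe. The `remove` method is one way to undo an earlier update by inverting the same recurrence, and it assumes the row being removed really was pushed before.

```python
import math

class ColumnWelford:
    """Welford-style running mean/variance at each index (illustrative sketch)."""

    def __init__(self, width):
        self.n = 0
        self.mean = [0.0] * width
        self.m2 = [0.0] * width   # sum of squared deviations from the mean

    def push(self, row):
        self.n += 1
        for i, x in enumerate(row):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.n
            # product of deviations from the old and the updated mean
            self.m2[i] += delta * (x - self.mean[i])

    def remove(self, row):
        """Undo an earlier push of `row` by inverting the update."""
        if self.n <= 1:
            self.n = 0
            self.mean = [0.0] * len(self.mean)
            self.m2 = [0.0] * len(self.m2)
            return
        old_n = self.n
        self.n -= 1
        for i, x in enumerate(row):
            old_mean = self.mean[i]
            self.mean[i] = (old_n * old_mean - x) / self.n
            self.m2[i] -= (x - old_mean) * (x - self.mean[i])

    def std(self, ddof=0):
        """ddof=0: divide by N (population); ddof=1: divide by N-1 (sample)."""
        if self.n - ddof <= 0:
            return [0.0] * len(self.mean)
        return [math.sqrt(max(m2, 0.0) / (self.n - ddof)) for m2 in self.m2]

data = [
    (0.01, 0.01, 0.02, 0.04, 0.03),
    (0.00, 0.02, 0.02, 0.03, 0.02),
    (0.01, 0.02, 0.02, 0.03, 0.02),
    (0.01, 0.00, 0.01, 0.05, 0.03),
]
stats = ColumnWelford(len(data[0]))
for row in data:
    stats.push(row)
    print(stats.n, stats.mean, stats.std())
```

Because the mean and squared-deviation sums are updated together, partial results are available after every row, and `std(ddof=1)` switches to the N-1 form discussed above.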

User · Answer

Have a look at PDL (pronounced "piddle!").

This is the Perl Data Language, which is designed for high precision mathematics and scientific computing.

Here is an example using your figures:

    use strict;
    use warnings;
    use feature 'say';
    use PDL;

    my $figs = pdl [
        [0.01, 0.01, 0.02, 0.04, 0.03],
        [0.00, 0.02, 0.02, 0.03, 0.02],
        [0.01, 0.02, 0.02, 0.03, 0.02],
        [0.01, 0.00, 0.01, 0.05, 0.03],
    ];

    my ( $mean, $prms, $median, $min, $max, $adev, $rms ) = statsover( $figs );

    say "Mean scores:     ", $mean;
    say "Std dev? (adev): ", $adev;
    say "Std dev? (prms): ", $prms;
    say "Std dev? (rms):  ", $rms;

Which produces:

    Mean scores:     [0.022 0.018 0.02 0.02]
    Std dev? (adev): [0.0104 0.0072 0.004 0.016]
    Std dev? (prms): [0.013038405 0.010954451 0.0070710678 0.02]
    Std dev? (rms):  [0.011661904 0.009797959 0.0063245553 0.017888544]

Have a look at PDL::Primitive for more information on the statsover function. This seems to suggest that ADEV is the "standard deviation".

However, it may be PRMS (which Sinan's Statistics::Descriptive example shows) or RMS (which ars's NumPy example shows). I guess one of these three must be right ;-)

For more PDL information have a look at:

- pdl.perl.org (official PDL page)
- PDL quick reference guide on PerlMonks
- Dr. Dobb's article on PDL
- PDL Wiki
- Wikipedia entry for PDL
- Sourceforge project page for PDL
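To see which of the three figures is which, the first row can be recomputed by hand. The sketch below uses Python's standard `statistics` module (assuming Python 3.4 or later) and is not PDL code; the variable names are illustrative.

```python
import statistics

row = [0.01, 0.01, 0.02, 0.04, 0.03]
mean = statistics.mean(row)

# Average absolute deviation: mean of |x - mean| (not a standard deviation)
adev = sum(abs(x - mean) for x in row) / len(row)

# Sample standard deviation: squared deviations divided by n - 1
sample_sd = statistics.stdev(row)

# Population standard deviation: squared deviations divided by n
population_sd = statistics.pstdev(row)

print(adev, sample_sd, population_sd)
```

For this row the average absolute deviation comes out at about 0.0104 (the ADEV column), the n - 1 form at about 0.013038 (PRMS), and the n form at about 0.011662 (RMS). Since the question says the data is a population rather than a sample, RMS is the figure that matches what was asked for.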
