Fitting empirical distribution to theoretical ones with Scipy (Python)?

Question

INTRODUCTION: I have a list of more than 30,000 integer values ranging from 0 to 47, inclusive, e.g. [0,0,0,0,..,1,1,1,1,...,2,2,2,2,...,47,47,47,...] sampled from some continuous distribution. The values in the list are not necessarily in order, but order doesn't matter for this problem.

PROBLEM: Based on my distribution I would like to calculate the p-value (the probability of seeing greater values) for any given value. For example, as you can see, the p-value for 0 would be approaching 1 and the p-value for higher numbers would be tending to 0.

I don't know if I am right, but to determine probabilities I think I need to fit my data to a theoretical distribution that is the most suitable to describe my data. I assume that some kind of goodness of fit test is needed to determine the best model.

Is there a way to implement such an analysis in Python (Scipy or Numpy)? Could you present any examples?

Thank you!

User · Answer

Try the distfit library.

    pip install distfit

    import numpy as np

    # Create 1000 random integers, value between [0-50]
    X = np.random.randint(0, 50, 1000)

    # Retrieve P-value for y
    y = [0, 10, 45, 55, 100]

    # From the distfit library import the class distfit
    from distfit import distfit

    # Initialize.
    # Set any properties here, such as alpha.
    # The smoothing can be of use when working with integers. Otherwise your histogram
    # may be jumping up-and-down, and getting the correct fit may be harder.
    dist = distfit(alpha=0.05, smooth=10)

    # Search for best theoretical fit on your empirical data
    dist.fit_transform(X)

    > [distfit] >fit..
    > [distfit] >transform..
    > [distfit] >[norm      ] [RSS: 0.0037894] [loc=23.535 scale=14.450]
    > [distfit] >[expon     ] [RSS: 0.0055534] [loc=0.000 scale=23.535]
    > [distfit] >[pareto    ] [RSS: 0.0056828] [loc=-384473077.778 scale=384473077.778]
    > [distfit] >[dweibull  ] [RSS: 0.0038202] [loc=24.535 scale=13.936]
    > [distfit] >[t         ] [RSS: 0.0037896] [loc=23.535 scale=14.450]
    > [distfit] >[genextreme] [RSS: 0.0036185] [loc=18.890 scale=14.506]
    > [distfit] >[gamma     ] [RSS: 0.0037600] [loc=-175.505 scale=1.044]
    > [distfit] >[lognorm   ] [RSS: 0.0642364] [loc=-0.000 scale=1.802]
    > [distfit] >[beta      ] [RSS: 0.0021885] [loc=-3.981 scale=52.981]
    > [distfit] >[uniform   ] [RSS: 0.0012349] [loc=0.000 scale=49.000]

    # Best fitted model
    best_distr = dist.model
    print(best_distr)

    # Uniform shows best fit, with 95% CII (confidence intervals), and all other parameters
    > {'distr': <scipy.stats._continuous_distns.uniform_gen at 0x16de3a53160>,
    >  'params': (0.0, 49.0),
    >  'name': 'uniform',
    >  'RSS': 0.0012349021241149533,
    >  'loc': 0.0,
    >  'scale': 49.0,
    >  'arg': (),
    >  'CII_min_alpha': 2.45,
    >  'CII_max_alpha': 46.55}

    # Ranking distributions
    dist.summary

    # Plot the summary of fitted distributions
    dist.plot_summary()

    # Make prediction on new datapoints based on the fit
    dist.predict(y)

    # Retrieve your p-values with
    dist.y_pred
    # array(['down', 'none', 'none', 'up', 'up'], dtype='<U4')
    dist.y_proba
    # array([0.02040816, 0.02040816, 0.02040816, 0.        , 0.        ])

    # Or in one dataframe
    dist.df

    # The plot function will now also include the predictions of y
    dist.plot()

Note that in this case, all points will be significant because of the uniform distribution. You can filter with dist.y_pred if required.

User · Answer

The following code is a version of the general answer, but with corrections and clarity.

    import numpy as np
    import pandas as pd
    import scipy.stats as st
    import statsmodels.api as sm
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    import math
    import random

    mpl.style.use("ggplot")

    def danoes_formula(data):
        """
        DOANE'S FORMULA
        https://en.wikipedia.org/wiki/Histogram#Doane's_formula
        """
        N = len(data)
        skewness = st.skew(data)
        sigma_g1 = math.sqrt((6 * (N - 2)) / ((N + 1) * (N + 3)))
        num_bins = 1 + math.log(N, 2) + math.log(1 + abs(skewness) / sigma_g1, 2)
        num_bins = round(num_bins)
        return num_bins

    def plot_histogram(data, results, n):
        ## n first distributions of the ranking
        N_DISTRIBUTIONS = {k: results[k] for k in list(results)[:n]}

        ## Histogram of data
        plt.figure(figsize=(10, 5))
        plt.hist(data, density=True, ec='white', color=(63/235, 149/235, 170/235))
        plt.title('HISTOGRAM')
        plt.xlabel('Values')
        plt.ylabel('Frequencies')

        ## Plot n distributions
        for distribution, result in N_DISTRIBUTIONS.items():
            sse, arg, loc, scale = result
            x_plot = np.linspace(min(data), max(data), 1000)
            y_plot = distribution.pdf(x_plot, *arg, loc=loc, scale=scale)
            # distribution.name replaces the fragile str(distribution)[32:-34] slice
            plt.plot(x_plot, y_plot,
                     label=distribution.name + ": " + str(sse)[0:6],
                     color=(random.uniform(0, 1), random.uniform(0, 1), random.uniform(0, 1)))

        plt.legend(title='DISTRIBUTIONS', bbox_to_anchor=(1.05, 1), loc='upper left')
        plt.show()

    def fit_data(data):
        ## st.frechet_r, st.frechet_l: are disabled in current SciPy version
        ## st.levy_stable: a lot of time of estimation parameters
        ## Names are looked up with hasattr so that distributions removed or
        ## renamed in newer SciPy releases (e.g. gilbrat) are skipped, not fatal.
        ALL_DISTRIBUTION_NAMES = [
            'alpha', 'anglit', 'arcsine', 'beta', 'betaprime', 'bradford', 'burr', 'cauchy', 'chi', 'chi2', 'cosine',
            'dgamma', 'dweibull', 'erlang', 'expon', 'exponnorm', 'exponweib', 'exponpow', 'f', 'fatiguelife', 'fisk',
            'foldcauchy', 'foldnorm', 'genlogistic', 'genpareto', 'gennorm', 'genexpon',
            'genextreme', 'gausshyper', 'gamma', 'gengamma', 'genhalflogistic', 'gilbrat', 'gompertz', 'gumbel_r',
            'gumbel_l', 'halfcauchy', 'halflogistic', 'halfnorm', 'halfgennorm', 'hypsecant', 'invgamma', 'invgauss',
            'invweibull', 'johnsonsb', 'johnsonsu', 'ksone', 'kstwobign', 'laplace', 'levy', 'levy_l',
            'logistic', 'loggamma', 'loglaplace', 'lognorm', 'lomax', 'maxwell', 'mielke', 'nakagami', 'ncx2', 'ncf',
            'nct', 'norm', 'pareto', 'pearson3', 'powerlaw', 'powerlognorm', 'powernorm', 'rdist', 'reciprocal',
            'rayleigh', 'rice', 'recipinvgauss', 'semicircular', 't', 'triang', 'truncexpon', 'truncnorm', 'tukeylambda',
            'uniform', 'vonmises', 'vonmises_line', 'wald', 'weibull_min', 'weibull_max', 'wrapcauchy'
        ]
        ALL_DISTRIBUTIONS = [getattr(st, name) for name in ALL_DISTRIBUTION_NAMES if hasattr(st, name)]

        MY_DISTRIBUTIONS = [st.beta, st.expon, st.norm, st.uniform, st.johnsonsb, st.gennorm, st.gausshyper]

        ## Calculate histogram
        num_bins = danoes_formula(data)
        frequencies, bin_edges = np.histogram(data, num_bins, density=True)
        central_values = [(bin_edges[i] + bin_edges[i + 1]) / 2 for i in range(len(bin_edges) - 1)]

        results = {}
        for distribution in MY_DISTRIBUTIONS:
            ## Get parameters of distribution
            params = distribution.fit(data)

            ## Separate parts of parameters
            arg = params[:-2]
            loc = params[-2]
            scale = params[-1]

            ## Calculate fitted PDF and error with fit in distribution
            pdf_values = [distribution.pdf(c, *arg, loc=loc, scale=scale) for c in central_values]

            ## Calculate SSE (sum of squared estimate of errors)
            sse = np.sum(np.power(frequencies - pdf_values, 2.0))

            ## Build results
            results[distribution] = [sse, arg, loc, scale]

        ## Sort by SSE only (the first element of each result)
        results = {k: results[k] for k in sorted(results, key=lambda k: results[k][0])}
        return results

    def main():
        ## Import data
        data = pd.Series(sm.datasets.elnino.load_pandas().data.set_index('YEAR').values.ravel())
        results = fit_data(data)
        plot_histogram(data, results, 5)

    if __name__ == "__main__":
        main()

User · Answer

AFAICU, your distribution is discrete (and nothing but discrete). Therefore just counting the frequencies of different values and normalizing them should be enough for your purposes. So, an example to demonstrate this:

    from numpy import asarray, bincount

    values = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
    counts = asarray(bincount(values), dtype=float)
    cdf = counts.cumsum() / counts.sum()

Thus, the probability of seeing values higher than 1 is simply (according to the complementary cumulative distribution function (ccdf)):

    1 - cdf[1]
    # Out: 0.40000000000000002

Please note that the ccdf is the same quantity as the survival function (sf), 1 - CDF, and it is defined for discrete data like this just as it is for continuous distributions.
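As a minimal sketch of the counting approach above, the tail probability P(X > k) can be tabulated for every k from 0 to 47 in one pass. The function name `tail_probabilities` and the `max_value` parameter are illustrative choices, not part of the original answer.

```python
import numpy as np

def tail_probabilities(values, max_value=None):
    """Return an array t where t[k] is the empirical P(X > k)."""
    values = np.asarray(values, dtype=int)
    # minlength pads the count array so that unobserved large values get a zero count
    minlength = 0 if max_value is None else max_value + 1
    counts = np.bincount(values, minlength=minlength).astype(float)
    cdf = counts.cumsum() / counts.sum()
    return 1.0 - cdf

values = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
tail = tail_probabilities(values, max_value=47)
print(tail[1])   # empirical P(X > 1)
print(tail[47])  # nothing above the largest possible value
```

Passing `max_value=47` keeps the result indexable for every value in the question's range, even when some values never occur in the sample.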

User · Answer

Forgive me if I don't understand your need, but what about storing your data in a dictionary where the keys would be the numbers between 0 and 47 and the values the number of occurrences of their related keys in your original list? Thus your likelihood p(x) will be the sum of all the values for keys greater than x, divided by 30000.
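A short sketch of that dictionary idea, using `collections.Counter` as the dictionary. The helper name `prob_greater` and the toy data are assumptions for illustration; the divisor is taken from the data rather than hard-coded to 30000.

```python
from collections import Counter

def prob_greater(counts, x, total):
    """Empirical P(X > x): occurrences of keys above x, divided by the sample size."""
    return sum(c for value, c in counts.items() if value > x) / total

data = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]  # toy stand-in for the real list
counts = Counter(data)          # value -> number of occurrences
total = sum(counts.values())    # sample size (30000+ for the real data)

p = prob_greater(counts, 1, total)
print(p)
```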

User · Answer

Distribution Fitting with Sum of Square Error (SSE)

This is an update and modification to Saullo's answer that uses the full list of the current scipy.stats distributions and returns the distribution with the least SSE between the distribution's histogram and the data's histogram.

Example Fitting

Using the El Niño dataset from statsmodels, the distributions are fit and error is determined. The distribution with the least error is returned.

[Figure: All Distributions]

[Figure: Best Fit Distribution]

Example Code

    # %matplotlib inline   (uncomment in a Jupyter notebook)

    import warnings
    import numpy as np
    import pandas as pd
    import scipy.stats as st
    import statsmodels.api as sm
    import matplotlib
    import matplotlib.pyplot as plt

    matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
    matplotlib.style.use('ggplot')

    # Create models from data
    def best_fit_distribution(data, bins=200, ax=None):
        """Model data by finding best fit distribution to data"""
        # Get histogram of original data
        y, x = np.histogram(data, bins=bins, density=True)
        x = (x + np.roll(x, -1))[:-1] / 2.0

        # Distributions to check. Names missing from the installed SciPy
        # (e.g. frechet_r, frechet_l, gilbrat in recent releases) are skipped.
        DISTRIBUTION_NAMES = [
            'alpha', 'anglit', 'arcsine', 'beta', 'betaprime', 'bradford', 'burr', 'cauchy', 'chi', 'chi2', 'cosine',
            'dgamma', 'dweibull', 'erlang', 'expon', 'exponnorm', 'exponweib', 'exponpow', 'f', 'fatiguelife', 'fisk',
            'foldcauchy', 'foldnorm', 'frechet_r', 'frechet_l', 'genlogistic', 'genpareto', 'gennorm', 'genexpon',
            'genextreme', 'gausshyper', 'gamma', 'gengamma', 'genhalflogistic', 'gilbrat', 'gompertz', 'gumbel_r',
            'gumbel_l', 'halfcauchy', 'halflogistic', 'halfnorm', 'halfgennorm', 'hypsecant', 'invgamma', 'invgauss',
            'invweibull', 'johnsonsb', 'johnsonsu', 'ksone', 'kstwobign', 'laplace', 'levy', 'levy_l', 'levy_stable',
            'logistic', 'loggamma', 'loglaplace', 'lognorm', 'lomax', 'maxwell', 'mielke', 'nakagami', 'ncx2', 'ncf',
            'nct', 'norm', 'pareto', 'pearson3', 'powerlaw', 'powerlognorm', 'powernorm', 'rdist', 'reciprocal',
            'rayleigh', 'rice', 'recipinvgauss', 'semicircular', 't', 'triang', 'truncexpon', 'truncnorm', 'tukeylambda',
            'uniform', 'vonmises', 'vonmises_line', 'wald', 'weibull_min', 'weibull_max', 'wrapcauchy'
        ]
        DISTRIBUTIONS = [getattr(st, name) for name in DISTRIBUTION_NAMES if hasattr(st, name)]

        # Best holders
        best_distribution = st.norm
        best_params = (0.0, 1.0)
        best_sse = np.inf

        # Estimate distribution parameters from data
        for distribution in DISTRIBUTIONS:

            # Try to fit the distribution
            try:
                # Ignore warnings from data that can't be fit
                with warnings.catch_warnings():
                    warnings.filterwarnings('ignore')

                    # fit dist to data
                    params = distribution.fit(data)

                    # Separate parts of parameters
                    arg = params[:-2]
                    loc = params[-2]
                    scale = params[-1]

                    # Calculate fitted PDF and error with fit in distribution
                    pdf = distribution.pdf(x, *arg, loc=loc, scale=scale)
                    sse = np.sum(np.power(y - pdf, 2.0))

                    # if axis passed in, add to plot
                    try:
                        if ax is not None:
                            pd.Series(pdf, x).plot(ax=ax)
                    except Exception:
                        pass

                    # identify if this distribution is better
                    if best_sse > sse > 0:
                        best_distribution = distribution
                        best_params = params
                        best_sse = sse

            except Exception:
                pass

        return (best_distribution.name, best_params)

    def make_pdf(dist, params, size=10000):
        """Generate distribution's Probability Distribution Function"""

        # Separate parts of parameters
        arg = params[:-2]
        loc = params[-2]
        scale = params[-1]

        # Get sane start and end points of distribution
        start = dist.ppf(0.01, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.01, loc=loc, scale=scale)
        end = dist.ppf(0.99, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.99, loc=loc, scale=scale)

        # Build PDF and turn into pandas Series
        x = np.linspace(start, end, size)
        y = dist.pdf(x, *arg, loc=loc, scale=scale)
        pdf = pd.Series(y, x)

        return pdf

    # Load data from statsmodels datasets
    data = pd.Series(sm.datasets.elnino.load_pandas().data.set_index('YEAR').values.ravel())

    # Plot for comparison
    # (density= replaces the removed normed= argument; axes.prop_cycle replaces axes.color_cycle)
    plt.figure(figsize=(12, 8))
    ax = data.plot(kind='hist', bins=50, density=True, alpha=0.5,
                   color=plt.rcParams['axes.prop_cycle'].by_key()['color'][1])
    # Save plot limits
    dataYLim = ax.get_ylim()

    # Find best fit distribution
    best_fit_name, best_fit_params = best_fit_distribution(data, 200, ax)
    best_dist = getattr(st, best_fit_name)

    # Update plots
    ax.set_ylim(dataYLim)
    ax.set_title(u'El Niño sea temp.\n All Fitted Distributions')
    ax.set_xlabel(u'Temp (°C)')
    ax.set_ylabel('Frequency')

    # Make PDF with best params
    pdf = make_pdf(best_dist, best_fit_params)

    # Display
    plt.figure(figsize=(12, 8))
    ax = pdf.plot(lw=2, label='PDF', legend=True)
    data.plot(kind='hist', bins=50, density=True, alpha=0.5, label='Data', legend=True, ax=ax)

    param_names = (best_dist.shapes + ', loc, scale').split(', ') if best_dist.shapes else ['loc', 'scale']
    param_str = ', '.join(['{}={:0.2f}'.format(k, v) for k, v in zip(param_names, best_fit_params)])
    dist_str = '{}({})'.format(best_fit_name, param_str)

    ax.set_title(u'El Niño sea temp. with best fit distribution \n' + dist_str)
    ax.set_xlabel(u'Temp. (°C)')
    ax.set_ylabel('Frequency')
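Once a distribution has been fitted, the quantity the question asks for, P(X > x), comes from the fitted distribution's survival function. The sketch below is not part of the answer's code: it uses synthetic gamma data and `st.gamma` purely as stand-ins for the real data and for whichever distribution wins the SSE comparison.

```python
import numpy as np
import scipy.stats as st

# Synthetic stand-in for the real sample (assumption for illustration)
rng = np.random.default_rng(0)
data = rng.gamma(shape=2.0, scale=5.0, size=5000)

# fit() returns (shape parameters..., loc, scale); passing the tuple back
# positionally gives a "frozen" distribution with those parameters fixed
params = st.gamma.fit(data)
fitted = st.gamma(*params)

def p_greater(x):
    """P(X > x) under the fitted model, via the survival function sf = 1 - cdf."""
    return fitted.sf(x)

for x in (1, 10, 20, 47):
    print(x, p_greater(x))
```

With the answer's own output, the same pattern would be `getattr(st, best_fit_name)(*best_fit_params).sf(x)`.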

User · Answer

With OpenTURNS, I would use the BIC criterion to select the best distribution that fits such data. This is because this criterion does not give too much advantage to the distributions which have more parameters. Indeed, if a distribution has more parameters, it is easier for the fitted distribution to be closer to the data. Moreover, the Kolmogorov-Smirnov test may not make sense in this case, because a small error in the measured values will have a huge impact on the p-value.

To illustrate the process, I load the El-Nino data, which contains 732 monthly temperature measurements from 1950 to 2010:

    import statsmodels.api as sm

    dta = sm.datasets.elnino.load_pandas().data
    dta['YEAR'] = dta.YEAR.astype(int).astype(str)
    dta = dta.set_index('YEAR').T.unstack()
    data = dta.values

It is easy to get the 30 built-in univariate distribution factories with the GetContinuousUniVariateFactories static method. Once done, the BestModelBIC static method returns the best model and the corresponding BIC score.

    import openturns as ot

    sample = ot.Sample([[p] for p in data])  # data reshaping
    tested_factories = ot.DistributionFactory.GetContinuousUniVariateFactories()
    best_model, best_bic = ot.FittingTest.BestModelBIC(sample,
                                                       tested_factories)
    print("Best=", best_model)

which prints:

    Best= Beta(alpha = 1.64258, beta = 2.4348, a = 18.936, b = 29.254)

In order to graphically compare the fit to the histogram, I use the drawPDF methods of the best distribution.

    import openturns.viewer as otv

    graph = ot.HistogramFactory().build(sample).drawPDF()
    bestPDF = best_model.drawPDF()
    bestPDF.setColors(["blue"])
    graph.add(bestPDF)
    graph.setTitle("Best BIC fit")
    name = best_model.getImplementation().getClassName()
    graph.setLegends(["Histogram", name])
    graph.setXTitle("Temperature (°C)")
    otv.View(graph)

This produces:

[Figure: histogram of the sample with the best BIC fit overlaid]

More details on this topic are presented in the BestModelBIC doc. It would be possible to include SciPy distributions with SciPyDistribution, or even ChaosPy distributions with ChaosPyDistribution, but I guess that the current script fulfills most practical purposes.

User · Answer

While many of the above answers are completely valid, no one seems to answer your question completely, specifically the part:

"I don't know if I am right, but to determine probabilities I think I need to fit my data to a theoretical distribution that is the most suitable to describe my data. I assume that some kind of goodness of fit test is needed to determine the best model."

The parametric approach

This is the process you're describing of using some theoretical distribution and fitting the parameters to your data, and there are some excellent answers on how to do this.

The non-parametric approach

However, it's also possible to use a non-parametric approach to your problem, which means you do not assume any underlying distribution at all. This uses the so-called empirical distribution function, which equals:

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I[X_i \le x]$$

So, the proportion of values at or below x. As was pointed out in one of the above answers, what you're interested in is the complementary CDF (cumulative distribution function), which is equal to 1 - F(x).

It can be shown that the empirical distribution function will converge to whatever 'true' CDF generated your data. Furthermore, it is straightforward to construct a 1 - alpha confidence interval by:

$$L(x) = \max\{F_n(x) - \epsilon_n,\ 0\}$$
$$U(x) = \min\{F_n(x) + \epsilon_n,\ 1\}$$
$$\epsilon_n = \sqrt{\frac{1}{2n} \log\frac{2}{\alpha}}$$

Then $P(L(x) \le F(x) \le U(x)) \ge 1 - \alpha$ for all x.

I'm quite surprised that PierrOz's answer has 0 votes, while it's a completely valid answer to the question using a non-parametric approach to estimating F(x).

Note that the issue you mention of P(X >= x) = 0 for any x > 47 is simply a personal preference that might lead you to choose the parametric approach over the non-parametric approach. Both approaches, however, are completely valid solutions to your problem.

For more details and proofs of the above statements I would recommend having a look at 'All of Statistics: A Concise Course in Statistical Inference' by Larry A. Wasserman. An excellent book on both parametric and non-parametric inference.

EDIT: Since you specifically asked for some Python examples, it can be done using numpy:

    import numpy as np

    def empirical_cdf(data, x):
        # proportion of observations at or below x
        return np.sum(data <= x) / len(data)

    def p_value(data, x):
        return 1 - empirical_cdf(data, x)

    # Generate some data for demonstration purposes
    data = np.floor(np.random.uniform(low=0, high=48, size=30000))

    print(empirical_cdf(data, 20))
    print(p_value(data, 20))  # This is the value you're interested in
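The confidence band described in this answer (the Dvoretzky-Kiefer-Wolfowitz bound, with the upper limit capped at 1) can be sketched as follows. The function names `dkw_epsilon` and `dkw_band` are illustrative, and the data are simulated for demonstration.

```python
import numpy as np

def ecdf(data, x):
    """Empirical CDF: proportion of observations at or below x."""
    return np.mean(np.asarray(data) <= x)

def dkw_epsilon(n, alpha=0.05):
    """Half-width of the 1 - alpha band: sqrt(log(2 / alpha) / (2 n))."""
    return np.sqrt(np.log(2.0 / alpha) / (2.0 * n))

def dkw_band(data, x, alpha=0.05):
    """Lower and upper 1 - alpha confidence limits for the true CDF at x."""
    eps = dkw_epsilon(len(data), alpha)
    f = ecdf(data, x)
    return max(f - eps, 0.0), min(f + eps, 1.0)

# Simulated integers in 0..47, as in the question
rng = np.random.default_rng(0)
data = rng.integers(0, 48, size=30000)

lower, upper = dkw_band(data, 20)
# The band for P(X > 20) = 1 - F(20) is obtained by flipping the limits
p_low, p_high = 1.0 - upper, 1.0 - lower
print(p_low, 1.0 - ecdf(data, 20), p_high)
```

With 30,000 observations and alpha = 0.05 the half-width is below 0.01, so the empirical tail probability is already pinned down quite tightly without any parametric fit.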

User · Answer

There are more than 90 implemented distribution functions in SciPy v1.6.0. You can test how some of them fit your data using their fit() method. Check the code below for more details:

    import matplotlib.pyplot as plt
    import numpy as np
    import scipy.stats

    size = 30000
    x = np.arange(size)
    # scipy.int_ and np.round_ are gone from current releases; astype(int) and np.round do the same job
    y = np.round(scipy.stats.vonmises.rvs(5, size=size) * 47).astype(int)
    h = plt.hist(y, bins=range(48))

    dist_names = ['gamma', 'beta', 'rayleigh', 'norm', 'pareto']

    for dist_name in dist_names:
        dist = getattr(scipy.stats, dist_name)
        params = dist.fit(y)
        arg = params[:-2]
        loc = params[-2]
        scale = params[-1]
        # *arg is empty for distributions without shape parameters
        pdf_fitted = dist.pdf(x, *arg, loc=loc, scale=scale) * size
        plt.plot(pdf_fitted, label=dist_name)
        plt.xlim(0, 47)
    plt.legend(loc='upper right')
    plt.show()

References:

- Fitting distributions, goodness of fit, p-value. Is it possible to do this with Scipy (Python)?
- Distribution fitting with Scipy

And here is a list with the names of all distribution functions available in Scipy 0.12.0 (VI):

    dist_names = ['alpha', 'anglit', 'arcsine', 'beta', 'betaprime', 'bradford', 'burr', 'cauchy', 'chi', 'chi2',
                  'cosine', 'dgamma', 'dweibull', 'erlang', 'expon', 'exponweib', 'exponpow', 'f', 'fatiguelife',
                  'fisk', 'foldcauchy', 'foldnorm', 'frechet_r', 'frechet_l', 'genlogistic', 'genpareto', 'genexpon',
                  'genextreme', 'gausshyper', 'gamma', 'gengamma', 'genhalflogistic', 'gilbrat', 'gompertz',
                  'gumbel_r', 'gumbel_l', 'halfcauchy', 'halflogistic', 'halfnorm', 'hypsecant', 'invgamma',
                  'invgauss', 'invweibull', 'johnsonsb', 'johnsonsu', 'ksone', 'kstwobign', 'laplace', 'logistic',
                  'loggamma', 'loglaplace', 'lognorm', 'lomax', 'maxwell', 'mielke', 'nakagami', 'ncx2', 'ncf',
                  'nct', 'norm', 'pareto', 'pearson3', 'powerlaw', 'powerlognorm', 'powernorm', 'rdist',
                  'reciprocal', 'rayleigh', 'rice', 'recipinvgauss', 'semicircular', 't', 'triang', 'truncexpon',
                  'truncnorm', 'tukeylambda', 'uniform', 'vonmises', 'wald', 'weibull_min', 'weibull_max',
                  'wrapcauchy']

User · Answer

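It sounds like a probability density estimation problem to me.

    from scipy.stats import gaussian_kde

    # placeholder for the real list; the ".." runs stand for the elided values
    occurences = [0,0,0,0,..,1,1,1,1,...,2,2,2,2,...,47]
    values = range(0, 48)
    # list(...) is needed on Python 3, where map returns an iterator
    kde = gaussian_kde(list(map(float, occurences)))
    p = kde(values)
    p = p / sum(p)
    print("P(x>=1) = %f" % sum(p[1:]))

Also see http://jpktd.blogspot.com/2009/03/using-gaussian-kernel-density.html.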

User · Answer

The fit() method mentioned by @Saullo Castro provides maximum likelihood estimates (MLE). The best distribution for your data can be determined in several different ways, such as:

1. the one that gives you the highest log likelihood;
2. the one that gives you the smallest AIC, BIC or BICc values (see wiki: http://en.wikipedia.org/wiki/Akaike_information_criterion; these can basically be viewed as log likelihood adjusted for the number of parameters, as distributions with more parameters are expected to fit better);
3. the one that maximizes the Bayesian posterior probability (see wiki: http://en.wikipedia.org/wiki/Posterior_probability).

Of course, if you already have a distribution that should describe your data (based on the theories in your particular field) and want to stick to that, you will skip the step of identifying the best fit distribution.

scipy does not come with a function to calculate log likelihood (although the MLE method is provided), but hard-coding one is easy: see "Is the build-in probability density functions of `scipy.stat.distributions` slower than a user provided one?"
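One way to hard-code the log likelihood, sketched under stated assumptions: sum the fitted distribution's `logpdf` over the data, then derive AIC and BIC from it with the usual formulas. The helper name `information_criteria`, the three candidate distributions, and the synthetic normal data are illustrative choices only.

```python
import numpy as np
import scipy.stats as st

def information_criteria(dist, data):
    """Fit dist by MLE and return (log likelihood, AIC, BIC)."""
    params = dist.fit(data)                      # (shapes..., loc, scale)
    loglik = np.sum(dist.logpdf(data, *params))  # log likelihood at the MLE
    k = len(params)                              # number of fitted parameters
    n = len(data)
    aic = 2 * k - 2 * loglik
    bic = k * np.log(n) - 2 * loglik
    return loglik, aic, bic

# Synthetic data for demonstration (assumption, not the question's data)
rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=2000)

candidates = (st.norm, st.expon, st.uniform)
scores = {d.name: information_criteria(d, data) for d in candidates}

# Smallest AIC wins; use index 2 instead of 1 to rank by BIC
best = min(scores, key=lambda name: scores[name][1])
print(best, scores[best])
```

Because AIC and BIC penalize the parameter count, they let distributions with different numbers of shape parameters be compared on a more even footing than raw log likelihood.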
