Pandas - Compute z-score for all columns

Question

I have a dataframe containing a single column of IDs; all other columns are numerical values for which I want to compute z-scores. Here's a subsection of it:

    ID      Age    BMI    Risk Factor
    PT 6    48     19.3    4
    PT 8    43     20.9    NaN
    PT 2    39     18.1    3
    PT 9    41     19.5    NaN

Some of my columns contain NaN values which I do not want to include in the z-score calculations, so I intend to use a solution offered to this question: how to zscore normalize pandas column with nans?

    df['zscore'] = (df.a - df.a.mean()) / df.a.std(ddof=0)

I'm interested in applying this solution to all of my columns except the ID column to produce a new dataframe which I can save as an Excel file using

    df2.to_excel("Z-Scores.xlsx")

So basically: how can I compute z-scores for each column (ignoring NaN values) and push everything into a new dataframe?

SIDENOTE: there is a concept in pandas called "indexing" which intimidates me because I do not understand it well. If indexing is a crucial part of solving this problem, please dumb down your explanation of indexing.

User · Answer

For the z-score, we can stick to the documentation instead of using the `apply` function:

    from scipy.stats import zscore
    df_zscore = zscore(cols_as_array, axis=1)
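The `axis` argument decides the direction of the calculation: `axis=1` standardizes along each row, while `axis=0` (the default in `scipy.stats.zscore`) standardizes each column, which is what the question asks for. A minimal NumPy-only sketch of the per-column case with NaN values ignored; the small array is hypothetical and only loosely based on the question's data.

```python
import numpy as np

# Hypothetical array: rows are observations, columns are variables.
values = np.array([[48.0, 19.3],
                   [43.0, 20.9],
                   [39.0, np.nan]])

# axis=0 collapses the rows, giving one mean and one std per column.
col_mean = np.nanmean(values, axis=0)
col_std = np.nanstd(values, axis=0)  # population std (ddof=0), NaN ignored

# Broadcasting subtracts/divides column by column; NaN cells stay NaN.
z_by_column = (values - col_mean) / col_std
print(z_by_column)
```

Swapping `axis=0` for `axis=1` (with `keepdims=True` so the shapes still broadcast) would instead compare each value with the other values in its own row.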

User · Answer

Using SciPy's zscore function:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randint(100, 200, size=(5, 3)), columns=['A', 'B', 'C'])
    df

          A     B     C
    0   163   163   159
    1   120   153   181
    2   130   199   108
    3   108   188   157
    4   109   171   119

    from scipy.stats import zscore
    df.apply(zscore)

               A          B          C
    0   1.83447   -0.708023   0.523362
    1  -0.297482  -1.30804    1.3342
    2   0.198321   1.45205   -1.35632
    3  -0.892446   0.792025   0.449649
    4  -0.842866  -0.228007  -0.950897

If not all the columns of your data frame are numeric, then you can apply the z-score function only to the numeric columns using the select_dtypes function:

    # Note that `select_dtypes` returns a data frame. We are selecting only the columns
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    df[numeric_cols].apply(zscore)

               A          B          C
    0   1.83447   -0.708023   0.523362
    1  -0.297482  -1.30804    1.3342
    2   0.198321   1.45205   -1.35632
    3  -0.892446   0.792025   0.449649
    4  -0.842866  -0.228007  -0.950897
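One caveat for the question's data: `scipy.stats.zscore` propagates NaN by default (`nan_policy='propagate'`), so a column with a missing value comes back entirely NaN. A minimal sketch that keeps the `select_dtypes` step but relies on pandas' own `mean` and `std`, which skip NaN by default. The sample frame is copied from the question, and `ddof=0` is an assumption chosen to match the formula quoted there.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "ID": ["PT 6", "PT 8", "PT 2", "PT 9"],
    "Age": [48, 43, 39, 41],
    "BMI": [19.3, 20.9, 18.1, 19.5],
    "Risk Factor": [4, np.nan, 3, np.nan],
})

# Keep only the numeric columns; the string ID column is left out.
numeric_cols = df.select_dtypes(include=[np.number]).columns
numeric = df[numeric_cols]

# mean() and std() skip NaN, so missing cells simply stay NaN in the result.
z = (numeric - numeric.mean()) / numeric.std(ddof=0)
print(z)
```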

User · Answer

Build a list from the columns and remove the column you don't want to calculate the z-score for:

    In [66]:
    cols = list(df.columns)
    cols.remove('ID')
    df[cols]

    Out[66]:
       Age  BMI  Risk  Factor
    0    6   48  19.3       4
    1    8   43  20.9     NaN
    2    2   39  18.1       3
    3    9   41  19.5     NaN

    In [68]:
    # now iterate over the remaining columns and create a new zscore column
    for col in cols:
        col_zscore = col + '_zscore'
        df[col_zscore] = (df[col] - df[col].mean()) / df[col].std(ddof=0)
    df

    Out[68]:
       ID  Age  BMI  Risk  Factor  Age_zscore  BMI_zscore  Risk_zscore  \
    0  PT    6   48  19.3       4   -0.093250    1.569614    -0.150946
    1  PT    8   43  20.9     NaN    0.652753    0.074744     1.459148
    2  PT    2   39  18.1       3   -1.585258   -1.121153    -1.358517
    3  PT    9   41  19.5     NaN    1.025755   -0.523205     0.050315

       Factor_zscore
    0              1
    1            NaN
    2             -1
    3            NaN

User · Answer

When we are dealing with time-series, calculating z-scores (or anomalies: not the same thing, but you can adapt this code easily) is a bit more complicated. For example, say you have 10 years of temperature data measured weekly. To calculate z-scores for the whole time-series, you have to know the means and standard deviations for each day of the year. So, let's get started.

Assume you have a pandas DataFrame. First of all, you need a DateTime index. If you don't have one yet, but luckily you do have a column with dates, just make it your index. Pandas will try to guess the date format. The goal here is to have a DatetimeIndex. You can check by trying:

    type(df.index)

If you don't have one, let's make it:

    df.index = pd.DatetimeIndex(df[datecolumn])
    df = df.drop(datecolumn, axis=1)

The next step is to calculate the mean and standard deviation for each group of days. For this, we use the groupby method:

    mean = df.groupby(df.index.dayofyear).aggregate(np.nanmean)
    std = df.groupby(df.index.dayofyear).aggregate(np.nanstd)

Finally, we loop through all the dates, performing the calculation (value - mean) / stddev; however, as mentioned, for time-series this is not so straightforward:

    df2 = df.copy()  # keep a copy for future comparisons
    for y in np.unique(df.index.year):
        for d in np.unique(df.index.dayofyear):
            # rows belonging to day-of-year d in year y
            mask = (df.index.year == y) & (df.index.dayofyear == d)
            df2[mask] = (df[mask] - mean.loc[d]) / std.loc[d]
    df2.index.name = 'date'  # this is just to look nicer

    df2  # this is your z-score dataset

The logic inside the for loops is: for a given year, we have to match each dayofyear to its mean and stdev. We run this for all the years in your time-series.
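The per-day-of-year standardization described above can also be sketched without explicit loops, using `groupby(...).transform`. This is an alternative under stated assumptions, not the answer's own method: the daily temperature series is randomly generated, the column name `temp` and the helper `standardize` are hypothetical, and the population standard deviation (`ddof=0`) is assumed.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three non-leap years of daily temperatures.
dates = pd.date_range("2001-01-01", "2003-12-31", freq="D")
rng = np.random.default_rng(0)
df = pd.DataFrame({"temp": rng.normal(15.0, 5.0, size=len(dates))}, index=dates)

def standardize(group):
    # `group` holds the values that share one day of the year.
    return (group - group.mean()) / group.std(ddof=0)

# transform returns a frame with the same index as df, so every row keeps
# its date while being compared against its own day-of-year statistics.
df2 = df.groupby(df.index.dayofyear).transform(standardize)
print(df2.head())
```

Each day of the year needs at least two observations with different values; otherwise its standard deviation is zero and the z-score is undefined.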

User · Answer

To calculate a z-score for an entire column quickly, do as follows:

    from scipy.stats import zscore
    import pandas as pd

    df = pd.DataFrame({'num_1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 3, 4, 6, 5, 7, 3, 2, 9]})
    df['num_1_zscore'] = zscore(df['num_1'])

    display(df)

User · Answer

If you want to calculate the z-score for all of the columns, you can just use the following:

    df_zscore = (df - df.mean()) / df.std()
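One detail worth checking when comparing results across these answers: `DataFrame.std()` defaults to the sample standard deviation (`ddof=1`), whereas `scipy.stats.zscore` and the formula quoted in the question use the population standard deviation (`ddof=0`). A small sketch with a hypothetical one-column frame showing that the two conventions give different z-scores:

```python
import pandas as pd

# Hypothetical data chosen so the arithmetic is easy to follow (mean is 3).
df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})

z_sample = (df - df.mean()) / df.std()            # pandas default, ddof=1
z_population = (df - df.mean()) / df.std(ddof=0)  # population std, ddof=0

# Same numerator (-2 for the first row), different denominators.
print(z_sample["x"].iloc[0])
print(z_population["x"].iloc[0])
```

The gap between the two shrinks as the number of rows grows, but on a handful of rows it is clearly visible.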

User · Answer

Here's another way of getting the z-score, using a custom function:

    In [6]: import pandas as pd; import numpy as np

    In [7]: np.random.seed(0)  # Fixes the random seed

    In [8]: df = pd.DataFrame(np.random.randn(5, 3), columns=["randomA", "randomB", "randomC"])

    In [9]: df  # watch output of dataframe
    Out[9]:
        randomA   randomB   randomC
    0  1.764052  0.400157  0.978738
    1  2.240893  1.867558 -0.977278
    2  0.950088 -0.151357 -0.103219
    3  0.410599  0.144044  1.454274
    4  0.761038  0.121675  0.443863

    ## Create custom function to compute Zscore
    In [10]: def z_score(df):
       ....:     df.columns = [x + "_zscore" for x in df.columns.tolist()]
       ....:     return ((df - df.mean()) / df.std(ddof=0))
       ....:

    ## make sure you filter or select columns of interest before passing dataframe to function
    In [11]: z_score(df)  # compute Zscore
    Out[11]:
       randomA_zscore  randomB_zscore  randomC_zscore
    0        0.798350       -0.106335        0.731041
    1        1.505002        1.939828       -1.577295
    2       -0.407899       -0.875374       -0.545799
    3       -1.207392       -0.463464        1.292230
    4       -0.688061       -0.494655        0.099824

Result reproduced using scipy.stats zscore:

    In [12]: from scipy.stats import zscore

    In [13]: df.apply(zscore)  # (Credit: Manuel)
    Out[13]:
        randomA   randomB   randomC
    0  0.798350 -0.106335  0.731041
    1  1.505002  1.939828 -1.577295
    2 -0.407899 -0.875374 -0.545799
    3 -1.207392 -0.463464  1.292230
    4 -0.688061 -0.494655  0.099824
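Note that assigning to `df.columns` inside the function renames the columns of the caller's dataframe as a side effect. A hedged variant that leaves the input untouched, using `DataFrame.add_suffix` on the result instead; the parameter name `frame` and the variable `result` are choices made for this sketch.

```python
import numpy as np
import pandas as pd

def z_score(frame):
    """Return a new frame of z-scores; the caller's frame is not modified."""
    z = (frame - frame.mean()) / frame.std(ddof=0)
    # Rename only the result, so the original column names survive.
    return z.add_suffix("_zscore")

np.random.seed(0)  # same seed as in the answer above
df = pd.DataFrame(np.random.randn(5, 3), columns=["randomA", "randomB", "randomC"])

result = z_score(df)
print(result)
print(df.columns.tolist())  # still the original names
```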

User · Answer

The almost one-liner solution:

    df2 = (df.iloc[:, 1:] - df.iloc[:, 1:].mean()) / df.iloc[:, 1:].std()
    df2['ID'] = df['ID']
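Since the question asks about indexing: the index is just the set of row labels, and moving `ID` into it takes that column out of the arithmetic without any positional slicing. A minimal sketch under that idea, using the sample rows from the question; `ddof=0` is an assumption that follows the question's formula rather than the one-liner above, and the `to_excel` call is left commented out because it needs an Excel writer package such as openpyxl.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "ID": ["PT 6", "PT 8", "PT 2", "PT 9"],
    "Age": [48, 43, 39, 41],
    "BMI": [19.3, 20.9, 18.1, 19.5],
    "Risk Factor": [4, np.nan, 3, np.nan],
})

# ID becomes the row labels, so only numeric columns remain as data.
indexed = df.set_index("ID")
df2 = (indexed - indexed.mean()) / indexed.std(ddof=0)

# Turn the labels back into an ordinary first column.
df2 = df2.reset_index()
print(df2)

# df2.to_excel("Z-Scores.xlsx", index=False)
```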
