Normalize data in pandas

Question

Suppose I have a pandas data frame df.

I want to calculate the column-wise mean of the data frame. This is easy:

    df.apply(average)

Then the column-wise range, max(col) - min(col). This is easy again:

    df.apply(max) - df.apply(min)

Now, for each element, I want to subtract its column's mean and divide by its column's range. I am not sure how to do that.

Any help/pointers are much appreciated.

User · Answer

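If you don't mind importing the sklearn library, I would recommend the method described on this blog.

    import pandas as pd
    from sklearn import preprocessing

    data = {'score': [234, 24, 14, 27, -74, 46, 73, -18, 59, 160]}
    df = pd.DataFrame(data)
    cols = df.columns  # take the column names from the DataFrame, not the dict
    df

    min_max_scaler = preprocessing.MinMaxScaler()
    np_scaled = min_max_scaler.fit_transform(df)
    df_normalized = pd.DataFrame(np_scaled, columns=cols)
    df_normalized

Note that MinMaxScaler rescales each column to the interval [0, 1] by subtracting the column minimum, not the column mean.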
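To make the difference between the two scalings concrete, here is a minimal pandas-only sketch (no sklearn needed) on the same sample data. The variable names `min_max`, `mean_norm` and `offset` are illustrative and not from the answer; the min-max formula assumes MinMaxScaler's default feature range of (0, 1).

```python
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame({"score": [234, 24, 14, 27, -74, 46, 73, -18, 59, 160]})

col_range = df.max() - df.min()

# Min-max scaling: each column ends up in [0, 1].
min_max = (df - df.min()) / col_range

# Mean normalization: what the question asks for (column mean becomes 0).
mean_norm = (df - df.mean()) / col_range

# The two results differ only by a constant offset per column.
offset = (df.mean() - df.min()) / col_range
```

Both versions divide by the same range, so subtracting `offset` from the min-max result should give the mean-normalized one.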

User · Answer

This is how you do it column-wise:

    [df[col].update((df[col] - df[col].min()) / (df[col].max() - df[col].min())) for col in df.columns]

User · Answer

Slightly modified from "Python Pandas Dataframe: Normalize data between 0.01 and 0.99?", but from some of the comments I thought it was relevant (sorry if this is considered a repost).

I wanted customized normalization, because the regular percentile of a datum or its z-score was not adequate. Sometimes I knew what the feasible max and min of the population were, and therefore wanted to define them independently of my sample, or use a different midpoint, or whatever. This can often be useful for rescaling and normalizing data for neural nets, where you may want all inputs between 0 and 1 but some of your data may need to be scaled in a more customized way, because percentiles and standard deviations assume your sample covers the population, and sometimes we know this isn't true. It was also very useful for me when visualizing data in heatmaps. So I built a custom function (I used extra steps in the code here to make it as readable as possible):

    from numpy import mean, median  # needed for center='avg' and center='median'

    def NormData(s, low='min', center='mid', hi='max', insideout=False, shrinkfactor=0.):
        # Resolve the lower and upper bounds.
        if low == 'min':
            low = min(s)
        elif low == 'abs':
            low = max(abs(min(s)), abs(max(s))) * -1.  # sign(min(s))
        if hi == 'max':
            hi = max(s)
        elif hi == 'abs':
            hi = max(abs(min(s)), abs(max(s))) * 1.  # sign(max(s))

        # Resolve the center point.
        if center == 'mid':
            center = (max(s) + min(s)) / 2
        elif center == 'avg':
            center = mean(s)
        elif center == 'median':
            center = median(s)

        # Shift everything so that the center sits at 0.
        s2 = [x - center for x in s]
        hi = hi - center
        low = low - center
        center = 0.

        r = []

        for x in s2:
            if x < low:
                r.append(0.)
            elif x > hi:
                r.append(1.)
            else:
                if x >= center:
                    r.append((x - center) / (hi - center) * 0.5 + 0.5)
                else:
                    r.append((x - low) / (center - low) * 0.5 + 0.)

        if insideout:
            # Values nearest the center become the largest.
            ir = [(1. - abs(z - 0.5) * 2.) for z in r]
            r = ir

        # Pull the results away from the endpoints 0 and 1.
        rr = [x - (x - 0.5) * shrinkfactor for x in r]
        return rr

This takes a pandas Series, or even just a list, and normalizes it to your specified low, center, and high points. There is also a shrink factor, which lets you scale the data down away from the endpoints 0 and 1 (I had to do this when combining colormaps in matplotlib: "Single pcolormesh with more than one colormap using Matplotlib"). You can likely see how the code works, but basically: say you have the values [-5, 2, 10] in a sample, but want to normalize based on a range of -7 to 7 (so anything above 7, our "10", is effectively treated as a 7) with a midpoint of 1, and shrink the result to fit a 256-level RGB colormap:

    In [1]: NormData([-5, 2, 10], low=-7, center=1, hi=7, shrinkfactor=2./256)
    Out[1]: [0.1279296875, 0.5826822916666667, 0.99609375]

It can also turn your data inside out. This may seem odd, but I found it useful for heatmapping. Say you want a darker color for values closer to 0 rather than for the high and low extremes. You could heatmap based on normalized data where insideout=True:

    In [2]: NormData([-5, 2, 10], low=-7, center=1, hi=7, insideout=True, shrinkfactor=2./256)
    Out[2]: [0.251953125, 0.8307291666666666, 0.00390625]

So now "2", which is closest to the center (defined as "1"), has the highest value.

Anyway, I thought my application was relevant if you're looking to rescale data in other ways that could be useful to you.
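For the case where low, center and hi are all given as numbers, the core mapping of NormData (low to 0, center to 0.5, hi to 1, with clipping outside that range) can likely be sketched in vectorized form with `np.interp`. This is only an illustrative alternative, not the author's function: `piecewise_norm` is a hypothetical name, it assumes low < center < hi, and it leaves out the 'min'/'abs'/'avg' string options and the insideout step.

```python
import numpy as np

def piecewise_norm(values, low, center, hi, shrinkfactor=0.0):
    """Map low -> 0, center -> 0.5, hi -> 1 piecewise-linearly (assumes low < center < hi)."""
    x = np.asarray(values, dtype=float)
    # np.interp interpolates between the breakpoints and clamps values outside them.
    r = np.interp(x, [low, center, hi], [0.0, 0.5, 1.0])
    # Pull results toward 0.5, mirroring the shrinkfactor step in NormData.
    return r - (r - 0.5) * shrinkfactor

result = piecewise_norm([-5, 2, 10], low=-7, center=1, hi=7, shrinkfactor=2.0 / 256)
```

With the same inputs as the first example above, `result` should agree with the values NormData returns there.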

User · Answer

You can use apply for this, and it's a bit neater:

    import numpy as np
    import pandas as pd

    np.random.seed(1)

    df = pd.DataFrame(np.random.randn(4, 4) * 4 + 3)

              0         1         2         3
    0  9.497381  0.552974  0.887313 -1.291874
    1  6.461631 -6.206155  9.979247 -0.044828
    2  4.276156  2.002518  8.848432 -5.240563
    3  1.710331  1.463783  7.535078 -1.399565

    df.apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))

              0         1         2         3
    0  0.515087  0.133967 -0.651699  0.135175
    1  0.125241 -0.689446  0.348301  0.375188
    2 -0.155414  0.310554  0.223925 -0.624812
    3 -0.484913  0.244924  0.079473  0.114448

Also, it works nicely with groupby, if you select the relevant columns:

    df['grp'] = ['A', 'A', 'B', 'B']

              0         1         2         3 grp
    0  9.497381  0.552974  0.887313 -1.291874   A
    1  6.461631 -6.206155  9.979247 -0.044828   A
    2  4.276156  2.002518  8.848432 -5.240563   B
    3  1.710331  1.463783  7.535078 -1.399565   B

    df.groupby(['grp'])[[0, 1, 2, 3]].apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))

         0    1    2    3
    0  0.5  0.5 -0.5 -0.5
    1 -0.5 -0.5  0.5  0.5
    2  0.5  0.5  0.5 -0.5
    3 -0.5 -0.5 -0.5  0.5
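As a variation on the groupby example, here is a minimal sketch that uses `transform` with the pandas reduction methods instead of `apply` with NumPy functions, so the per-column reduction within each group is explicit and the result keeps the original row index. The helper name `mean_range_normalize` is made up for illustration; the data setup follows the answer above.

```python
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame(np.random.randn(4, 4) * 4 + 3)
df["grp"] = ["A", "A", "B", "B"]

def mean_range_normalize(col):
    # Subtract the mean and divide by the range, column by column within a group.
    return (col - col.mean()) / (col.max() - col.min())

# transform returns a result aligned with the original rows of df.
normalized = df.groupby("grp")[[0, 1, 2, 3]].transform(mean_range_normalize)
```

Because each group here has only two rows, every normalized value is either 0.5 or -0.5, as in the output shown above.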

User · Answer

    In [92]: df
    Out[92]:
               a         b          c         d
    A  -0.488816  0.863769   4.325608 -4.721202
    B -11.937097  2.993993 -12.916784 -1.086236
    C  -5.569493  4.672679  -2.168464 -9.315900
    D   8.892368  0.932785   4.535396  0.598124

    In [93]: df_norm = (df - df.mean()) / (df.max() - df.min())

    In [94]: df_norm
    Out[94]:
              a         b         c         d
    A  0.085789 -0.394348  0.337016 -0.109935
    B -0.463830  0.164926 -0.650963  0.256714
    C -0.158129  0.605652 -0.035090 -0.573389
    D  0.536170 -0.376229  0.349037  0.426611

    In [95]: df_norm.mean()
    Out[95]:
    a   -2.081668e-17
    b    4.857226e-17
    c    1.734723e-17
    d   -1.040834e-17

    In [96]: df_norm.max() - df_norm.min()
    Out[96]:
    a    1
    b    1
    c    1
    d    1
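If the frame also holds non-numeric columns, the same expression can be wrapped in a small helper that touches only the numeric ones. This is a sketch under assumptions, not part of the answer above: the function name `mean_normalize` and the sample frame are invented for illustration.

```python
import pandas as pd

def mean_normalize(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with each numeric column mean-centered and divided by its range."""
    out = frame.copy()
    num = out.select_dtypes(include="number").columns
    col_range = out[num].max() - out[num].min()
    out[num] = (out[num] - out[num].mean()) / col_range
    return out

# Hypothetical sample data with one non-numeric column.
df = pd.DataFrame(
    {"a": [1.0, 2.0, 4.0], "b": [10.0, 30.0, 20.0], "label": ["x", "y", "z"]}
)
df_norm = mean_normalize(df)
```

Each numeric column of `df_norm` should then have a mean of about 0 and a range of 1, while `label` and the original `df` are left unchanged. A constant column has zero range and comes out as NaN.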
