pandas DataFrame replace nan values with average of columns

Question

I ve got a pandas DataFrame filled mostly with real numbers  but there is a few nan values in it as well   How can I replace the nans with averages of columns where they are   This question is very similar to this one  numpy array  replace nan values with average of columns  but  unfortunately  the solution given there doesn t work for a pandas DataFrame

User · Answer

Directly use df fillna df mean    to fill all the null value with mean  If you want to fill null value with mean of that column then you can use this  suppose x df  Item Weight   here Item Weight is column name  here we are assigning  fill null values of x with mean of x into x   df  Item Weight     df  Item Weight   fillna  df  Item Weight   mean       If you want to fill null value with some string then use   here Outlet size is column name   df Outlet Size   df Outlet Size fillna  Missing

User · Answer

To read data from csv file Dataset   pd read csv  Data csv    X   Dataset iloc     -1  values    To calculate mean use imputer class from sklearn impute import SimpleImputer imputer   SimpleImputer missing values np nan  strategy  mean   imputer   imputer fit X    1 3   X    1 3    imputer transform X    1 3

User · Answer

Although  the below code does the job  BUT its performance takes a big hit  as you deal with a DataFrame with   records 100k or more  df fillna df mean     In my experience  one should replace NaN values  be it with Mean or Median   only where it is required  rather than applying fillna   all over the DataFrame  I had a DataFrame with 20 variables  and only 4 of them required NaN values treatment  replacement   I tried the above code  Code 1   along with a slightly modified version of it  code 2   where i ran it selectively  i e  only on variables which had a NaN value  ------------------------------------------------  ---- Code 1  Treatment on overall DataFrame-----  df fillna df mean      ------------------------------------------------  ---- Code 2  Selective Treatment----------------  for i in df columns df isnull   any axis 0         ---Applying Only on variables with NaN values     df i  fillna df i  mean   inplace True    ---df isnull   any axis 0  gives True False flag  Boolean value series     ---which when applied on df columns    helps identify variables with NaN values  Below is the performance i observed  as i kept on increasing the   records in DataFrame DataFrame with  100k records  Code 1  22 06 Seconds Code 2   0 03 Seconds  DataFrame with  200k records  Code 1  180 06 Seconds Code 2    0 06 Seconds  DataFrame with  1 6 Million records  Code 1  code kept running endlessly Code 2  0 40 Seconds  DataFrame with  13 Million records  Code 1  --did not even try  after seeing performance on 1 6 Mn records-- Code 2  3 20 Seconds  Apologies for a long answer   Hope this helps

User · Answer

If you want to impute missing values with mean and you want to go column by column  then this will only impute with the mean of that column  This might be a little more readable   sub2  income     sub2  income   fillna  sub2  income   mean

User · Answer

Try   sub2  income   fillna  sub2  income   mean     inplace True

User · Answer

Pandas  How to replace NaN  nan  values with the average  mean   median or other statistics of one column  Say your DataFrame is df and you have one column called nr items  This is   df  nr items    If you want to replace the NaN values of your column df  nr items   with the mean of the column   Use method  fillna     mean value df  nr items   mean   df  nr item ave   df  nr items   fillna mean value   I have created a new df column called nr item ave to store the new column with the NaN values replaced by the mean value of the column   You should be careful when using the mean  If you have outliers is more recommendable to use the median

User · Answer

In  16   df   DataFrame np random randn 10 3    In  17   df iloc 3 5 0    np nan  In  18   df iloc 4 6 1    np nan  In  19   df iloc 5 8 2    np nan  In  20   df Out 20              0         1         2 0  1 148272  0 227366 -2 368136 1 -0 820823  1 071471 -0 784713 2  0 157913  0 602857  0 665034 3       NaN -0 985188 -0 324136 4       NaN       NaN  0 238512 5  0 769657       NaN       NaN 6  0 141951  0 326064       NaN 7 -1 694475 -0 523440       NaN 8  0 352556 -0 551487 -1 639298 9 -2 067324 -0 492617 -1 675794  In  22   df mean   Out 22    0   -0 251534 1   -0 040622 2   -0 841219 dtype  float64   Apply per-column the mean of that columns and fill  In  23   df apply lambda x  x fillna x mean    axis 0  Out 23              0         1         2 0  1 148272  0 227366 -2 368136 1 -0 820823  1 071471 -0 784713 2  0 157913  0 602857  0 665034 3 -0 251534 -0 985188 -0 324136 4 -0 251534 -0 040622  0 238512 5  0 769657 -0 040622 -0 841219 6  0 141951  0 326064 -0 841219 7 -1 694475 -0 523440 -0 841219 8  0 352556 -0 551487 -1 639298 9 -2 067324 -0 492617 -1 675794

User · Answer

You can simply use DataFrame fillna to fill the nan s directly   In  27   df  Out 27              A         B         C 0 -0 166919  0 979728 -0 632955 1 -0 297953 -0 912674 -1 365463 2 -0 120211 -0 540679 -0 680481 3       NaN -2 027325  1 533582 4       NaN       NaN  0 461821 5 -0 788073       NaN       NaN 6 -0 916080 -0 612343       NaN 7 -0 887858  1 033826       NaN 8  1 948430  1 025011 -2 982224 9  0 019698 -0 795876 -0 046431  In  28   df mean   Out 28    A   -0 151121 B   -0 231291 C   -0 530307 dtype  float64  In  29   df fillna df mean    Out 29              A         B         C 0 -0 166919  0 979728 -0 632955 1 -0 297953 -0 912674 -1 365463 2 -0 120211 -0 540679 -0 680481 3 -0 151121 -2 027325  1 533582 4 -0 151121 -0 231291  0 461821 5 -0 788073 -0 231291 -0 530307 6 -0 916080 -0 612343 -0 530307 7 -0 887858  1 033826 -0 530307 8  1 948430  1 025011 -2 982224 9  0 019698 -0 795876 -0 046431   The docstring of fillna says that value should be a scalar or a dict  however  it seems to work with a Series as well  If you want to pass a dict  you could use df mean   to dict

User · Answer

Another option besides those above is   df   df groupby df columns  axis   1  transform lambda x  x fillna x mean       It s less elegant than previous responses for mean  but it could be shorter if you desire to replace nulls by some other column function

User · Answer

using sklearn library preprocessing class  from sklearn impute import SimpleImputer missingvalues   SimpleImputer missing values   np nan  strategy    mean   axis   0  missingvalues   missingvalues fit x   1 3   x   1 3    missingvalues transform x   1 3     Note  In the recent version parameter missing values value change to np nan from NaN

[python] pandas DataFrame: replace nan values with average of columns

Examples related to python

Examples related to pandas

Examples related to nan