[python] pandas DataFrame: replace nan values with average of columns

I've got a pandas DataFrame filled mostly with real numbers, but there is a few nan values in it as well.

How can I replace the nans with averages of columns where they are?

This question is very similar to this one: numpy array: replace nan values with average of columns but, unfortunately, the solution given there doesn't work for a pandas DataFrame.

This question is related to python pandas nan

The answer is


If you want to impute missing values with mean and you want to go column by column, then this will only impute with the mean of that column. This might be a little more readable.

sub2['income'] = sub2['income'].fillna((sub2['income'].mean()))

In [16]: df = DataFrame(np.random.randn(10,3))

In [17]: df.iloc[3:5,0] = np.nan

In [18]: df.iloc[4:6,1] = np.nan

In [19]: df.iloc[5:8,2] = np.nan

In [20]: df
Out[20]: 
          0         1         2
0  1.148272  0.227366 -2.368136
1 -0.820823  1.071471 -0.784713
2  0.157913  0.602857  0.665034
3       NaN -0.985188 -0.324136
4       NaN       NaN  0.238512
5  0.769657       NaN       NaN
6  0.141951  0.326064       NaN
7 -1.694475 -0.523440       NaN
8  0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794

In [22]: df.mean()
Out[22]: 
0   -0.251534
1   -0.040622
2   -0.841219
dtype: float64

Apply per-column the mean of that columns and fill

In [23]: df.apply(lambda x: x.fillna(x.mean()),axis=0)
Out[23]: 
          0         1         2
0  1.148272  0.227366 -2.368136
1 -0.820823  1.071471 -0.784713
2  0.157913  0.602857  0.665034
3 -0.251534 -0.985188 -0.324136
4 -0.251534 -0.040622  0.238512
5  0.769657 -0.040622 -0.841219
6  0.141951  0.326064 -0.841219
7 -1.694475 -0.523440 -0.841219
8  0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794

Although, the below code does the job, BUT its performance takes a big hit, as you deal with a DataFrame with # records 100k or more:

df.fillna(df.mean())

In my experience, one should replace NaN values (be it with Mean or Median), only where it is required, rather than applying fillna() all over the DataFrame.

I had a DataFrame with 20 variables, and only 4 of them required NaN values treatment (replacement). I tried the above code (Code 1), along with a slightly modified version of it (code 2), where i ran it selectively .i.e. only on variables which had a NaN value

#------------------------------------------------
#----(Code 1) Treatment on overall DataFrame-----

df.fillna(df.mean())

#------------------------------------------------
#----(Code 2) Selective Treatment----------------

for i in df.columns[df.isnull().any(axis=0)]:     #---Applying Only on variables with NaN values
    df[i].fillna(df[i].mean(),inplace=True)

#---df.isnull().any(axis=0) gives True/False flag (Boolean value series), 
#---which when applied on df.columns[], helps identify variables with NaN values

Below is the performance i observed, as i kept on increasing the # records in DataFrame

DataFrame with ~100k records

  • Code 1: 22.06 Seconds
  • Code 2: 0.03 Seconds

DataFrame with ~200k records

  • Code 1: 180.06 Seconds
  • Code 2: 0.06 Seconds

DataFrame with ~1.6 Million records

  • Code 1: code kept running endlessly
  • Code 2: 0.40 Seconds

DataFrame with ~13 Million records

  • Code 1: --did not even try, after seeing performance on 1.6 Mn records--
  • Code 2: 3.20 Seconds

Apologies for a long answer ! Hope this helps !


Directly use df.fillna(df.mean()) to fill all the null value with mean

If you want to fill null value with mean of that column then you can use this

suppose x=df['Item_Weight'] here Item_Weight is column name

here we are assigning (fill null values of x with mean of x into x)

df['Item_Weight'] = df['Item_Weight'].fillna((df['Item_Weight'].mean()))

If you want to fill null value with some string then use

here Outlet_size is column name

df.Outlet_Size = df.Outlet_Size.fillna('Missing')

Another option besides those above is:

df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))

It's less elegant than previous responses for mean, but it could be shorter if you desire to replace nulls by some other column function.


Try:

sub2['income'].fillna((sub2['income'].mean()), inplace=True)

using sklearn library preprocessing class

from sklearn.impute import SimpleImputer
missingvalues = SimpleImputer(missing_values = np.nan, strategy = 'mean', axis = 0)
missingvalues = missingvalues.fit(x[:,1:3])
x[:,1:3] = missingvalues.transform(x[:,1:3])

Note: In the recent version parameter missing_values value change to np.nan from NaN


# To read data from csv file
Dataset = pd.read_csv('Data.csv')

X = Dataset.iloc[:, :-1].values

# To calculate mean use imputer class
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

Pandas: How to replace NaN (nan) values with the average (mean), median or other statistics of one column

Say your DataFrame is df and you have one column called nr_items. This is: df['nr_items']

If you want to replace the NaN values of your column df['nr_items'] with the mean of the column:

Use method .fillna():

mean_value=df['nr_items'].mean()
df['nr_item_ave']=df['nr_items'].fillna(mean_value)

I have created a new df column called nr_item_ave to store the new column with the NaN values replaced by the mean value of the column.

You should be careful when using the mean. If you have outliers is more recommendable to use the median


Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to pandas

xlrd.biffh.XLRDError: Excel xlsx file; not supported Pandas Merging 101 How to increase image size of pandas.DataFrame.plot in jupyter notebook? Trying to merge 2 dataframes but get ValueError Python Pandas User Warning: Sorting because non-concatenation axis is not aligned How to show all of columns name on pandas dataframe? Pandas/Python: Set value of one column based on value in another column Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Python convert object to float

Examples related to nan

Display rows with one or more NaN values in pandas dataframe How to find which columns contain any NaN value in Pandas dataframe How to set a cell to NaN in a pandas dataframe Elegant way to create empty pandas DataFrame with NaN of type float How to check if any value is NaN in a Pandas DataFrame How to replace NaNs by preceding values in pandas DataFrame? Pandas Replace NaN with blank/empty string Convert pandas.Series from dtype object to float, and errors to nans How to filter in NaN (pandas)? Replace None with NaN in pandas dataframe