[python] Modifying a subset of rows in a pandas dataframe

Assume I have a pandas DataFrame with two columns, A and B. I'd like to modify this DataFrame (or create a copy) so that B is always NaN whenever A is 0. How would I achieve that?

I tried the following

df['A'==0]['B'] = np.nan

and

df['A'==0]['B'].values.fill(np.nan)

without success.

This question is related to python pandas

The answer is


Here is from pandas docs on advanced indexing:

The section will explain exactly what you need! Turns out df.loc (as .ix has been deprecated -- as many have pointed out below) can be used for cool slicing/dicing of a dataframe. And. It can also be used to set things.

df.loc[selection criteria, columns I want] = value

So Bren's answer is saying 'find me all the places where df.A == 0, select column B and set it to np.nan'


To replace multiples columns convert to numpy array using .values:

df.loc[df.A==0, ['B', 'C']] = df.loc[df.A==0, ['B', 'C']].values / 2

For a massive speed increase, use NumPy's where function.

Setup

Create a two-column DataFrame with 100,000 rows with some zeros.

df = pd.DataFrame(np.random.randint(0,3, (100000,2)), columns=list('ab'))

Fast solution with numpy.where

df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)

Timings

%timeit df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)
685 µs ± 6.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.loc[df['a'] == 0, 'b'] = np.nan
3.11 ms ± 17.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Numpy's where is about 4x faster


Starting from pandas 0.20 ix is deprecated. The right way is to use df.loc

here is a working example

>>> import pandas as pd 
>>> import numpy as np 
>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
   A   B
0  0 NaN
1  1   0
2  0 NaN
>>> 

Explanation:

As explained in the doc here, .loc is primarily label based, but may also be used with a boolean array.

So, what we are doing above is applying df.loc[row_index, column_index] by:

  • Exploiting the fact that loc can take a boolean array as a mask that tells pandas which subset of rows we want to change in row_index
  • Exploiting the fact loc is also label based to select the column using the label 'B' in the column_index

We can use logical, condition or any operation that returns a series of booleans to construct the array of booleans. In the above example, we want any rows that contain a 0, for that we can use df.A == 0, as you can see in the example below, this returns a series of booleans.

>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df 
   A  B
0  0  2
1  1  0
2  0  5
>>> df.A == 0 
0     True
1    False
2     True
Name: A, dtype: bool
>>> 

Then, we use the above array of booleans to select and modify the necessary rows:

>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
   A   B
0  0 NaN
1  1   0
2  0 NaN

For more information check the advanced indexing documentation here.