[python] Replacing blank values (white space) with NaN in pandas

I want to find all values in a Pandas dataframe that contain whitespace (any arbitrary amount) and replace those values with NaNs.

Any ideas how this can be improved?

Basically I want to turn this:

                   A    B    C
2000-01-01 -0.532681  foo    0
2000-01-02  1.490752  bar    1
2000-01-03 -1.387326  foo    2
2000-01-04  0.814772  baz     
2000-01-05 -0.222552         4
2000-01-06 -1.176781  qux     

Into this:

                   A     B     C
2000-01-01 -0.532681   foo     0
2000-01-02  1.490752   bar     1
2000-01-03 -1.387326   foo     2
2000-01-04  0.814772   baz   NaN
2000-01-05 -0.222552   NaN     4
2000-01-06 -1.176781   qux   NaN

I've managed to do it with the code below, but man is it ugly. It's not Pythonic and I'm sure it's not the most efficient use of pandas either. I loop through each column and do boolean replacement against a column mask generated by applying a function that does a regex search of each value, matching on whitespace.

for col in df.columns:
    # Mask of cells whose string form is empty or whitespace-only
    mask = df[col].apply(lambda v: bool(re.search(r'^\s*$', str(v))))
    df.loc[mask, col] = None

It could be optimized a bit by only iterating through fields that could contain empty strings:

if df[i].dtype == np.dtype('object'):

But that's not much of an improvement

And finally, this code sets the target strings to None, which works with Pandas' functions like fillna(), but it would be nice for completeness if I could actually insert a NaN directly instead of None.


The answer is


If you want to replace empty strings as well as values that contain only whitespace, the correct answer is:

df = df.replace(r'^\s*$', np.nan, regex=True)

The accepted answer

df.replace(r'\s+', np.nan, regex=True)

does not replace an empty string. You can try it yourself with the given example, slightly updated:

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'fo o', 2],
    [0.814772, 'baz', ' '],     
    [-0.222552, '   ', 4],
    [-1.176781,  'qux', ''],         
], columns='A B C'.split(), index=pd.date_range('2000-01-01','2000-01-06'))

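Note also that 'fo o' is not replaced with NaN by the anchored pattern ^\s*$, even though it contains a space, because the pattern has to match the whole value. Further note that a simple

df.replace(r'', np.nan)

does not work either: without regex=True it only replaces values that are exactly the empty string, so whitespace-only values are left alone. Try it out.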
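To make the difference concrete, here is a minimal sketch that runs the anchored regex and the plain empty-string replace side by side. The small frame and the names anchored and exact are made up for illustration and are not taken from the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame: B and C mix real text, whitespace and empty strings.
df = pd.DataFrame(
    {
        "A": [-0.532681, 1.490752, -1.387326],
        "B": ["fo o", "   ", "qux"],
        "C": ["x", " ", ""],
    }
)

# Anchored pattern: the whole value must be empty or whitespace-only.
anchored = df.replace(r"^\s*$", np.nan, regex=True)

# Exact (non-regex) match: only cells equal to '' are replaced.
exact = df.replace("", np.nan)

print(anchored)
print(exact)
```

In the first result 'fo o' survives while '   ', ' ' and '' all become NaN. In the second only the single '' cell changes.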


If you are reading the data from a CSV file, it can be as simple as this:

df = pd.read_csv(file_csv, na_values=' ')

This creates the data frame and, in the same step, replaces blank values (fields that consist of a single space) with NaN.
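A small sketch of that approach, assuming an in-memory CSV in place of file_csv (the text in csv_text is made up). Since na_values=' ' is an exact string match, the sketch adds a regex pass afterwards for longer runs of whitespace.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content standing in for file_csv.
csv_text = "A,B\n1, \n2,   \n3,foo\n"

# The single-space field is treated as missing while parsing.
raw = pd.read_csv(io.StringIO(csv_text), na_values=" ")

# Follow-up pass that catches whitespace runs of any length.
cleaned = raw.replace(r"^\s*$", np.nan, regex=True)

print(raw)
print(cleaned)
```

If your files only ever contain single-space blanks, the read_csv call alone should be enough.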


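You can also use a boolean filter to do it.

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'foo', 2],
    [0.814772, 'baz', ' '],
    [-0.222552, '   ', 4],
    [-1.176781, 'qux', '  '],
])
# Boolean mask: True where a cell is empty or whitespace-only after stripping
blank = df.apply(lambda col: col.astype(str).str.strip() == '')
df[blank] = np.nan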

print(df.isnull().sum())  # count the null values in each column

modifiedDf = df.fillna("NaN")  # replace null values with the string "NaN"

# modifiedDf = df.dropna()  # alternatively, remove rows that contain null values

print(modifiedDf.isnull().sum())  # count the null values in each column again

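I did this:

df = df.apply(lambda x: x.str.strip()).replace('', np.nan)

This is only safe when every column holds nothing but strings: the .str accessor raises on numeric columns and turns non-string cells in object columns into NaN. For mixed data, strip element-wise instead:

df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x).replace('', np.nan)

Either way, you strip all strings, then replace the empty strings with np.nan.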
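Here is a sketch of the strip-then-replace idea that leaves non-string cells alone. The helper strip_cell and the sample data are hypothetical. It goes column by column with Series.map so it does not depend on which element-wise DataFrame method your pandas version offers.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one numeric column and two text columns.
df = pd.DataFrame(
    {
        "A": [1.5, 2.5, 3.5],
        "B": ["foo", "   ", ""],
        "C": [" bar ", "baz", " "],
    }
)


def strip_cell(value):
    """Strip surrounding whitespace from strings; return other values as-is."""
    return value.strip() if isinstance(value, str) else value


cleaned = df.copy()
for col in cleaned.columns:
    # Series.map applies the function to each element of the column.
    cleaned[col] = cleaned[col].map(strip_cell)

# After stripping, whitespace-only cells are '' and can be replaced exactly.
cleaned = cleaned.replace("", np.nan)
print(cleaned)
```

One side effect to be aware of: values you keep are stripped too, so ' bar ' comes out as 'bar'.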


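How about:

d = d.applymap(lambda x: np.nan if isinstance(x, str) and x.isspace() else x)

The applymap function applies a function to every cell of the dataframe (in pandas 2.1 and later the same method is available as DataFrame.map). On Python 2, use basestring instead of str.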
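As a rough sketch of the element-wise approach, the snippet below wraps the check in a named function and picks whichever element-wise method the installed pandas provides. The names blank_to_nan and elementwise are made up for this example.

```python
import numpy as np
import pandas as pd


def blank_to_nan(value):
    """Return NaN for whitespace-only strings; pass everything else through."""
    if isinstance(value, str) and value.isspace():
        return np.nan
    return value


def elementwise(frame, func):
    """Apply func to every cell, using DataFrame.map when it exists."""
    if hasattr(pd.DataFrame, "map"):  # pandas 2.1 and later
        return frame.map(func)
    return frame.applymap(func)  # older pandas


d = pd.DataFrame({"A": [1.0, 2.0], "B": ["foo", "  "], "C": ["", "\t"]})
out = elementwise(d, blank_to_nan)
print(out)
```

Note that ''.isspace() is False, so the empty string in column C is left as it is. Add an `or not value` check if you want it replaced as well.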


Simplest of all solutions (note that \s+ needs at least one whitespace character, so empty strings are not replaced):

df = df.replace(r'^\s+$', np.nan, regex=True)

This should work to replace empty strings in a given column:

df.loc[df.Variable == '', 'Variable'] = 'Value'

or

df.loc[df.Variable1 == '', 'Variable2'] = 'Value'

This is not an elegant solution, but what does seem to work is saving to XLSX and then importing it back. The other solutions on this page did not work for me, unsure why.

data.to_excel(filepath, index=False)
data = pd.read_excel(filepath)

These are all close to the right answer, but I wouldn't say any of them solves the problem while staying readable to others reading your code. I'd say the answer is a combination of BrenBarn's answer and tuomasttik's comment below that answer. BrenBarn's answer uses the isspace builtin but does not handle empty strings, which the OP requested and which I would consider the standard use case when replacing strings with null.

I rewrote it with .apply, which works element-wise on a pd.Series (called s below). On a pd.DataFrame, .apply passes whole columns to the function, so use applymap there (DataFrame.map in pandas 2.1+) with the same lambda.


Python 3:

To replace empty strings or strings of entirely spaces:

s = s.apply(lambda x: np.nan if isinstance(x, str) and (x.isspace() or not x) else x)

To replace strings of entirely spaces:

s = s.apply(lambda x: np.nan if isinstance(x, str) and x.isspace() else x)

To use this in Python 2, you'll need to replace str with basestring.

Python 2:

To replace empty strings or strings of entirely spaces:

s = s.apply(lambda x: np.nan if isinstance(x, basestring) and (x.isspace() or not x) else x)

To replace strings of entirely spaces:

s = s.apply(lambda x: np.nan if isinstance(x, basestring) and x.isspace() else x)
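One thing worth checking before relying on .apply for a whole frame: Series.apply calls the function once per element, while DataFrame.apply calls it once per column. The sketch below (function name and data are made up) shows what that means for the isinstance check.

```python
import numpy as np
import pandas as pd


def empty_or_blank_to_nan(value):
    """Return NaN for empty or whitespace-only strings, else the value itself."""
    if isinstance(value, str) and (value.isspace() or not value):
        return np.nan
    return value


# Series.apply is element-wise, so every cell is tested.
s = pd.Series(["foo", "", "   ", 4])
cleaned_series = s.apply(empty_or_blank_to_nan)

df = pd.DataFrame({"B": ["foo", "   "], "C": ["", "bar"]})

# DataFrame.apply passes each column as a Series, so the str check never matches.
unchanged = df.apply(empty_or_blank_to_nan)

# Applying the function per element inside each column does the job.
cleaned_frame = df.apply(lambda col: col.apply(empty_or_blank_to_nan))

print(cleaned_series)
print(unchanged)
print(cleaned_frame)
```

So for a DataFrame you either nest the element-wise apply as above or use the frame's own element-wise method.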

This worked for me. When importing my CSV file I added na_values=' ', since a single space is not among the default NaN values.

df = pd.read_csv(filepath, na_values=' ')

For a very fast and simple solution where you check equality against a single value, you can use the mask method.

df.mask(df == ' ')
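A short sketch of mask in action, on made-up data. The first call is the exact-equality version from above. The second builds a broader whitespace-only mask with str.fullmatch, which is my own extension and not part of the answer.

```python
import pandas as pd

# Hypothetical frame: one numeric column and two text columns.
df = pd.DataFrame(
    {
        "A": [0.8, -0.2, -1.1],
        "B": ["baz", " ", "qux"],
        "C": ["   ", "4", ""],
    }
)

# Exact equality: only cells that are a single space become NaN.
exact = df.mask(df == " ")

# Broader mask: True where the cell's string form is empty or whitespace-only.
blank = df.apply(lambda col: col.astype(str).str.fullmatch(r"\s*"))
broader = df.mask(blank)

print(exact)
print(broader)
```

mask returns a new frame and leaves df untouched, so assign the result if you want to keep it.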
