[python] Convert Pandas column containing NaNs to dtype `int`

I read data from a .csv file to a Pandas dataframe as below. For one of the columns, namely id, I want to specify the column type as int. The problem is the id series has missing/empty values.

When I try to cast the id column to integer while reading the .csv, I get:

df= pd.read_csv("data.csv", dtype={'id': int}) 
error: Integer column has NA values

Alternatively, I tried to convert the column type after reading as below, but this time I get:

df= pd.read_csv("data.csv") 
df[['id']] = df[['id']].astype(int)
error: Cannot convert NA to integer

How can I tackle this?

This question is related to python pandas na

The answer is


I had the problem a few weeks ago with a few discrete features which were formatted as 'object'. This solution seemed to work.

for col in discrete:
df[col] = pd.to_numeric(df[col], errors='coerce').astype(pd.Int64Dtype())

Try this:

df[['id']] = df[['id']].astype(pd.Int64Dtype())

If you print it's dtypes, you will get id Int64 instead of normal one int64


If you absolutely want to combine integers and NaNs in a column, you can use the 'object' data type:

df['col'] = (
    df['col'].fillna(0)
    .astype(int)
    .astype(object)
    .where(df['col'].notnull())
)

This will replace NaNs with an integer (doesn't matter which), convert to int, convert to object and finally reinsert NaNs.


If you want to use it when you chain methods, you can use assign:

df = (
     df.assign(col = lambda x: x['col'].astype('Int64'))
)

use pd.to_numeric()

df["DateColumn"] = pd.to_numeric(df["DateColumn"])

simple and clean


It is now possible to create a pandas column containing NaNs as dtype int, since it is now officially added on pandas 0.24.0

pandas 0.24.x release notes Quote: "Pandas has gained the ability to hold integer dtypes with missing values


My use case is munging data prior to loading into a DB table:

df[col] = df[col].fillna(-1)
df[col] = df[col].astype(int)
df[col] = df[col].astype(str)
df[col] = df[col].replace('-1', np.nan)

Remove NaNs, convert to int, convert to str and then reinsert NANs.

It's not pretty but it gets the job done!


I ran into this issue working with pyspark. As this is a python frontend for code running on a jvm, it requires type safety and using float instead of int is not an option. I worked around the issue by wrapping the pandas pd.read_csv in a function that will fill user-defined columns with user-defined fill values before casting them to the required type. Here is what I ended up using:

def custom_read_csv(file_path, custom_dtype = None, fill_values = None, **kwargs):
    if custom_dtype is None:
        return pd.read_csv(file_path, **kwargs)
    else:
        assert 'dtype' not in kwargs.keys()
        df = pd.read_csv(file_path, dtype = {}, **kwargs)
        for col, typ in custom_dtype.items():
            if fill_values is None or col not in fill_values.keys():
                fill_val = -1
            else:
                fill_val = fill_values[col]
            df[col] = df[col].fillna(fill_val).astype(typ)
    return df

You could use .dropna() if it is OK to drop the rows with the NaN values.

df = df.dropna(subset=['id'])

Alternatively, use .fillna() and .astype() to replace the NaN with values and convert them to int.

I ran into this problem when processing a CSV file with large integers, while some of them were missing (NaN). Using float as the type was not an option, because I might loose the precision.

My solution was to use str as the intermediate type. Then you can convert the string to int as you please later in the code. I replaced NaN with 0, but you could choose any value.

df = pd.read_csv(filename, dtype={'id':str})
df["id"] = df["id"].fillna("0").astype(int)

For the illustration, here is an example how floats may loose the precision:

s = "12345678901234567890"
f = float(s)
i = int(f)
i2 = int(s)
print (f, i, i2)

And the output is:

1.2345678901234567e+19 12345678901234567168 12345678901234567890

In version 0.24.+ pandas has gained the ability to hold integer dtypes with missing values.

Nullable Integer Data Type.

Pandas can represent integer data with possibly missing values using arrays.IntegerArray. This is an extension types implemented within pandas. It is not the default dtype for integers, and will not be inferred; you must explicitly pass the dtype into array() or Series:

arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
pd.Series(arr)

0      1
1      2
2    NaN
dtype: Int64

For convert column to nullable integers use:

df['myCol'] = df['myCol'].astype('Int64')

Most solutions here tell you how to use a placeholder integer to represent nulls. That approach isn't helpful if you're uncertain that integer won't show up in your source data though. My method with will format floats without their decimal values and convert nulls to None's. The result is an object datatype that will look like an integer field with null values when loaded into a CSV.

keep_df[col] = keep_df[col].apply(lambda x: None if pandas.isnull(x) else '{0:.0f}'.format(pandas.to_numeric(x)))

As of Pandas 1.0.0 you can now use pandas.NA values. This does not force integer columns with missing values to be floats.

When reading in your data all you have to do is:

df= pd.read_csv("data.csv", dtype={'id': 'Int64'})  

Notice the 'Int64' is surrounded by quotes and the I is capitalized. This distinguishes Panda's 'Int64' from numpy's int64.

As a side note, this will also work with .astype()

df['id'] = df['id'].astype('Int64')

Documentation here https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html


import pandas as pd

df= pd.read_csv("data.csv")
df['id'] = pd.to_numeric(df['id'])

First remove the rows which contain NaN. Then do Integer conversion on remaining rows. At Last insert the removed rows again. Hope it will work


If you can modify your stored data, use a sentinel value for missing id. A common use case, inferred by the column name, being that id is an integer, strictly greater than zero, you could use 0 as a sentinel value so that you can write

if row['id']:
   regular_process(row)
else:
   special_process(row)

Assuming your DateColumn formatted 3312018.0 should be converted to 03/31/2018 as a string. And, some records are missing or 0.

df['DateColumn'] = df['DateColumn'].astype(int)
df['DateColumn'] = df['DateColumn'].astype(str)
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.zfill(8))
df.loc[df['DateColumn'] == '00000000','DateColumn'] = '01011980'
df['DateColumn'] = pd.to_datetime(df['DateColumn'], format="%m%d%Y")
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.strftime('%m/%d/%Y'))

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to pandas

xlrd.biffh.XLRDError: Excel xlsx file; not supported Pandas Merging 101 How to increase image size of pandas.DataFrame.plot in jupyter notebook? Trying to merge 2 dataframes but get ValueError Python Pandas User Warning: Sorting because non-concatenation axis is not aligned How to show all of columns name on pandas dataframe? Pandas/Python: Set value of one column based on value in another column Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Python convert object to float

Examples related to na

Removing NA in dplyr pipe Change the Blank Cells to "NA" How to create an empty matrix in R? Convert Pandas column containing NaNs to dtype `int` How to replace NA values in a table for selected columns How to delete columns that contain ONLY NAs? How to avoid warning when introducing NAs by coercion Replace NA with 0 in a data frame column Exclude Blank and NA in R Omit rows containing specific column of NA