I read data from a .csv file to a Pandas dataframe as below. For one of the columns, namely id
, I want to specify the column type as int
. The problem is the id
series has missing/empty values.
When I try to cast the id
column to integer while reading the .csv, I get:
df= pd.read_csv("data.csv", dtype={'id': int})
error: Integer column has NA values
Alternatively, I tried to convert the column type after reading as below, but this time I get:
df= pd.read_csv("data.csv")
df[['id']] = df[['id']].astype(int)
error: Cannot convert NA to integer
How can I tackle this?
I had the problem a few weeks ago with a few discrete features which were formatted as 'object'. This solution seemed to work.
for col in discrete:
df[col] = pd.to_numeric(df[col], errors='coerce').astype(pd.Int64Dtype())
Try this:
df[['id']] = df[['id']].astype(pd.Int64Dtype())
If you print it's dtypes
, you will get id Int64
instead of normal one int64
If you absolutely want to combine integers and NaNs in a column, you can use the 'object' data type:
df['col'] = (
df['col'].fillna(0)
.astype(int)
.astype(object)
.where(df['col'].notnull())
)
This will replace NaNs with an integer (doesn't matter which), convert to int, convert to object and finally reinsert NaNs.
If you want to use it when you chain methods, you can use assign:
df = (
df.assign(col = lambda x: x['col'].astype('Int64'))
)
use pd.to_numeric()
df["DateColumn"] = pd.to_numeric(df["DateColumn"])
simple and clean
It is now possible to create a pandas column containing NaNs as dtype int
, since it is now officially added on pandas 0.24.0
pandas 0.24.x release notes Quote: "Pandas has gained the ability to hold integer dtypes with missing values
My use case is munging data prior to loading into a DB table:
df[col] = df[col].fillna(-1)
df[col] = df[col].astype(int)
df[col] = df[col].astype(str)
df[col] = df[col].replace('-1', np.nan)
Remove NaNs, convert to int, convert to str and then reinsert NANs.
It's not pretty but it gets the job done!
I ran into this issue working with pyspark. As this is a python frontend for code running on a jvm, it requires type safety and using float instead of int is not an option. I worked around the issue by wrapping the pandas pd.read_csv
in a function that will fill user-defined columns with user-defined fill values before casting them to the required type. Here is what I ended up using:
def custom_read_csv(file_path, custom_dtype = None, fill_values = None, **kwargs):
if custom_dtype is None:
return pd.read_csv(file_path, **kwargs)
else:
assert 'dtype' not in kwargs.keys()
df = pd.read_csv(file_path, dtype = {}, **kwargs)
for col, typ in custom_dtype.items():
if fill_values is None or col not in fill_values.keys():
fill_val = -1
else:
fill_val = fill_values[col]
df[col] = df[col].fillna(fill_val).astype(typ)
return df
You could use .dropna()
if it is OK to drop the rows with the NaN values.
df = df.dropna(subset=['id'])
Alternatively,
use .fillna()
and .astype()
to replace the NaN with values and convert them to int.
I ran into this problem when processing a CSV file with large integers, while some of them were missing (NaN). Using float as the type was not an option, because I might loose the precision.
My solution was to use str as the intermediate type. Then you can convert the string to int as you please later in the code. I replaced NaN with 0, but you could choose any value.
df = pd.read_csv(filename, dtype={'id':str})
df["id"] = df["id"].fillna("0").astype(int)
For the illustration, here is an example how floats may loose the precision:
s = "12345678901234567890"
f = float(s)
i = int(f)
i2 = int(s)
print (f, i, i2)
And the output is:
1.2345678901234567e+19 12345678901234567168 12345678901234567890
In version 0.24.+ pandas has gained the ability to hold integer dtypes with missing values.
Pandas can represent integer data with possibly missing values using arrays.IntegerArray
. This is an extension types implemented within pandas. It is not the default dtype for integers, and will not be inferred; you must explicitly pass the dtype into array()
or Series
:
arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
pd.Series(arr)
0 1
1 2
2 NaN
dtype: Int64
For convert column to nullable integers use:
df['myCol'] = df['myCol'].astype('Int64')
Most solutions here tell you how to use a placeholder integer to represent nulls. That approach isn't helpful if you're uncertain that integer won't show up in your source data though. My method with will format floats without their decimal values and convert nulls to None's. The result is an object datatype that will look like an integer field with null values when loaded into a CSV.
keep_df[col] = keep_df[col].apply(lambda x: None if pandas.isnull(x) else '{0:.0f}'.format(pandas.to_numeric(x)))
As of Pandas 1.0.0 you can now use pandas.NA values. This does not force integer columns with missing values to be floats.
When reading in your data all you have to do is:
df= pd.read_csv("data.csv", dtype={'id': 'Int64'})
Notice the 'Int64' is surrounded by quotes and the I is capitalized. This distinguishes Panda's 'Int64' from numpy's int64.
As a side note, this will also work with .astype()
df['id'] = df['id'].astype('Int64')
Documentation here https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
import pandas as pd
df= pd.read_csv("data.csv")
df['id'] = pd.to_numeric(df['id'])
First remove the rows which contain NaN. Then do Integer conversion on remaining rows. At Last insert the removed rows again. Hope it will work
If you can modify your stored data, use a sentinel value for missing id
. A common use case, inferred by the column name, being that id
is an integer, strictly greater than zero, you could use 0
as a sentinel value so that you can write
if row['id']:
regular_process(row)
else:
special_process(row)
Assuming your DateColumn formatted 3312018.0 should be converted to 03/31/2018 as a string. And, some records are missing or 0.
df['DateColumn'] = df['DateColumn'].astype(int)
df['DateColumn'] = df['DateColumn'].astype(str)
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.zfill(8))
df.loc[df['DateColumn'] == '00000000','DateColumn'] = '01011980'
df['DateColumn'] = pd.to_datetime(df['DateColumn'], format="%m%d%Y")
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.strftime('%m/%d/%Y'))
Source: Stackoverflow.com