[python] Pandas: convert dtype 'object' to int

I've read an SQL query into Pandas and the values are coming in as dtype 'object', although they are strings, dates and integers. I am able to convert the date 'object' to a Pandas datetime dtype, but I'm getting an error when trying to convert the string and integers.

Here is an example:

>>> import pandas as pd
>>> df = pd.read_sql_query('select * from my_table', conn)
>>> df
    id    date          purchase
 1  abc1  2016-05-22    1
 2  abc2  2016-05-29    0
 3  abc3  2016-05-22    2
 4  abc4  2016-05-22    0

>>> df.dtypes
 id          object
 date        object
 purchase    object
 dtype: object

Converting the df['date'] to a datetime works:

>>> pd.to_datetime(df['date'])
 1  2016-05-22
 2  2016-05-29
 3  2016-05-22
 4  2016-05-22
 Name: date, dtype: datetime64[ns] 

But I get an error when trying to convert the df['purchase'] to an integer:

>>> df['purchase'].astype(int)
 ....
 pandas/lib.pyx in pandas.lib.astype_intsafe (pandas/lib.c:16667)()
 pandas/src/util.pxd in util.set_value_at (pandas/lib.c:67540)()

 TypeError: long() argument must be a string or a number, not 'java.lang.Long'

NOTE: I get a similar error when I tried .astype('float')

And when trying to convert to a string, nothing seems to happen.

>>> df['id'].apply(str)
 1 abc1
 2 abc2
 3 abc3
 4 abc4
 Name: id, dtype: object

This question is related to python pandas numpy

The answer is


Cannot comment so posting this as an answer, which is somewhat in between @piRSquared/@cyril's solution and @cs95's:

As noted by @cs95, if your data contains NaNs or Nones, converting to string type will throw an error when trying to convert to int afterwards.

However, if your data consists of (numerical) strings, using convert_dtypes will convert it to string type unless you use pd.to_numeric as suggested by @cs95 (potentially combined with df.apply()).

In the case that your data consists only of numerical strings (including NaNs or Nones but without any non-numeric "junk"), a possibly simpler alternative would be to convert first to float and then to one of the nullable-integer extension dtypes provided by pandas (already present in version 0.24) (see also this answer):

df['purchase'].astype(float).astype('Int64')

Note that there has been recent discussion on this on github (currently an -unresolved- closed issue though) and that in the case of very long 64-bit integers you may have to convert explicitly to float128 to avoid approximations during the conversions.


It's simple

pd.factorize(df.purchase)[0]

Example:

labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])`
labels
# array([0, 0, 1, 2, 0])
uniques
# array(['b', 'a', 'c'], dtype=object)

My train data contains three features are object after applying astype it converts the object into numeric but before that, you need to perform some preprocessing steps:

train.dtypes

C12       object
C13       object
C14       Object

train['C14'] = train.C14.astype(int)

train.dtypes

C12       object
C13       object
C14       int32

Follow these steps:

1.clean your file -> open your datafile in csv format and see that there is "?" in place of empty places and delete all of them.

2.drop the rows containing missing values e.g.:

df.dropna(subset=["normalized-losses"], axis = 0 , inplace= True)

3.use astype now for conversion

df["normalized-losses"]=df["normalized-losses"].astype(int)

Note: If still finding erros in your program then again inspect your csv file, open it in excel to find whether is there an "?" in your required column, then delete it and save file and go back and run your program.

comment success! if it works. :)


In my case, I had a df with mixed data:

df:
                     0   1   2    ...                  242                  243                  244
0   2020-04-22T04:00:00Z   0   0  ...          3,094,409.5         13,220,425.7          5,449,201.1
1   2020-04-22T06:00:00Z   0   0  ...          3,716,941.5          8,452,012.9          6,541,599.9
....

The floats are actually objects, but I need them to be real floats.

To fix it, referencing @AMC's comment above:

def coerce_to_float(val):
    try:
       return float(val)
    except ValueError:
       return val

df = df.applymap(lambda x: coerce_to_float(x))

This was my data

## list of columns 
l1 = ['PM2.5', 'PM10', 'TEMP', 'BP', ' RH', 'WS','CO', 'O3', 'Nox', 'SO2'] 

for i in l1:
 for j in range(0, 8431): #rows = 8431
   df[i][j] = int(df[i][j])

I recommend you to use this only with small data. This code has complexity of O(n^2).


pandas >= 1.0

convert_dtypes

The (self) accepted answer doesn't take into consideration the possibility of NaNs in object columns.

df = pd.DataFrame({
     'a': [1, 2, np.nan], 
     'b': [True, False, np.nan]}, dtype=object) 
df                                                                         

     a      b
0    1   True
1    2  False
2  NaN    NaN

df['a'].astype(str).astype(int) # raises ValueError

This chokes because the NaN is converted to a string "nan", and further attempts to coerce to integer will fail. To avoid this issue, we can soft-convert columns to their corresponding nullable type using convert_dtypes:

df.convert_dtypes()                                                        

      a      b
0     1   True
1     2  False
2  <NA>   <NA>

df.convert_dtypes().dtypes                                                 

a      Int64
b    boolean
dtype: object

If your data has junk text mixed in with your ints, you can use pd.to_numeric as an initial step:

s = pd.Series(['1', '2', '...'])
s.convert_dtypes()  # converts to string, which is not what we want

0      1
1      2
2    ...
dtype: string 

# coerces non-numeric junk to NaNs
pd.to_numeric(s, errors='coerce')

0    1.0
1    2.0
2    NaN
dtype: float64

# one final `convert_dtypes` call to convert to nullable int
pd.to_numeric(s, errors='coerce').convert_dtypes() 

0       1
1       2
2    <NA>
dtype: Int64

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to pandas

xlrd.biffh.XLRDError: Excel xlsx file; not supported Pandas Merging 101 How to increase image size of pandas.DataFrame.plot in jupyter notebook? Trying to merge 2 dataframes but get ValueError Python Pandas User Warning: Sorting because non-concatenation axis is not aligned How to show all of columns name on pandas dataframe? Pandas/Python: Set value of one column based on value in another column Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Python convert object to float

Examples related to numpy

Unable to allocate array with shape and data type How to fix 'Object arrays cannot be loaded when allow_pickle=False' for imdb.load_data() function? Numpy, multiply array with scalar TypeError: only integer scalar arrays can be converted to a scalar index with 1D numpy indices array Could not install packages due to a "Environment error :[error 13]: permission denied : 'usr/local/bin/f2py'" Pytorch tensor to numpy array Numpy Resize/Rescale Image what does numpy ndarray shape do? How to round a numpy array? numpy array TypeError: only integer scalar arrays can be converted to a scalar index