I want to convert a table, represented as a list of lists, into a Pandas DataFrame
. As an extremely simplified example:
a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a)
What is the best way to convert the columns to the appropriate types, in this case columns 2 and 3 into floats? Is there a way to specify the types while converting to DataFrame? Or is it better to create the DataFrame first and then loop through the columns to change the type for each column? Ideally I would like to do this in a dynamic way because there can be hundreds of columns and I don't want to specify exactly which columns are of which type. All I can guarantee is that each columns contains values of the same type.
Here's a chart that summarises some of the most important conversions in pandas.
Conversions to string are trivial .astype(str)
and are not shown in the figure.
Note that "conversions" in this context could either refer to converting text data into their actual data type (hard conversion), or inferring more appropriate data types for data in object columns (soft conversion). To illustrate the difference, take a look at
df = pd.DataFrame({'a': ['1', '2', '3'], 'b': [4, 5, 6]}, dtype=object)
df.dtypes
a object
b object
dtype: object
# Actually converts string to numeric - hard conversion
df.apply(pd.to_numeric).dtypes
a int64
b int64
dtype: object
# Infers better data types for object data - soft conversion
df.infer_objects().dtypes
a object # no change
b int64
dtype: object
# Same as infer_objects, but converts to equivalent ExtensionType
df.convert_dtypes().dtypes
How about this?
a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])
df
Out[16]:
one two three
0 a 1.2 4.2
1 b 70 0.03
2 x 5 0
df.dtypes
Out[17]:
one object
two object
three object
df[['two', 'three']] = df[['two', 'three']].astype(float)
df.dtypes
Out[19]:
one object
two float64
three float64
I thought I had the same problem but actually I have a slight difference that makes the problem easier to solve. For others looking at this question it's worth checking the format of your input list. In my case the numbers are initially floats not strings as in the question:
a = [['a', 1.2, 4.2], ['b', 70, 0.03], ['x', 5, 0]]
but by processing the list too much before creating the dataframe I lose the types and everything becomes a string.
Creating the data frame via a numpy array
df = pd.DataFrame(np.array(a))
df
Out[5]:
0 1 2
0 a 1.2 4.2
1 b 70 0.03
2 x 5 0
df[1].dtype
Out[7]: dtype('O')
gives the same data frame as in the question, where the entries in columns 1 and 2 are considered as strings. However doing
df = pd.DataFrame(a)
df
Out[10]:
0 1 2
0 a 1.2 4.20
1 b 70.0 0.03
2 x 5.0 0.00
df[1].dtype
Out[11]: dtype('float64')
does actually give a data frame with the columns in the correct format
this below code will change datatype of column.
df[['col.name1', 'col.name2'...]] = df[['col.name1', 'col.name2'..]].astype('data_type')
in place of data type you can give your datatype .what do you want like str,float,int etc.
How about creating two dataframes, each with different data types for their columns, and then appending them together?
d1 = pd.DataFrame(columns=[ 'float_column' ], dtype=float)
d1 = d1.append(pd.DataFrame(columns=[ 'string_column' ], dtype=str))
Results
In[8}: d1.dtypes
Out[8]:
float_column float64
string_column object
dtype: object
After the dataframe is created, you can populate it with floating point variables in the 1st column, and strings (or any data type you desire) in the 2nd column.
Starting pandas 1.0.0, we have pandas.DataFrame.convert_dtypes
. You can even control what types to convert!
In [40]: df = pd.DataFrame(
...: {
...: "a": pd.Series([1, 2, 3], dtype=np.dtype("int32")),
...: "b": pd.Series(["x", "y", "z"], dtype=np.dtype("O")),
...: "c": pd.Series([True, False, np.nan], dtype=np.dtype("O")),
...: "d": pd.Series(["h", "i", np.nan], dtype=np.dtype("O")),
...: "e": pd.Series([10, np.nan, 20], dtype=np.dtype("float")),
...: "f": pd.Series([np.nan, 100.5, 200], dtype=np.dtype("float")),
...: }
...: )
In [41]: dff = df.copy()
In [42]: df
Out[42]:
a b c d e f
0 1 x True h 10.0 NaN
1 2 y False i NaN 100.5
2 3 z NaN NaN 20.0 200.0
In [43]: df.dtypes
Out[43]:
a int32
b object
c object
d object
e float64
f float64
dtype: object
In [44]: df = df.convert_dtypes()
In [45]: df.dtypes
Out[45]:
a Int32
b string
c boolean
d string
e Int64
f float64
dtype: object
In [46]: dff = dff.convert_dtypes(convert_boolean = False)
In [47]: dff.dtypes
Out[47]:
a Int32
b string
c object
d string
e Int64
f float64
dtype: object
When I've only needed to specify specific columns, and I want to be explicit, I've used (per DOCS LOCATION):
dataframe = dataframe.astype({'col_name_1':'int','col_name_2':'float64', etc. ...})
So, using the original question, but providing column names to it ...
a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['col_name_1', 'col_name_2', 'col_name_3'])
df = df.astype({'col_name_2':'float64', 'col_name_3':'float64'})
Here is a function that takes as its arguments a DataFrame and a list of columns and coerces all data in the columns to numbers.
# df is the DataFrame, and column_list is a list of columns as strings (e.g ["col1","col2","col3"])
# dependencies: pandas
def coerce_df_columns_to_numeric(df, column_list):
df[column_list] = df[column_list].apply(pd.to_numeric, errors='coerce')
So, for your example:
import pandas as pd
def coerce_df_columns_to_numeric(df, column_list):
df[column_list] = df[column_list].apply(pd.to_numeric, errors='coerce')
a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['col1','col2','col3'])
coerce_df_columns_to_numeric(df, ['col2','col3'])
Source: Stackoverflow.com