I have data saved in a postgreSQL
database. I am querying this data using Python2.7 and turning it into a Pandas DataFrame. However, the last column of this dataframe has a dictionary of values inside it. The DataFrame df
looks like this:
Station ID Pollutants
8809 {"a": "46", "b": "3", "c": "12"}
8810 {"a": "36", "b": "5", "c": "8"}
8811 {"b": "2", "c": "7"}
8812 {"c": "11"}
8813 {"a": "82", "c": "15"}
I need to split this column into separate columns, so that the DataFrame `df2 looks like this:
Station ID a b c
8809 46 3 12
8810 36 5 8
8811 NaN 2 7
8812 NaN NaN 11
8813 82 NaN 15
The major issue I'm having is that the lists are not the same lengths. But all of the lists only contain up to the same 3 values: 'a', 'b', and 'c'. And they always appear in the same order ('a' first, 'b' second, 'c' third).
The following code USED to work and return exactly what I wanted (df2).
objs = [df, pandas.DataFrame(df['Pollutant Levels'].tolist()).iloc[:, :3]]
df2 = pandas.concat(objs, axis=1).drop('Pollutant Levels', axis=1)
print(df2)
I was running this code just last week and it was working fine. But now my code is broken and I get this error from line [4]:
IndexError: out-of-bounds on slice (end)
I made no changes to the code but am now getting the error. I feel this is due to my method not being robust or proper.
Any suggestions or guidance on how to split this column of lists into separate columns would be super appreciated!
EDIT: I think the .tolist()
and .apply methods are not working on my code because it is one Unicode string, i.e.:
#My data format
u{'a': '1', 'b': '2', 'c': '3'}
#and not
{u'a': '1', u'b': '2', u'c': '3'}
The data is imported from the postgreSQL
database in this format. Any help or ideas with this issue? is there a way to convert the Unicode?
This question is related to
python
json
pandas
dataframe
dictionary
One line solution is following:
>>> df = pd.concat([df['Station ID'], df['Pollutants'].apply(pd.Series)], axis=1)
>>> print(df)
Station ID a b c
0 8809 46 3 12
1 8810 36 5 8
2 8811 NaN 2 7
3 8812 NaN NaN 11
4 8813 82 NaN 15
I know the question is quite old, but I got here searching for answers. There is actually a better (and faster) way now of doing this using json_normalize
:
import pandas as pd
df2 = pd.json_normalize(df['Pollutant Levels'])
This avoids costly apply functions...
Merlin's answer is better and super easy, but we don't need a lambda function. The evaluation of dictionary can be safely ignored by either of the following two ways as illustrated below:
Way 1: Two steps
# step 1: convert the `Pollutants` column to Pandas dataframe series
df_pol_ps = data_df['Pollutants'].apply(pd.Series)
df_pol_ps:
a b c
0 46 3 12
1 36 5 8
2 NaN 2 7
3 NaN NaN 11
4 82 NaN 15
# step 2: concat columns `a, b, c` and drop/remove the `Pollutants`
df_final = pd.concat([df, df_pol_ps], axis = 1).drop('Pollutants', axis = 1)
df_final:
StationID a b c
0 8809 46 3 12
1 8810 36 5 8
2 8811 NaN 2 7
3 8812 NaN NaN 11
4 8813 82 NaN 15
Way 2: The above two steps can be combined in one go:
df_final = pd.concat([df, df['Pollutants'].apply(pd.Series)], axis = 1).drop('Pollutants', axis = 1)
df_final:
StationID a b c
0 8809 46 3 12
1 8810 36 5 8
2 8811 NaN 2 7
3 8812 NaN NaN 11
4 8813 82 NaN 15
You can use join
with pop
+ tolist
. Performance is comparable to concat
with drop
+ tolist
, but some may find this syntax cleaner:
res = df.join(pd.DataFrame(df.pop('b').tolist()))
Benchmarking with other methods:
df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})
def joris1(df):
return pd.concat([df.drop('b', axis=1), df['b'].apply(pd.Series)], axis=1)
def joris2(df):
return pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)
def jpp(df):
return df.join(pd.DataFrame(df.pop('b').tolist()))
df = pd.concat([df]*1000, ignore_index=True)
%timeit joris1(df.copy()) # 1.33 s per loop
%timeit joris2(df.copy()) # 7.42 ms per loop
%timeit jpp(df.copy()) # 7.68 ms per loop
dicts
, as per the timing analysis performed by Shijith in this answer:
df.join(pd.DataFrame(df.pop('Pollutants').values.tolist()))
list
or dicts
that are addressed below, such as rows with NaN
, or nested dicts
.pd.json_normalize(df.Pollutants)
is significantly faster than df.Pollutants.apply(pd.Series)
%%timeit
below. For 1M rows, .json_normalize
is 47 times faster than .apply
.dict
column has dict
or str
type.
str
type, they must be converted back to a dict
type, using ast.literal_eval
.pd.json_normalize
to convert the dicts
, with keys
as headers and values
for rows.
record_path
& meta
) for dealing with nested dicts
.pandas.DataFrame.join
to combine the original DataFrame, df
, with the columns created using pd.json_normalize
df.reset_index()
to get an index of integers, before doing the normalize and join.pandas.DataFrame.drop
, to remove the unneeded column of dicts
NaN
, they must be filled with an empty dict
df.Pollutants = df.Pollutants.fillna({i: {} for i in df.index})
'Pollutants'
column is strings, use '{}'
.import pandas as pd
from ast import literal_eval
import numpy as np
data = {'Station ID': [8809, 8810, 8811, 8812, 8813, 8814],
'Pollutants': ['{"a": "46", "b": "3", "c": "12"}', '{"a": "36", "b": "5", "c": "8"}', '{"b": "2", "c": "7"}', '{"c": "11"}', '{"a": "82", "c": "15"}', np.nan]}
df = pd.DataFrame(data)
# display(df)
Station ID Pollutants
0 8809 {"a": "46", "b": "3", "c": "12"}
1 8810 {"a": "36", "b": "5", "c": "8"}
2 8811 {"b": "2", "c": "7"}
3 8812 {"c": "11"}
4 8813 {"a": "82", "c": "15"}
5 8814 NaN
# replace NaN with '{}' if the column is strings, otherwise replace with {}
# df.Pollutants = df.Pollutants.fillna('{}') # if the NaN is in a column of strings
df.Pollutants = df.Pollutants.fillna({i: {} for i in df.index}) # if the column is not strings
# Convert the column of stringified dicts to dicts
# skip this line, if the column contains dicts
df.Pollutants = df.Pollutants.apply(literal_eval)
# reset the index if the index is not unique integers from 0 to n-1
# df.reset_index(inplace=True) # uncomment if needed
# normalize the column of dictionaries and join it to df
df = df.join(pd.json_normalize(df.Pollutants))
# drop Pollutants
df.drop(columns=['Pollutants'], inplace=True)
# display(df)
Station ID a b c
0 8809 46 3 12
1 8810 36 5 8
2 8811 NaN 2 7
3 8812 NaN NaN 11
4 8813 82 NaN 15
5 8814 NaN NaN NaN
%%timeit
# dataframe with 1M rows
dfb = pd.concat([df]*200000).reset_index(drop=True)
%%timeit
dfb.join(pd.json_normalize(dfb.Pollutants))
[out]:
5.44 s ± 32.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
pd.concat([dfb.drop(columns=['Pollutants']), dfb.Pollutants.apply(pd.Series)], axis=1)
[out]:
4min 17s ± 2.44 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
I strongly recommend the method extract the column 'Pollutants':
df_pollutants = pd.DataFrame(df['Pollutants'].values.tolist(), index=df.index)
it's much faster than
df_pollutants = df['Pollutants'].apply(pd.Series)
when the size of df is giant.
How do I split a column of dictionaries into separate columns with pandas?
pd.DataFrame(df['val'].tolist())
is the canonical method for exploding a column of dictionariesHere's your proof using a colorful graph.
Benchmarking code for reference.
Note that I am only timing the explosion since that's the most interesting part of answering this question - other aspects of result construction (such as whether to use pop
or drop
) are tangential to the discussion and can be ignored (it should be noted however that using pop
avoids the followup drop
call, so the final solution is a bit more performant, but we are still listifying the column and passing it to pd.DataFrame
either way).
Additionally, pop
destructively mutates the input DataFrame, making it harder to run in benchmarking code which assumes the input is not changed across test runs.
df['val'].apply(pd.Series)
is extremely slow for large N as pandas constructs Series objects for each row, then proceeds to construct a DataFrame from them. For larger N the performance dips to the order of minutes or hours.
pd.json_normalize(df['val']))
is slower simply because json_normalize
is meant to work with a much more complex input data - particularly deeply nested JSON with multiple record paths and metadata. We have a simple flat dict for which pd.DataFrame
suffices, so use that if your dicts are flat.
Some answers suggest df.pop('val').values.tolist()
or df.pop('val').to_numpy().tolist()
. I don't think it makes much of a difference whether you listify the series or the numpy array. It's one operation less to listify the series directly and really isn't slower so I'd recommend avoiding generating the numpy array in the intermediate step.
Try this: The data returned from SQL has to converted into a Dict.
or could it be "Pollutant Levels"
is now Pollutants'
StationID Pollutants
0 8809 {"a":"46","b":"3","c":"12"}
1 8810 {"a":"36","b":"5","c":"8"}
2 8811 {"b":"2","c":"7"}
3 8812 {"c":"11"}
4 8813 {"a":"82","c":"15"}
df2["Pollutants"] = df2["Pollutants"].apply(lambda x : dict(eval(x)) )
df3 = df2["Pollutants"].apply(pd.Series )
a b c
0 46 3 12
1 36 5 8
2 NaN 2 7
3 NaN NaN 11
4 82 NaN 15
result = pd.concat([df, df3], axis=1).drop('Pollutants', axis=1)
result
StationID a b c
0 8809 46 3 12
1 8810 36 5 8
2 8811 NaN 2 7
3 8812 NaN NaN 11
4 8813 82 NaN 15
I've concatenated those steps in a method, you have to pass only the dataframe and the column which contains the dict to expand:
def expand_dataframe(dw: pd.DataFrame, column_to_expand: str) -> pd.DataFrame:
"""
dw: DataFrame with some column which contain a dict to expand
in columns
column_to_expand: String with column name of dw
"""
import pandas as pd
def convert_to_dict(sequence: str) -> Dict:
import json
s = sequence
json_acceptable_string = s.replace("'", "\"")
d = json.loads(json_acceptable_string)
return d
expanded_dataframe = pd.concat([dw.drop([column_to_expand], axis=1),
dw[column_to_expand]
.apply(convert_to_dict)
.apply(pd.Series)],
axis=1)
return expanded_dataframe
>>> df
Station ID Pollutants
0 8809 {"a": "46", "b": "3", "c": "12"}
1 8810 {"a": "36", "b": "5", "c": "8"}
2 8811 {"b": "2", "c": "7"}
3 8812 {"c": "11"}
4 8813 {"a": "82", "c": "15"}
speed comparison for a large dataset of 10 million rows
>>> df = pd.concat([df]*100000).reset_index(drop=True)
>>> df = pd.concat([df]*20).reset_index(drop=True)
>>> print(df.shape)
(10000000, 2)
def apply_drop(df):
return df.join(df['Pollutants'].apply(pd.Series)).drop('Pollutants', axis=1)
def json_normalise_drop(df):
return df.join(pd.json_normalize(df.Pollutants)).drop('Pollutants', axis=1)
def tolist_drop(df):
return df.join(pd.DataFrame(df['Pollutants'].tolist())).drop('Pollutants', axis=1)
def vlues_tolist_drop(df):
return df.join(pd.DataFrame(df['Pollutants'].values.tolist())).drop('Pollutants', axis=1)
def pop_tolist(df):
return df.join(pd.DataFrame(df.pop('Pollutants').tolist()))
def pop_values_tolist(df):
return df.join(pd.DataFrame(df.pop('Pollutants').values.tolist()))
>>> %timeit apply_drop(df.copy())
1 loop, best of 3: 53min 20s per loop
>>> %timeit json_normalise_drop(df.copy())
1 loop, best of 3: 54.9 s per loop
>>> %timeit tolist_drop(df.copy())
1 loop, best of 3: 6.62 s per loop
>>> %timeit vlues_tolist_drop(df.copy())
1 loop, best of 3: 6.63 s per loop
>>> %timeit pop_tolist(df.copy())
1 loop, best of 3: 5.99 s per loop
>>> %timeit pop_values_tolist(df.copy())
1 loop, best of 3: 5.94 s per loop
+---------------------+-----------+
| apply_drop | 53min 20s |
| json_normalise_drop | 54.9 s |
| tolist_drop | 6.62 s |
| vlues_tolist_drop | 6.63 s |
| pop_tolist | 5.99 s |
| pop_values_tolist | 5.94 s |
+---------------------+-----------+
df.join(pd.DataFrame(df.pop('Pollutants').values.tolist()))
is the fastest
my_df = pd.DataFrame.from_dict(my_dict, orient='index', columns=['my_col'])
.. would have parsed the dict properly (putting each dict key into a separate df column, and key values into df rows), so the dicts would not get squashed into a single column in the first place.
To convert the string to an actual dict, you can do df['Pollutant Levels'].map(eval)
. Afterwards, the solution below can be used to convert the dict to different columns.
Using a small example, you can use .apply(pd.Series)
:
In [2]: df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})
In [3]: df
Out[3]:
a b
0 1 {u'c': 1}
1 2 {u'd': 3}
2 3 {u'c': 5, u'd': 6}
In [4]: df['b'].apply(pd.Series)
Out[4]:
c d
0 1.0 NaN
1 NaN 3.0
2 5.0 6.0
To combine it with the rest of the dataframe, you can concat
the other columns with the above result:
In [7]: pd.concat([df.drop(['b'], axis=1), df['b'].apply(pd.Series)], axis=1)
Out[7]:
a c d
0 1 1.0 NaN
1 2 NaN 3.0
2 3 5.0 6.0
Using your code, this also works if I leave out the iloc
part:
In [15]: pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)
Out[15]:
a c d
0 1 1.0 NaN
1 2 NaN 3.0
2 3 5.0 6.0
df = pd.concat([df['a'], df.b.apply(pd.Series)], axis=1)
Source: Stackoverflow.com