I have a huge CSV with many tables with many rows. I would like to simply split each dataframe into 2 if it contains more than 10 rows.
If true, I would like the first dataframe to contain the first 10 and the rest in the second dataframe.
Is there a convenient function for this? I've looked around but found nothing useful...
i.e. split_dataframe(df, 2(if > 10))
?
I used a List Comprehension to cut a huge DataFrame into blocks of 100'000:
size = 100000
list_of_dfs = [df.loc[i:i+size-1,:] for i in range(0, len(df),size)]
or as generator:
list_of_dfs = (df.loc[i:i+size-1,:] for i in range(0, len(df),size))
def split_and_save_df(df, name, size, output_dir):
"""
Split a df and save each chunk in a different csv file.
Parameters:
df : pandas df to be splitted
name : name to give to the output file
size : chunk size
output_dir : directory where to write the divided df
"""
import os
for i in range(0, df.shape[0],size):
start = i
end = min(i+size-1, df.shape[0])
subset = df.loc[start:end]
output_path = os.path.join(output_dir,f"{name}_{start}_{end}.csv")
print(f"Going to write into {output_path}")
subset.to_csv(output_path)
output_size = os.stat(output_path).st_size
print(f"Wrote {output_size} bytes")
A method based on np.split
:
df = pd.DataFrame({ 'A':[2,4,6,8,10,2,4,6,8,10],
'B':[10,-10,0,20,-10,10,-10,0,20,-10],
'C':[4,12,8,0,0,4,12,8,0,0],
'D':[9,10,0,1,3,np.nan,np.nan,np.nan,np.nan,np.nan]})
listOfDfs = [df.loc[idx] for idx in np.split(df.index,5)]
A small function that uses a modulo could take care of cases where the split is not even (e.g. np.split(df.index,4)
will throw an error).
(Yes, I am aware that the original question was somewhat more specific than this. However, this is supposed to answer the question in the title.)
The method based on list comprehension and groupby
, which stores all the split dataframes in a list variable and can be accessed using the index.
Example:
ans = [pd.DataFrame(y) for x, y in DF.groupby('column_name', as_index=False)]***
ans[0]
ans[0].column_name
If you have a large data frame and need to divide into a variable number of sub data frames rows, like for example each sub dataframe has a max of 4500 rows, this script could help:
max_rows = 4500
dataframes = []
while len(df) > max_rows:
top = df[:max_rows]
dataframes.append(top)
df = df[max_rows:]
else:
dataframes.append(df)
You could then save out these data frames:
for _, frame in enumerate(dataframes):
frame.to_csv(str(_)+'.csv', index=False)
Hope this helps someone!
You can use the DataFrame head and tail methods as syntactic sugar instead of slicing/loc here. I use a split size of 3; for your example use headSize=10
def split(df, headSize) :
hd = df.head(headSize)
tl = df.tail(len(df)-headSize)
return hd, tl
df = pd.DataFrame({ 'A':[2,4,6,8,10,2,4,6,8,10],
'B':[10,-10,0,20,-10,10,-10,0,20,-10],
'C':[4,12,8,0,0,4,12,8,0,0],
'D':[9,10,0,1,3,np.nan,np.nan,np.nan,np.nan,np.nan]})
# Split dataframe into top 3 rows (first) and the rest (second)
first, second = split(df, 3)
There is no specific convenience function.
You'd have to do something like:
first_ten = pd.DataFrame()
rest = pd.DataFrame()
if df.shape[0] > 10: # len(df) > 10 would also work
first_ten = df[:10]
rest = df[10:]
Below is a simple function implementation which splits a DataFrame to chunks and a few code examples:
import pandas as pd
def split_dataframe_to_chunks(df, n):
df_len = len(df)
count = 0
dfs = []
while True:
if count > df_len-1:
break
start = count
count += n
#print("%s : %s" % (start, count))
dfs.append(df.iloc[start : count])
return dfs
# Create a DataFrame with 10 rows
df = pd.DataFrame([i for i in range(10)])
# Split the DataFrame to chunks of maximum size 2
split_df_to_chunks_of_2 = split_dataframe_to_chunks(df, 2)
print([len(i) for i in split_df_to_chunks_of_2])
# prints: [2, 2, 2, 2, 2]
# Split the DataFrame to chunks of maximum size 3
split_df_to_chunks_of_3 = split_dataframe_to_chunks(df, 3)
print([len(i) for i in split_df_to_chunks_of_3])
# prints [3, 3, 3, 1]
Source: Stackoverflow.com