[python] Split pandas dataframe in two if it has more than 10 rows

I have a huge CSV with many tables with many rows. I would like to simply split each dataframe into 2 if it contains more than 10 rows.

If true, I would like the first dataframe to contain the first 10 and the rest in the second dataframe.

Is there a convenient function for this? I've looked around but found nothing useful...

i.e. split_dataframe(df, 2(if > 10))?

This question is related to python pandas split dataframe

The answer is


I used a List Comprehension to cut a huge DataFrame into blocks of 100'000:

size = 100000
list_of_dfs = [df.loc[i:i+size-1,:] for i in range(0, len(df),size)]

or as generator:

list_of_dfs = (df.loc[i:i+size-1,:] for i in range(0, len(df),size))

def split_and_save_df(df, name, size, output_dir):
    """
    Split a df and save each chunk in a different csv file.

    Parameters:
        df : pandas df to be splitted
        name : name to give to the output file
        size : chunk size
        output_dir : directory where to write the divided df
    """
    import os
    for i in range(0, df.shape[0],size):
        start  = i
        end    = min(i+size-1, df.shape[0]) 
        subset = df.loc[start:end] 
        output_path = os.path.join(output_dir,f"{name}_{start}_{end}.csv")
        print(f"Going to write into {output_path}")
        subset.to_csv(output_path)
        output_size = os.stat(output_path).st_size
        print(f"Wrote {output_size} bytes")

A method based on np.split:

df = pd.DataFrame({    'A':[2,4,6,8,10,2,4,6,8,10],
                       'B':[10,-10,0,20,-10,10,-10,0,20,-10],
                       'C':[4,12,8,0,0,4,12,8,0,0],
                      'D':[9,10,0,1,3,np.nan,np.nan,np.nan,np.nan,np.nan]})

listOfDfs = [df.loc[idx] for idx in np.split(df.index,5)]

A small function that uses a modulo could take care of cases where the split is not even (e.g. np.split(df.index,4) will throw an error).

(Yes, I am aware that the original question was somewhat more specific than this. However, this is supposed to answer the question in the title.)


The method based on list comprehension and groupby, which stores all the split dataframes in a list variable and can be accessed using the index.

Example:

ans = [pd.DataFrame(y) for x, y in DF.groupby('column_name', as_index=False)]***
ans[0]
ans[0].column_name

If you have a large data frame and need to divide into a variable number of sub data frames rows, like for example each sub dataframe has a max of 4500 rows, this script could help:

max_rows = 4500
dataframes = []
while len(df) > max_rows:
    top = df[:max_rows]
    dataframes.append(top)
    df = df[max_rows:]
else:
    dataframes.append(df)

You could then save out these data frames:

for _, frame in enumerate(dataframes):
    frame.to_csv(str(_)+'.csv', index=False)

Hope this helps someone!


You can use the DataFrame head and tail methods as syntactic sugar instead of slicing/loc here. I use a split size of 3; for your example use headSize=10

def split(df, headSize) :
    hd = df.head(headSize)
    tl = df.tail(len(df)-headSize)
    return hd, tl

df = pd.DataFrame({    'A':[2,4,6,8,10,2,4,6,8,10],
                       'B':[10,-10,0,20,-10,10,-10,0,20,-10],
                       'C':[4,12,8,0,0,4,12,8,0,0],
                      'D':[9,10,0,1,3,np.nan,np.nan,np.nan,np.nan,np.nan]})

# Split dataframe into top 3 rows (first) and the rest (second)
first, second = split(df, 3)

There is no specific convenience function.

You'd have to do something like:

first_ten = pd.DataFrame()
rest = pd.DataFrame()

if df.shape[0] > 10: # len(df) > 10 would also work
    first_ten = df[:10]
    rest = df[10:]

Below is a simple function implementation which splits a DataFrame to chunks and a few code examples:

import pandas as pd

def split_dataframe_to_chunks(df, n):
    df_len = len(df)
    count = 0
    dfs = []

    while True:
        if count > df_len-1:
            break

        start = count
        count += n
        #print("%s : %s" % (start, count))
        dfs.append(df.iloc[start : count])
    return dfs


# Create a DataFrame with 10 rows
df = pd.DataFrame([i for i in range(10)])

# Split the DataFrame to chunks of maximum size 2
split_df_to_chunks_of_2 = split_dataframe_to_chunks(df, 2)
print([len(i) for i in split_df_to_chunks_of_2])
# prints: [2, 2, 2, 2, 2]

# Split the DataFrame to chunks of maximum size 3
split_df_to_chunks_of_3 = split_dataframe_to_chunks(df, 3)
print([len(i) for i in split_df_to_chunks_of_3])
# prints [3, 3, 3, 1]

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to pandas

xlrd.biffh.XLRDError: Excel xlsx file; not supported Pandas Merging 101 How to increase image size of pandas.DataFrame.plot in jupyter notebook? Trying to merge 2 dataframes but get ValueError Python Pandas User Warning: Sorting because non-concatenation axis is not aligned How to show all of columns name on pandas dataframe? Pandas/Python: Set value of one column based on value in another column Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Python convert object to float

Examples related to split

Parameter "stratify" from method "train_test_split" (scikit Learn) Pandas split DataFrame by column value How to split large text file in windows? Attribute Error: 'list' object has no attribute 'split' Split function in oracle to comma separated values with automatic sequence How would I get everything before a : in a string Python Split String by delimiter position using oracle SQL JavaScript split String with white space Split a String into an array in Swift? Split pandas dataframe in two if it has more than 10 rows

Examples related to dataframe

Trying to merge 2 dataframes but get ValueError How to show all of columns name on pandas dataframe? Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Display all dataframe columns in a Jupyter Python Notebook How to convert column with string type to int form in pyspark data frame? Display/Print one column from a DataFrame of Series in Pandas Binning column with python pandas Selection with .loc in python Set value to an entire column of a pandas dataframe