I have a large dataframe with 423244 lines. I want to split it into 4. I tried the following code, which gave an error: ValueError: array split does not result in an equal division
for item in np.split(df, 4):
    print(item)
How can I split this dataframe into 4 groups?
I wanted to do the same. I first had problems with the split function, then problems installing pandas 0.15.2, so I went back to my old version and wrote a little function that works very well. I hope this can help!
# input - df: a DataFrame, chunk_size: the number of rows per chunk
# output - a list of DataFrames
# purpose - splits the DataFrame into smaller chunks
def split_dataframe(df, chunk_size=10000):
    chunks = []
    # Ceiling division, so an exact multiple does not produce an empty last chunk
    num_chunks = -(-len(df) // chunk_size)
    for i in range(num_chunks):
        chunks.append(df.iloc[i * chunk_size:(i + 1) * chunk_size])
    return chunks
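If the chunks are only consumed once, the same idea can be written as a generator, so the list of chunks is never built up front. This is a minimal sketch, not part of the answer above: the name iter_chunks is hypothetical, and the 11-row frame is toy data.

```python
import pandas as pd

def iter_chunks(df, chunk_size=10000):
    """Yield consecutive slices of df with at most chunk_size rows each."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]

df = pd.DataFrame({"TEST": range(1, 12)})  # toy data: 11 rows
sizes = [len(chunk) for chunk in iter_chunks(df, chunk_size=3)]
print(sizes)  # [3, 3, 3, 2]
```

Stepping range by chunk_size means an exact multiple of chunk_size never yields an empty trailing chunk.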
I also found that np.array_split did not work with a pandas DataFrame. My solution was to split only the index of the DataFrame and then add a new column with the "group" label:
indexes = np.array_split(df.index, N, axis=0)
for i, index in enumerate(indexes):
    df.loc[index, 'group'] = i
This makes groupby operations very convenient, for instance calculating the mean value of each group:
df.groupby(by='group').mean()
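To see what the labels look like, here is a small worked sketch of the same idea on toy data. It splits row positions rather than the index itself, which is an assumption made here so the example does not depend on the index type; the names toy, labels and group_means are illustrative only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"value": np.arange(10.0)})  # toy data: 10 rows
N = 3

# Split row positions 0..9 into N nearly equal runs, then label each row.
labels = np.empty(len(toy), dtype=int)
for i, positions in enumerate(np.array_split(np.arange(len(toy)), N)):
    labels[positions] = i
toy["group"] = labels

group_means = toy.groupby("group")["value"].mean()
print(labels.tolist())       # [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
print(group_means.tolist())  # [1.5, 5.0, 8.0]
```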
You can use groupby, assuming you have an integer enumerated index:
import math
import numpy as np
import pandas as pd
df = pd.DataFrame(dict(sample=np.arange(99)))
rows_per_subframe = math.ceil(len(df) / 4.)
subframes = [i[1] for i in df.groupby(np.arange(len(df))//rows_per_subframe)]
Note: iterating over a groupby yields (key, dataframe) tuples in which the 2nd element is the dataframe, thus the slightly complicated extraction.
>>> len(subframes), [len(i) for i in subframes]
(4, [25, 25, 25, 24])
Be aware that np.array_split(df, 3) splits the dataframe into 3 sub-dataframes, while the split_dataframe function defined in @elixir's answer, when called as split_dataframe(df, chunk_size=3), splits the dataframe every chunk_size rows.
Example:
With np.array_split:
df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10,11], columns=['TEST'])
df_split = np.array_split(df, 3)
...you get 3 sub-dataframes:
df_split[0] # 1, 2, 3, 4
df_split[1] # 5, 6, 7, 8
df_split[2] # 9, 10, 11
With split_dataframe:
df_split2 = split_dataframe(df, chunk_size=3)
...you get 4 sub-dataframes:
df_split2[0] # 1, 2, 3
df_split2[1] # 4, 5, 6
df_split2[2] # 7, 8, 9
df_split2[3] # 10, 11
Hope I'm right, and that this is useful.
I guess now we can use plain iloc with range for this.
# Ceiling division: rounding down would leave a remainder that becomes a 5th chunk
chunk_size = -(-df.shape[0] // 4)
for start in range(0, df.shape[0], chunk_size):
    df_subset = df.iloc[start:start + chunk_size]
    process_data(df_subset)
    ....
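A fixed chunk size can still give fewer pieces than requested when the frame is very short. If exactly n pieces are needed, one option is to compute the slice boundaries directly so the piece lengths differ by at most one. This is a sketch under that assumption; split_into_n is a hypothetical helper name, and the 99-row frame is toy data.

```python
import pandas as pd

def split_into_n(df, n):
    """Return n consecutive iloc slices whose lengths differ by at most one."""
    base, extra = divmod(len(df), n)
    pieces, start = [], 0
    for i in range(n):
        # The first `extra` pieces take one additional row each.
        stop = start + base + (1 if i < extra else 0)
        pieces.append(df.iloc[start:stop])
        start = stop
    return pieces

df = pd.DataFrame({"sample": range(99)})  # toy data: 99 rows
pieces = split_into_n(df, 4)
print([len(p) for p in pieces])  # [25, 25, 25, 24]
```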
Caution: np.array_split doesn't work with numpy-1.9.0. I checked: it works with 1.8.1.
Error:
Dataframe has no 'size' attribute
You can use a list comprehension to do this in a single line. Note that n here is the number of rows per chunk, not the number of chunks:
n = 4
chunks = [df[i:i+n] for i in range(0,df.shape[0],n)]
Source: Stackoverflow.com