Split a large pandas dataframe

Question

I have a large dataframe with 423244 lines. I want to split this into 4. I tried the following code, which gave an error: ValueError: array split does not result in an equal division

    for item in np.split(df, 4):
        print(item)

How do I split this dataframe into 4 groups?

User · Answer

You can use groupby, assuming you have an integer enumerated index:

    import math

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(dict(sample=np.arange(99)))
    rows_per_subframe = math.ceil(len(df) / 4)

    subframes = [i[1] for i in df.groupby(np.arange(len(df)) // rows_per_subframe)]

Note: groupby returns a tuple in which the 2nd element is the dataframe, thus the slightly complicated extraction.

    >>> len(subframes), [len(i) for i in subframes]
    (4, [25, 25, 25, 24])
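
The groupby approach above can be wrapped in a small helper. This is only a sketch: the function name `split_by_groupby` and the string index are made up for illustration, and it assumes a non-empty dataframe. Because the group key is built from row positions, it does not depend on what the index labels look like.

```python
import math

import numpy as np
import pandas as pd


def split_by_groupby(df, n_parts):
    """Split df into at most n_parts consecutive sub-frames (hypothetical helper)."""
    rows_per_part = math.ceil(len(df) / n_parts)
    # Row positions 0..len(df)-1, floor-divided by the part size, act as group keys.
    keys = np.arange(len(df)) // rows_per_part
    # groupby yields (key, sub-frame) pairs; keep only the sub-frames.
    return [part for _, part in df.groupby(keys)]


# A non-integer index, to show that positions rather than labels drive the split.
df = pd.DataFrame({"sample": np.arange(99)}, index=[f"row{i}" for i in range(99)])
parts = split_by_groupby(df, 4)
sizes = [len(p) for p in parts]
```

Concatenating `parts` with `pd.concat` gives back the original rows in their original order.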

User · Answer

I also experienced np.array_split not working with a pandas DataFrame. My solution was to split only the index of the DataFrame and then introduce a new column with the "group" label:

    indexes = np.array_split(df.index, N, axis=0)
    for i, index in enumerate(indexes):
        df.loc[index, 'group'] = i

This makes groupby operations very convenient, for instance calculating the mean value of each group:

    df.groupby(by='group').mean()
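
A self-contained variant of this idea is sketched below. It departs slightly from the answer: it splits integer row positions rather than `df.index`, so it also works when index labels are not unique. The column name `value` and the choice of `n_groups = 4` are assumptions for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": np.arange(10, dtype=float)})
n_groups = 4  # illustrative choice

# Build one integer label per row by splitting the row positions.
labels = np.empty(len(df), dtype=int)
for i, positions in enumerate(np.array_split(np.arange(len(df)), n_groups)):
    labels[positions] = i
df["group"] = labels

# Per-group statistics are then an ordinary groupby.
group_means = df.groupby("group")["value"].mean()
```

With 10 rows and 4 groups the groups hold 3, 3, 2 and 2 rows respectively.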

User · Answer

Caution: np.array_split doesn't work with numpy-1.9.0. I checked it out: it works with 1.8.1.

Error:

    Dataframe has no 'size' attribute
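
One way to sidestep version-specific behaviour like this is to never hand the DataFrame itself to NumPy: split a plain array of row positions and select the rows with iloc. This is a sketch, not a confirmed fix for the error above, and the helper name `split_positions` is made up.

```python
import numpy as np
import pandas as pd


def split_positions(df, n_parts):
    """Split df into n_parts pieces by splitting row positions (hypothetical helper)."""
    # np.array_split only ever sees a plain integer array here.
    position_blocks = np.array_split(np.arange(len(df)), n_parts)
    return [df.iloc[block] for block in position_blocks]


df = pd.DataFrame({"TEST": range(1, 12)})
parts = split_positions(df, 3)
part_sizes = [len(p) for p in parts]
```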

User · Answer

Be aware that np.array_split(df, 3) splits the dataframe into 3 sub-dataframes, while the split_dataframe function defined in elixir's answer, when called as split_dataframe(df, chunk_size=3), splits the dataframe every chunk_size rows.

Example:

With np.array_split:

    df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10,11], columns=['TEST'])
    df_split = np.array_split(df, 3)

...you get 3 sub-dataframes:

    df_split[0]  # 1, 2, 3, 4
    df_split[1]  # 5, 6, 7, 8
    df_split[2]  # 9, 10, 11

With split_dataframe:

    df_split2 = split_dataframe(df, chunk_size=3)

...you get 4 sub-dataframes:

    df_split2[0]  # 1, 2, 3
    df_split2[1]  # 4, 5, 6
    df_split2[2]  # 7, 8, 9
    df_split2[3]  # 10, 11

Hope I'm right, and that this is useful!

User · Answer

I wanted to do the same. I first had problems with the split function, then problems with installing pandas 0.15.2, so I went back to my old version and wrote a little function that works very well. I hope this can help!

    # input - df: a DataFrame, chunk_size: the chunk size
    # output - a list of DataFrames
    # purpose - splits the DataFrame into smaller chunks
    def split_dataframe(df, chunk_size=10000):
        chunks = list()
        # Ceiling division, so that a length that is an exact multiple of
        # chunk_size does not produce an empty trailing chunk.
        num_chunks = (len(df) + chunk_size - 1) // chunk_size
        for i in range(num_chunks):
            chunks.append(df[i*chunk_size:(i+1)*chunk_size])
        return chunks

User · Answer

I guess now we can use plain iloc with range for this.

    chunk_size = int(df.shape[0] / 4)
    for start in range(0, df.shape[0], chunk_size):
        df_subset = df.iloc[start:start + chunk_size]
        process_data(df_subset)
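
One thing to watch with the loop above: when the row count is not a multiple of 4, rounding the chunk size down leaves a short fifth chunk. The sketch below rounds the chunk size up instead, so there are never more than the requested number of chunks. The generator name `iter_chunks` is made up, and a non-empty dataframe is assumed.

```python
import math

import pandas as pd


def iter_chunks(df, n_chunks):
    """Yield at most n_chunks consecutive slices of df (hypothetical helper)."""
    chunk_size = math.ceil(len(df) / n_chunks)  # round up, not down
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]


df = pd.DataFrame({"x": range(10)})

# Rounding down: chunk size 2, so the loop would run 5 times for 10 rows.
floor_count = len(range(0, len(df), len(df) // 4))

# Rounding up: chunk size 3, giving 4 chunks.
ceil_sizes = [len(chunk) for chunk in iter_chunks(df, 4)]
```

For the 423244-row dataframe in the question both versions behave the same, since 423244 is divisible by 4.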

User · Answer

Use np.array_split:

    Docstring:
    Split an array into multiple sub-arrays.

    Please refer to the ``split`` documentation.  The only difference
    between these functions is that ``array_split`` allows
    `indices_or_sections` to be an integer that does *not* equally
    divide the axis.

    In [1]: import pandas as pd

    In [2]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
       ...:                           'foo', 'bar', 'foo', 'foo'],
       ...:                    'B' : ['one', 'one', 'two', 'three',
       ...:                           'two', 'two', 'one', 'three'],
       ...:                    'C' : randn(8), 'D' : randn(8)})

    In [3]: print(df)
         A      B         C         D
    0  foo    one -0.174067 -0.608579
    1  bar    one -0.860386 -1.210518
    2  foo    two  0.614102  1.689837
    3  bar  three -0.284792 -1.071160
    4  foo    two  0.843610  0.803712
    5  bar    two -1.514722  0.870861
    6  foo    one  0.131529 -0.968151
    7  foo  three -1.002946 -0.257468

    In [4]: import numpy as np

    In [5]: np.array_split(df, 3)
    Out[5]:
    [     A    B         C         D
    0  foo  one -0.174067 -0.608579
    1  bar  one -0.860386 -1.210518
    2  foo  two  0.614102  1.689837,
          A      B         C         D
    3  bar  three -0.284792 -1.071160
    4  foo    two  0.843610  0.803712
    5  bar    two -1.514722  0.870861,
          A      B         C         D
    6  foo    one  0.131529 -0.968151
    7  foo  three -1.002946 -0.257468]
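
The piece sizes in the output above (3, 3, 2) follow a simple rule: with l rows and n sections, the first l % n pieces get l // n + 1 rows and the rest get l // n. A small sketch of that arithmetic, cross-checked against NumPy on a plain array; the helper name `array_split_sizes` is made up.

```python
import numpy as np


def array_split_sizes(n_rows, n_parts):
    """Sizes of the pieces np.array_split produces (hypothetical helper)."""
    base, extra = divmod(n_rows, n_parts)
    # The first `extra` pieces are one row longer than the remaining ones.
    return [base + 1] * extra + [base] * (n_parts - extra)


sizes_small = array_split_sizes(8, 3)
sizes_question = array_split_sizes(423244, 4)

# Cross-check against NumPy itself on a plain integer array.
numpy_sizes = [len(a) for a in np.array_split(np.arange(8), 3)]
```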

User · Answer

You can use a list comprehension to do this in a single line. Note that n here is the number of rows per chunk, not the number of chunks:

    n = 4
    chunks = [df[i:i+n] for i in range(0, df.shape[0], n)]
