Shuffling/permuting a DataFrame in pandas

Question

What's a simple and efficient way to shuffle a dataframe in pandas, by rows or by columns? I.e., how do I write a function shuffle(df, n, axis=0) that takes a dataframe, a number of shuffles n, and an axis (axis=0 is rows, axis=1 is columns) and returns a copy of the dataframe that has been shuffled n times?

Edit: the key is to do this without destroying the row/column labels of the dataframe. If you just shuffle df.index, that loses all that information. I want the resulting df to be the same as the original except with the order of rows or order of columns different.

Edit2: My question was unclear. When I say shuffle the rows, I mean shuffle each row independently. So if you have two columns a and b, I want each row shuffled on its own, so that you don't have the same associations between a and b as you do if you just re-order each row as a whole. Something like:

    for 1...n:
        for each col in df: shuffle column
    return new_df

But hopefully more efficient than naive looping. This does not work for me:

    def shuffle(df, n, axis=0):
        shuffled_df = df.copy()
        for k in range(n):
            shuffled_df.apply(np.random.shuffle(shuffled_df.values), axis=axis)
        return shuffled_df

    df = pandas.DataFrame({'A': range(10), 'B': range(10)})
    shuffle(df, 5)

User · Answer

This might be more useful when you want your index shuffled.

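import random

def shuffle(df):
    index = list(df.index)
    random.shuffle(index)
    # .ix was removed in pandas 1.0; .loc selects rows by label
    df = df.loc[index]
    # reset_index returns a new frame, so the result has to be assigned
    df = df.reset_index(drop=True)
    return df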

It selects the rows in the shuffled index order, then resets the index.

User · Answer

I know the question is about a pandas df, but if the shuffle occurs by row (column order changed, row order unchanged), then the column names no longer matter and it could be interesting to use an np.array instead. In that case np.apply_along_axis() is what you are looking for.

If that is acceptable then this would be helpful. Note that it is easy to switch the axis along which the data is shuffled.

If your pandas data frame is named df, you can:

1. get the values of the dataframe with values = df.values,
2. create an np.array from values,
3. apply the method shown below to shuffle the np.array by row or column,
4. recreate a new (shuffled) pandas df from the shuffled np.array.

Original array:

    import numpy as np

    a = np.array([[10, 11, 12], [20, 21, 22], [30, 31, 32], [40, 41, 42]])
    print(a)
    [[10 11 12]
     [20 21 22]
     [30 31 32]
     [40 41 42]]

Keep row order, shuffle columns within each row:

    print(np.apply_along_axis(np.random.permutation, 1, a))
    [[11 12 10]
     [22 21 20]
     [31 30 32]
     [40 41 42]]

Keep column order, shuffle rows within each column:

    print(np.apply_along_axis(np.random.permutation, 0, a))
    [[40 41 32]
     [20 31 42]
     [10 11 12]
     [30 21 22]]

Original array is unchanged:

    print(a)
    [[10 11 12]
     [20 21 22]
     [30 31 32]
     [40 41 42]]
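The four steps above can be sketched as one round trip from the data frame to NumPy and back. This is only an illustration: the function name shuffle_via_numpy and the seed parameter are made up for the example, and it assumes all columns share a dtype that survives conversion to a single array.

```python
import numpy as np
import pandas as pd


def shuffle_via_numpy(df, axis=0, seed=None):
    """Shuffle values independently along `axis`, keeping the original labels.

    axis=0 shuffles the rows within each column,
    axis=1 shuffles the columns within each row.
    (Hypothetical helper, not part of pandas or NumPy.)
    """
    rng = np.random.default_rng(seed)
    # steps 1 and 2: take the raw values as a NumPy array
    values = df.to_numpy()
    # step 3: permute every 1-D slice along the chosen axis on its own
    shuffled = np.apply_along_axis(rng.permutation, axis, values)
    # step 4: rebuild the frame and reattach the original index and columns
    return pd.DataFrame(shuffled, index=df.index, columns=df.columns)


df = pd.DataFrame({"A": range(10), "B": range(10, 20)})
by_column = shuffle_via_numpy(df, axis=0, seed=0)
by_row = shuffle_via_numpy(df, axis=1, seed=0)
```

Because the labels are reattached in step 4, the result has the same index and columns as the input; only the values move.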

User · Answer

A simple solution in pandas is to use the sample method independently on each column. Use apply to iterate over each column:

    df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6], 'b': [1, 2, 3, 4, 5, 6]})
    df

       a  b
    0  1  1
    1  2  2
    2  3  3
    3  4  4
    4  5  5
    5  6  6

    df.apply(lambda x: x.sample(frac=1).values)

       a  b
    0  4  2
    1  1  6
    2  6  5
    3  5  3
    4  2  4
    5  3  1

You must use .values so that you return a numpy array and not a Series. Otherwise the returned Series aligns to the original DataFrame's index and nothing changes:

    df.apply(lambda x: x.sample(frac=1))

       a  b
    0  1  1
    1  2  2
    2  3  3
    3  4  4
    4  5  5
    5  6  6
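If the shuffle needs to be reproducible, one option is to pass a shared random state to sample. The sketch below is an assumption-laden variant of the approach above: the helper name shuffle_columns_independently and its seed parameter are invented for illustration, and it uses np.random.RandomState because sample has accepted that type across many pandas versions.

```python
import numpy as np
import pandas as pd


def shuffle_columns_independently(df, seed=None):
    """Shuffle each column on its own; hypothetical helper for illustration."""
    # One shared RandomState: successive columns draw from the same stream,
    # so they are not forced onto one permutation (which passing the same
    # integer seed to every column would do), and the result is repeatable.
    state = np.random.RandomState(seed)
    # .to_numpy() drops the index so apply() cannot realign the values
    return df.apply(
        lambda col: col.sample(frac=1, random_state=state).to_numpy()
    )


df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6], "b": [1, 2, 3, 4, 5, 6]})
shuffled = shuffle_columns_independently(df, seed=42)
```

The index and column labels of the result match the original frame; only the values within each column change order.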

User · Answer

I resorted to adapting @root's answer slightly and using the raw values directly. Of course, this means you lose the ability to do fancy indexing, but it works perfectly for just shuffling the data.

    In [1]: import numpy

    In [2]: import pandas

    In [3]: df = pandas.DataFrame({"A": range(10), "B": range(10)})

    In [4]: %timeit df.apply(numpy.random.shuffle, axis=0)
    1000 loops, best of 3: 406 µs per loop

    In [5]: %%timeit
       ...: for view in numpy.rollaxis(df.values, 1):
       ...:     numpy.random.shuffle(view)
       ...:
    10000 loops, best of 3: 22.8 µs per loop

    In [6]: %timeit df.apply(numpy.random.shuffle, axis=1)
    1000 loops, best of 3: 746 µs per loop

    In [7]: %%timeit
       ...: for view in numpy.rollaxis(df.values, 0):
       ...:     numpy.random.shuffle(view)
       ...:
    10000 loops, best of 3: 23.4 µs per loop

Note that numpy.rollaxis brings the specified axis to the first dimension and then lets us iterate over arrays with the remaining dimensions. That is, if we want to shuffle along the first dimension (columns), we need to roll the second dimension to the front, so that we apply the shuffling to views over the first dimension.

    In [8]: numpy.rollaxis(df, 0).shape
    Out[8]: (10, 2)  # we can iterate over 10 arrays with shape (2,) (rows)

    In [9]: numpy.rollaxis(df, 1).shape
    Out[9]: (2, 10)  # we can iterate over 2 arrays with shape (10,) (columns)

Your final function then uses a trick to bring the result in line with the expectation for applying a function to an axis:

    def shuffle(df, n=1, axis=0):
        df = df.copy()
        axis = int(not axis)  # pandas.DataFrame is always 2D
        for _ in range(n):
            for view in numpy.rollaxis(df.values, axis):
                numpy.random.shuffle(view)
        return df
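On newer NumPy versions (1.20 or later, as far as I know) the per-slice loop can be replaced by Generator.permuted, which shuffles every slice along an axis independently. The following is a sketch under that assumption, not the answer's original method; the seed parameter is an addition for reproducibility, and the frame is rebuilt from a copied array instead of shuffling views in place.

```python
import numpy as np
import pandas as pd


def shuffle(df, n=1, axis=0, seed=None):
    """Return a copy of df with values shuffled independently along `axis`.

    axis=0 shuffles within each column, axis=1 within each row.
    """
    rng = np.random.default_rng(seed)
    values = df.to_numpy(copy=True)  # work on a copy, never on df itself
    for _ in range(n):
        # permuted() shuffles each 1-D slice along `axis` on its own,
        # so no axis flip is needed here
        values = rng.permuted(values, axis=axis)
    return pd.DataFrame(values, index=df.index, columns=df.columns)


df = pd.DataFrame({"A": range(10), "B": range(10, 20)})
out_cols = shuffle(df, n=3, axis=0, seed=1)
out_rows = shuffle(df, n=3, axis=1, seed=1)
```

The n parameter is kept only to mirror the signature asked for in the question; a single uniform shuffle is already as random as several in a row.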

User · Answer

Here is a workaround I found if you want to shuffle only a subset of the DataFrame:

    shuffle_to_index = 20
    df = pd.concat([df.iloc[np.random.permutation(range(shuffle_to_index))], df.iloc[shuffle_to_index:]])
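The same idea can be wrapped in a small function. This is a minimal sketch: shuffle_head, stop, and seed are illustrative names, and it assumes stop is no larger than the number of rows.

```python
import numpy as np
import pandas as pd


def shuffle_head(df, stop, seed=None):
    """Shuffle the first `stop` rows as whole rows; leave the rest in place."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(stop)  # random ordering of positions 0..stop-1
    # shuffled head followed by the untouched tail
    return pd.concat([df.iloc[order], df.iloc[stop:]])


df = pd.DataFrame({"A": range(30), "B": range(30)})
result = shuffle_head(df, stop=20, seed=0)
```

Rows move as a unit here, so the values in A and B stay paired and each row keeps its original index label.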

User · Answer

Use numpy's random.permutation function:

    In [1]: df = pd.DataFrame({'A': range(10), 'B': range(10)})

    In [2]: df
    Out[2]:
       A  B
    0  0  0
    1  1  1
    2  2  2
    3  3  3
    4  4  4
    5  5  5
    6  6  6
    7  7  7
    8  8  8
    9  9  9

    In [3]: df.reindex(np.random.permutation(df.index))
    Out[3]:
       A  B
    0  0  0
    5  5  5
    6  6  6
    3  3  3
    8  8  8
    7  7  7
    9  9  9
    1  1  1
    2  2  2
    4  4  4
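The reindex approach also extends to columns. Below is a hedged sketch that permutes the labels of either axis and passes them to reindex; permute_labels and seed are hypothetical names, and the labels on the chosen axis are assumed to be unique, since reindex does not accept duplicates.

```python
import numpy as np
import pandas as pd


def permute_labels(df, axis=0, seed=None):
    """Reorder whole rows (axis=0) or whole columns (axis=1) at random."""
    rng = np.random.default_rng(seed)
    labels = df.index if axis == 0 else df.columns
    # draw a random ordering of positions, then look up the labels
    new_labels = labels[rng.permutation(len(labels))]
    return df.reindex(new_labels, axis=axis)


df = pd.DataFrame({"A": range(10), "B": range(10)})
rows = permute_labels(df, axis=0, seed=0)
cols = permute_labels(df, axis=1, seed=0)
```

This moves rows or columns as a whole, so it keeps the labels attached to their data but does not break the association between columns that the question's second edit asks about.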

User · Answer

From the docs, use sample():

    In [79]: s = pd.Series([0, 1, 2, 3, 4, 5])

    # When no arguments are passed, returns 1 row.
    In [80]: s.sample()
    Out[80]:
    0    0
    dtype: int64

    # One may specify either a number of rows:
    In [81]: s.sample(n=3)
    Out[81]:
    5    5
    2    2
    4    4
    dtype: int64

    # Or a fraction of the rows:
    In [82]: s.sample(frac=0.5)
    Out[82]:
    5    5
    4    4
    1    1
    dtype: int64

User · Answer

    In [16]: def shuffle(df, n=1, axis=0):
        ...:     df = df.copy()
        ...:     for _ in range(n):
        ...:         df.apply(np.random.shuffle, axis=axis)
        ...:     return df
        ...:

    In [17]: df = pd.DataFrame({'A': range(10), 'B': range(10)})

    In [18]: shuffle(df)

    In [19]: df
    Out[19]:
       A  B
    0  8  5
    1  1  7
    2  7  3
    3  6  2
    4  3  4
    5  0  1
    6  9  0
    7  4  6
    8  2  8
    9  5  9

User · Answer

Sampling randomizes, so just sample the entire data frame:

    df.sample(frac=1)
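A short sketch of how this looks for rows and for columns, with a fixed random_state so the result is repeatable. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"A": range(10), "B": range(10)})

# Whole rows in random order; each row keeps its original index label
rows_shuffled = df.sample(frac=1, random_state=0)

# Whole columns in random order; column labels travel with their data
cols_shuffled = df.sample(frac=1, axis=1, random_state=0)

# Optional: renumber the rows 0..n-1 once the old labels are not needed
renumbered = rows_shuffled.reset_index(drop=True)
```

Like reindexing with a permutation, this reorders complete rows or columns and leaves the pairing between columns intact.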

User · Answer

You can use sklearn.utils.shuffle() (requires sklearn 0.16.1 or higher to support pandas data frames):

    # Generate data
    import pandas as pd
    df = pd.DataFrame({'A': range(5), 'B': range(5)})
    print('df: {0}'.format(df))

    # Shuffle pandas data frame
    import sklearn.utils
    df = sklearn.utils.shuffle(df)
    print('\n\ndf: {0}'.format(df))

outputs:

    df:    A  B
    0  0  0
    1  1  1
    2  2  2
    3  3  3
    4  4  4


    df:    A  B
    1  1  1
    0  0  0
    3  3  3
    4  4  4
    2  2  2

Then you can use df.reset_index() to reset the index column, if need be:

    df = df.reset_index(drop=True)
    print('\n\ndf: {0}'.format(df))

outputs:

    df:    A  B
    0  1  1
    1  0  0
    2  4  4
    3  2  2
    4  3  3
