Shuffle DataFrame rows

Question

I have the following DataFrame:

        Col1  Col2  Col3  Type
    0      1     2     3     1
    1      4     5     6     1
    ...
    20     7     8     9     2
    21    10    11    12     2
    ...
    45    13    14    15     3
    46    16    17    18     3
    ...

The DataFrame is read from a CSV file. All rows which have Type 1 are on top, followed by the rows with Type 2, followed by the rows with Type 3, etc.

I would like to shuffle the order of the DataFrame's rows so that all Types are mixed. A possible result could be:

        Col1  Col2  Col3  Type
    0      7     8     9     2
    1     13    14    15     3
    ...
    20     1     2     3     1
    21    10    11    12     2
    ...
    45     4     5     6     1
    46    16    17    18     3
    ...

How can I achieve this?

User · Answer

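Here is another way:

    import numpy as np

    # attach a random sort key to every row
    df['rnd'] = np.random.rand(len(df))
    # sort by the key, then drop it again
    # (sort_values must return the frame here, so inplace=True is not used)
    df = df.sort_values(by='rnd').drop('rnd', axis=1)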

User · Answer

Shuffle the pandas DataFrame by taking a sample array (in this case the index) and randomizing its order, then set that array as the index of the DataFrame. Now sort the DataFrame according to the index. Here goes your shuffled DataFrame:

    import random
    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})
    # list of row positions 0..n-1
    index = [i for i in range(df.shape[0])]
    # randomize the positions in place
    random.shuffle(index)
    # use them as the index, then sort by it
    df.set_index([index]).sort_index()

Output:

        a   b
    0   2   6
    1   1   5
    2   3   7
    3   4   8

Insert your DataFrame in place of mine in the code above.

User · Answer

The following could be one of the ways:

    dataframe = dataframe.sample(frac=1, random_state=42).reset_index(drop=True)

where

- frac=1 means all rows of the DataFrame,
- random_state=42 means keeping the same order in each execution,
- reset_index(drop=True) means reinitializing the index of the randomized DataFrame.
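A minimal sketch of the reproducibility point, assuming a small stand-in DataFrame whose values are made up for illustration: the same random_state should give the same row order on every call, and the shuffle only reorders rows.

```python
import pandas as pd

# Small stand-in for the DataFrame in the question (values are illustrative).
df = pd.DataFrame({"Col1": [1, 4, 7, 10, 13, 16],
                   "Type": [1, 1, 2, 2, 3, 3]})

# Same seed -> same row order on every call.
first = df.sample(frac=1, random_state=42).reset_index(drop=True)
second = df.sample(frac=1, random_state=42).reset_index(drop=True)
print(first.equals(second))  # True

# The shuffle only reorders rows; no values are lost or duplicated.
print(sorted(first["Col1"]) == sorted(df["Col1"]))  # True
```

Leaving out random_state gives a different order on each run, which is usually what you want outside of tests and experiments that must be repeatable.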

User · Answer

TL;DR: np.random.shuffle(ndarray) can do the job. So, in your case:

    np.random.shuffle(DataFrame.values)

A DataFrame, under the hood, uses a NumPy ndarray as its data holder (you can check this in the DataFrame source code). So np.random.shuffle() shuffles the array along the first axis of a multi-dimensional array, i.e. it shuffles the rows, while the index of the DataFrame remains unshuffled.

There are some points to consider, though:

- The function returns None. If you want to keep a copy of the original object, you have to make it before you pass the object to the function.
- The shuffle reaches the DataFrame only when DataFrame.values exposes the underlying array itself. For a DataFrame with mixed column dtypes, .values returns a copy, and shuffling that copy leaves the DataFrame unchanged.
- sklearn.utils.shuffle(), as user tj89 suggested, accepts a random_state along with another option to control the output. You may want that for development purposes.
- sklearn.utils.shuffle() is faster, but it WILL SHUFFLE the axis info (the index labels) of the DataFrame along with the data it contains.

Benchmark result, sklearn.utils.shuffle() versus np.random.shuffle() (totals for 1000 runs):

    ndarray
        nd = sklearn.utils.shuffle(nd)     0.10793248389381915 sec (8x faster)
        np.random.shuffle(nd)              0.8897626010002568 sec

    DataFrame
        df = sklearn.utils.shuffle(df)     0.3183923360193148 sec (3x faster)
        np.random.shuffle(df.values)       0.9357550159329548 sec

Conclusion: if it is okay for the index labels to be shuffled along with the data, use sklearn.utils.shuffle(). Otherwise, use np.random.shuffle().

Code used:

    import timeit

    setup = '''
    import numpy as np
    import pandas as pd
    import sklearn.utils
    nd = np.random.random((1000, 100))
    df = pd.DataFrame(nd)
    '''

    timeit.timeit('nd = sklearn.utils.shuffle(nd)', setup=setup, number=1000)
    timeit.timeit('np.random.shuffle(nd)', setup=setup, number=1000)
    timeit.timeit('df = sklearn.utils.shuffle(df)', setup=setup, number=1000)
    timeit.timeit('np.random.shuffle(df.values)', setup=setup, number=1000)
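The in-place behaviour described in this answer can be sketched on a plain array. As an assumption made for safety, the sketch copies the values out with to_numpy(copy=True) and builds a new DataFrame rather than shuffling df.values directly, because whether .values is a writable view depends on the column dtypes and the pandas version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})

# Work on an explicit copy so the original DataFrame is left untouched.
arr = df.to_numpy(copy=True)

# np.random.shuffle permutes rows (axis 0) in place and returns None.
result = np.random.shuffle(arr)
print(result)  # None

# Rows stay intact: each shuffled row still pairs a with b = a + 4.
shuffled = pd.DataFrame(arr, columns=df.columns)
print((shuffled["b"] - shuffled["a"]).eq(4).all())  # True
```

The new DataFrame gets a fresh 0..n-1 index, which matches the observation that this route does not carry the original index labels along with the rows.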

User · Answer

What is also useful, if you use this for machine learning and always want to separate out the same data, is:

    df.sample(n=len(df), random_state=42)

This makes sure that your random choice is always reproducible.
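As a sketch of the machine-learning use case mentioned here, the seeded shuffle can feed a repeatable train/test split. The helper name split and its parameters seed and train_fraction are hypothetical, chosen only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"x": range(10), "label": [0, 1] * 5})

def split(frame, seed=42, train_fraction=0.8):
    """Hypothetical helper: shuffle with a fixed seed, then cut in two."""
    shuffled = frame.sample(n=len(frame), random_state=seed)
    cut = int(len(shuffled) * train_fraction)
    return shuffled.iloc[:cut], shuffled.iloc[cut:]

train_a, test_a = split(df)
train_b, test_b = split(df)

# Re-running the split selects exactly the same rows.
print(train_a.equals(train_b) and test_a.equals(test_b))  # True
# Train and test do not overlap.
print(set(train_a.index) & set(test_a.index))  # set()
```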

User · Answer

I don't have enough reputation to comment this on the top post, so I hope someone else can do that for me.

There was a concern raised about whether the first method:

    df.sample(frac=1)

made a deep copy or just changed the DataFrame. I ran the following code:

    print(hex(id(df)))
    print(hex(id(df.sample(frac=1))))
    print(hex(id(df.sample(frac=1).reset_index(drop=True))))

and my results were:

    0x1f8a784d400
    0x1f8b9d65e10
    0x1f8b9d65b70

which means the method is not returning the same object, as was suggested in the last comment. So this method does indeed make a shuffled copy.
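The same check can be sketched while holding on to the results, which may be a little more robust than comparing id() values of temporaries, since CPython can reuse an id once an object has been freed. The DataFrame below is an illustrative stand-in.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})

shuffled = df.sample(frac=1, random_state=0)

# sample returns a new DataFrame object ...
print(shuffled is df)  # False
# ... and the original keeps its row order.
print(df["a"].tolist())  # [1, 2, 3, 4]
```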

User · Answer

You can shuffle the rows of a DataFrame by indexing with a shuffled index. For this, you can e.g. use np.random.permutation (but np.random.choice is also a possibility). Here s is a string holding the table from the question:

    In [12]: df = pd.read_csv(StringIO(s), sep=r"\s+")

    In [13]: df
    Out[13]:
        Col1  Col2  Col3  Type
    0      1     2     3     1
    1      4     5     6     1
    20     7     8     9     2
    21    10    11    12     2
    45    13    14    15     3
    46    16    17    18     3

    In [14]: df.iloc[np.random.permutation(len(df))]
    Out[14]:
        Col1  Col2  Col3  Type
    46    16    17    18     3
    45    13    14    15     3
    20     7     8     9     2
    0      1     2     3     1
    1      4     5     6     1
    21    10    11    12     2

If you want the shuffled result (called df_shuffled here) to have a consecutively numbered index 0, 1, ..., n-1, you can simply reset the index:

    df_shuffled.reset_index(drop=True)
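A self-contained sketch of the positional approach, assuming a cut-down version of the question's table and a seeded NumPy Generator (the seed is only there to make the run repeatable):

```python
import numpy as np
import pandas as pd

# Index labels mimic the gaps in the question's example.
df = pd.DataFrame(
    {"Col1": [1, 4, 7, 10, 13, 16], "Type": [1, 1, 2, 2, 3, 3]},
    index=[0, 1, 20, 21, 45, 46],
)

rng = np.random.default_rng(0)  # seeded Generator, assumed for repeatability
df_shuffled = df.iloc[rng.permutation(len(df))]

# Each row keeps its original label; only the order changes.
print(sorted(df_shuffled.index) == list(df.index))  # True

# Renumber the rows 0..n-1, discarding the old labels.
df_renumbered = df_shuffled.reset_index(drop=True)
print(list(df_renumbered.index))  # [0, 1, 2, 3, 4, 5]
```

Because iloc works on positions rather than labels, this route does not care whether the index labels are unique or sorted.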

User · Answer

You can simply use sklearn for this:

    from sklearn.utils import shuffle

    df = shuffle(df)

User · Answer

AFAIK the simplest solution is:

    df_shuffled = df.reindex(np.random.permutation(df.index))
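One caveat worth sketching: reindex looks rows up by label, so it needs unique index labels. The example below is an illustration under that assumption, with made-up data; it uses a reversed label order for the failing case to keep the run deterministic, and falls back to positional indexing with iloc.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded Generator, assumed for repeatability

# Unique labels: reindex with permuted labels reorders the rows.
unique = pd.DataFrame({"a": [1, 2, 3]}, index=[10, 20, 30])
shuffled = unique.reindex(rng.permutation(unique.index.to_numpy()))
print(sorted(shuffled["a"]))  # [1, 2, 3]

# Duplicate labels: reordering by label is ambiguous, so reindex raises.
dupes = pd.DataFrame({"a": [1, 2, 3]}, index=[10, 10, 30])
try:
    dupes.reindex(dupes.index[::-1])
    reindex_failed = False
except ValueError:
    reindex_failed = True
print(reindex_failed)  # True

# Positional indexing works regardless of the labels.
fallback = dupes.iloc[rng.permutation(len(dupes))]
print(sorted(fallback["a"]))  # [1, 2, 3]
```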

User · Answer

The idiomatic way to do this with pandas is to use the .sample method of your DataFrame to sample all rows without replacement:

    df.sample(frac=1)

The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 means return all rows (in random order).

Note: If you wish to shuffle your DataFrame in-place and reset the index, you could do e.g.

    df = df.sample(frac=1).reset_index(drop=True)

Here, specifying drop=True prevents .reset_index from creating a column containing the old index entries.

Follow-up note: Although it may not look like the above operation is in-place, Python/pandas is smart enough not to do another malloc for the shuffled object. That is, even though the reference object has changed (by which I mean id(df_old) is not the same as id(df_new)), the underlying C object is still the same. To show that this is indeed the case, you could run a simple memory profiler:

    $ python3 -m memory_profiler test.py
    Filename: test.py

    Line #    Mem usage    Increment   Line Contents
    ================================================
         5     68.5 MiB     68.5 MiB   @profile
         6                             def shuffle():
         7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
         8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)
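The effect of drop=True mentioned in the note can be seen in a small sketch, using a made-up three-row DataFrame and a fixed random_state so the run is repeatable:

```python
import pandas as pd

df = pd.DataFrame({"Col1": [1, 4, 7], "Type": [1, 2, 3]})
shuffled = df.sample(frac=1, random_state=0)

# Without drop=True the old index is kept as a new column named "index".
with_old_index = shuffled.reset_index()
print(list(with_old_index.columns))  # ['index', 'Col1', 'Type']

# With drop=True only the original columns remain.
clean = shuffled.reset_index(drop=True)
print(list(clean.columns))  # ['Col1', 'Type']
```

Keeping the old index as a column can be handy when you later need to map shuffled rows back to their original positions.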
