I have the following DataFrame:
Col1 Col2 Col3 Type
0 1 2 3 1
1 4 5 6 1
...
20 7 8 9 2
21 10 11 12 2
...
45 13 14 15 3
46 16 17 18 3
...
The DataFrame is read from a CSV file. All rows which have Type 1 are on top, followed by the rows with Type 2, followed by the rows with Type 3, etc.
I would like to shuffle the order of the DataFrame's rows, so that all Types are mixed. A possible result could be:
Col1 Col2 Col3 Type
0 7 8 9 2
1 13 14 15 3
...
20 1 2 3 1
21 10 11 12 2
...
45 4 5 6 1
46 16 17 18 3
...
How can I achieve this?
TL;DR: np.random.shuffle(ndarray) can do the job.
So, in your case:
np.random.shuffle(DataFrame.values)
Under the hood, DataFrame uses NumPy ndarrays to hold its data (you can check this in the DataFrame source code). np.random.shuffle() shuffles a multi-dimensional array in place along its first axis, so the rows are reordered while the index of the DataFrame remains unshuffled. Note that this relies on DataFrame.values returning a view of the underlying data; when it returns a copy (for example, with mixed column dtypes), the DataFrame itself is left unchanged.
There are some points to consider, though.
sklearn.utils.shuffle(), as user tj89 suggested, accepts random_state along with another option to control the output. You may want that for development purposes.
sklearn.utils.shuffle() is faster, but it WILL SHUFFLE the axis info of the DataFrame (the row index labels move together with their rows) along with the ndarray it contains.
Benchmark results for sklearn.utils.shuffle() versus np.random.shuffle():
nd = sklearn.utils.shuffle(nd)
0.10793248389381915 sec. 8x faster
np.random.shuffle(nd)
0.8897626010002568 sec
df = sklearn.utils.shuffle(df)
0.3183923360193148 sec. 3x faster
np.random.shuffle(df.values)
0.9357550159329548 sec
Conclusion: If it is okay for the axis info (the row index) to be shuffled along with the ndarray, use sklearn.utils.shuffle(). Otherwise, use np.random.shuffle().
The code used for the benchmark:
import timeit
setup = '''
import numpy as np
import pandas as pd
import sklearn.utils
nd = np.random.random((1000, 100))
df = pd.DataFrame(nd)
'''
timeit.timeit('nd = sklearn.utils.shuffle(nd)', setup=setup, number=1000)
timeit.timeit('np.random.shuffle(nd)', setup=setup, number=1000)
timeit.timeit('df = sklearn.utils.shuffle(df)', setup=setup, number=1000)
timeit.timeit('np.random.shuffle(df.values)', setup=setup, number=1000)
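To illustrate the "first axis" behaviour this answer relies on, here is a minimal sketch on a plain NumPy array rather than a DataFrame. The array contents and variable names are made up for the illustration; the point is only that whole rows move while the values inside each row stay together.

```python
import numpy as np

# Small illustrative array: 4 rows, 3 columns (values are arbitrary).
arr = np.arange(12).reshape(4, 3)
original_rows = {tuple(row) for row in arr}

# Shuffles in place along axis 0: rows change position,
# but each row keeps its own three values.
np.random.shuffle(arr)

shuffled_rows = {tuple(row) for row in arr}
print(arr)
print(shuffled_rows == original_rows)  # same set of rows, possibly new order
```

The function returns None and modifies its argument, which is why the answer applies it to an existing array instead of assigning the result.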
It is also useful, if you use this for machine learning and always want to separate the data in the same way, to use:
df.sample(n=len(df), random_state=42)
This makes sure that your random choice is always reproducible.
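As a small check of the reproducibility claim, the sketch below calls sample twice with the same random_state on a made-up frame (the column values are borrowed from the question, the rest is assumed) and compares the results.

```python
import pandas as pd

# Toy frame standing in for the real data.
df = pd.DataFrame({
    "Col1": [1, 4, 7, 10, 13, 16],
    "Type": [1, 1, 2, 2, 3, 3],
})

# Same seed, same input: the two shuffles should come out identical.
first = df.sample(n=len(df), random_state=42)
second = df.sample(n=len(df), random_state=42)

same_order = first.equals(second)
# Every original row label appears exactly once in the shuffled frame.
same_rows = sorted(first.index) == sorted(df.index)
print(same_order, same_rows)
```

Omitting random_state gives a different order on each call, which is usually what you want outside of experiments that need to be repeated.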
AFAIK the simplest solution is:
df_shuffled = df.reindex(np.random.permutation(df.index))
You can shuffle the rows of a dataframe by indexing with a shuffled index. For this, you can e.g. use np.random.permutation (but np.random.choice with replace=False is also a possibility):
In [12]: df = pd.read_csv(StringIO(s), sep=r"\s+")
In [13]: df
Out[13]:
Col1 Col2 Col3 Type
0 1 2 3 1
1 4 5 6 1
20 7 8 9 2
21 10 11 12 2
45 13 14 15 3
46 16 17 18 3
In [14]: df.iloc[np.random.permutation(len(df))]
Out[14]:
Col1 Col2 Col3 Type
46 16 17 18 3
45 13 14 15 3
20 7 8 9 2
0 1 2 3 1
1 4 5 6 1
21 10 11 12 2
If you want to keep the index numbered 0, 1, ..., n-1 as in your example, you can simply reset the index: df_shuffled.reset_index(drop=True)
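Putting the two steps together, a self-contained sketch might look like the following. The frame, the seed, and the use of np.random.default_rng are assumptions made for the example; np.random.permutation as used in the session above works the same way.

```python
import numpy as np
import pandas as pd

# Toy frame with non-contiguous labels, like the one in the question.
df = pd.DataFrame(
    {"Col1": [1, 4, 7, 10], "Type": [1, 1, 2, 2]},
    index=[0, 1, 20, 21],
)

rng = np.random.default_rng(7)         # seed chosen arbitrarily
positions = rng.permutation(len(df))   # a random ordering of 0..n-1

shuffled = df.iloc[positions]                  # rows reordered, labels kept
renumbered = shuffled.reset_index(drop=True)   # labels replaced by 0..n-1
print(renumbered)
```

reset_index returns a new DataFrame, so the result has to be assigned to a name if you want to keep it.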
(I don't have enough reputation to comment on the top post, so I hope someone else can do that for me.) A concern was raised about whether the first method:
df.sample(frac=1)
makes a deep copy or just changes the dataframe. I ran the following code:
print(hex(id(df)))
print(hex(id(df.sample(frac=1))))
print(hex(id(df.sample(frac=1).reset_index(drop=True))))
and my results were:
0x1f8a784d400
0x1f8b9d65e10
0x1f8b9d65b70
which means the method is not returning the same object, as was suggested in the last comment. So this method does indeed return a new, shuffled object rather than changing the dataframe in place.
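A complementary check, sketched here on a made-up frame, is to compare the original against a saved copy after sampling. This looks at the contents directly instead of at object ids.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})
before = df.copy()  # snapshot of the original contents and order

shuffled = df.sample(frac=1, random_state=0)

is_new_object = shuffled is not df       # sample hands back a different object
original_unchanged = df.equals(before)   # the original rows were not reordered
print(is_new_object, original_unchanged)
```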
You can simply use sklearn for this:
from sklearn.utils import shuffle
df = shuffle(df)
You can shuffle a pandas DataFrame by taking an array of positions (in this case the index), randomizing its order, and setting that array as the index of the DataFrame. Then sort the DataFrame by its index, and you have your shuffled DataFrame:
import random
import pandas as pd
df = pd.DataFrame({"a":[1,2,3,4],"b":[5,6,7,8]})
index = list(range(df.shape[0]))
random.shuffle(index)
df.set_index([index]).sort_index()
Example output (the order varies between runs):
a b
0 2 6
1 1 5
2 3 7
3 4 8
Put your own DataFrame in place of mine in the code above.
The following could be one way:
dataframe = dataframe.sample(frac=1, random_state=42).reset_index(drop=True)
where
frac=1 means return all rows of the dataframe
random_state=42 means the same shuffled order is produced in each execution
reset_index(drop=True) means reinitialize the index for the randomized dataframe
Here is another way:
df['rnd'] = np.random.rand(len(df))
df = df.sort_values(by='rnd').drop('rnd', axis=1)
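The same random-key idea can also be written so that the temporary column never lands on the original frame. This is only a sketch: the frame, the seed, and the choice of assign with np.random.default_rng are assumptions for the example, not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})
rng = np.random.default_rng(1)  # seed chosen arbitrarily

shuffled = (
    df.assign(rnd=rng.random(len(df)))  # new frame with a random sort key
      .sort_values("rnd")               # order rows by that key
      .drop(columns="rnd")              # discard the key again
)
print(shuffled)
```

Because assign returns a new DataFrame, df itself keeps its original columns and row order.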
Source: Stackoverflow.com