TL;DR: np.random.shuffle(ndarray) can do the job.
So, in your case: np.random.shuffle(DataFrame.values)
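DataFrame, under the hood, uses a NumPy ndarray as its data holder (you can check this in the DataFrame source code). So np.random.shuffle() shuffles that array in place along its first axis, i.e. it reorders the rows, while the index of the DataFrame remains unshuffled. Note that this relies on DataFrame.values being a view of the underlying data, which is typically the case only when all columns share a single dtype; with mixed dtypes, .values is a copy and the DataFrame itself is left unchanged.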
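As a minimal sketch of what "along the first axis" means, the snippet below shuffles a small 2-D array and checks that only the row order changes. The array contents and the fixed seed are assumptions chosen for illustration.

```python
import numpy as np

np.random.seed(0)  # fixed seed only to make the demo repeatable

arr = np.arange(12).reshape(4, 3)  # four rows of three values each
before = arr.copy()

np.random.shuffle(arr)  # shuffles in place and returns None

# Each row survives intact; only the order of the rows can differ.
rows_before = sorted(map(tuple, before.tolist()))
rows_after = sorted(map(tuple, arr.tolist()))
print(rows_before == rows_after)  # True
```

Because the shuffle happens in place, keep a copy first if the original order is still needed.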
There are some points to consider, though.

- sklearn.utils.shuffle(), as user tj89 suggested, accepts random_state along with another option, n_samples, to control the output. You may want that for development purposes.
- sklearn.utils.shuffle() is faster. But it WILL SHUFFLE the axis info of the DataFrame (the index labels move with their rows) along with the ndarray it contains.

Benchmark (total time for 1000 runs) between sklearn.utils.shuffle() and np.random.shuffle():
nd = sklearn.utils.shuffle(nd): 0.10793248389381915 sec (8x faster)
np.random.shuffle(nd): 0.8897626010002568 sec
df = sklearn.utils.shuffle(df): 0.3183923360193148 sec (3x faster)
np.random.shuffle(df.values): 0.9357550159329548 sec
Conclusion: If it is okay for the axis info (the index labels) to be shuffled along with the ndarray, use sklearn.utils.shuffle(). Otherwise, use np.random.shuffle().
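To make the trade-off concrete, here is a small sketch of sklearn.utils.shuffle() on a DataFrame, assuming pandas and scikit-learn are installed. The column name x and the labels a to e are made up for the example.

```python
import pandas as pd
from sklearn.utils import shuffle

# Hypothetical data: one column with string index labels.
df = pd.DataFrame({"x": [10, 20, 30, 40, 50]}, index=list("abcde"))

shuffled = shuffle(df, random_state=0)  # returns a new object; df is untouched
again = shuffle(df, random_state=0)     # same seed gives the same order

print(shuffled.equals(again))  # True: random_state makes the shuffle repeatable

# Each index label is still attached to its original row value.
print((shuffled["x"] == df.loc[shuffled.index, "x"]).all())  # True
```

If the old index is not wanted afterwards, shuffled.reset_index(drop=True) replaces it with a fresh 0..n-1 range.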
import timeit

# Shared setup: a 1000 x 100 float array and a DataFrame built from it.
setup = '''
import numpy as np
import pandas as pd
import sklearn.utils
nd = np.random.random((1000, 100))
df = pd.DataFrame(nd)
'''

# Each statement runs 1000 times; timeit returns the total time in seconds.
print(timeit.timeit('nd = sklearn.utils.shuffle(nd)', setup=setup, number=1000))
print(timeit.timeit('np.random.shuffle(nd)', setup=setup, number=1000))
print(timeit.timeit('df = sklearn.utils.shuffle(df)', setup=setup, number=1000))
print(timeit.timeit('np.random.shuffle(df.values)', setup=setup, number=1000))