[python] Random state (Pseudo-random number) in Scikit learn

I want to implement a machine learning algorithm in scikit learn, but I don't understand what this parameter random_state does? Why should I use it?

I also could not understand what is a Pseudo-random number.

This question is related to python random scikit-learn

The answer is


If you don't specify the random_state in your code, then every time you run(execute) your code a new random value is generated and the train and test datasets would have different values each time.

However, if a fixed value is assigned like random_state = 42 then no matter how many times you execute your code the result would be the same .i.e, same values in train and test datasets.


If you don't mention the random_state in the code, then whenever you execute your code a new random value is generated and the train and test datasets would have different values each time.

However, if you use a particular value for random_state(random_state = 1 or any other value) everytime the result will be same,i.e, same values in train and test datasets. Refer below code:

import pandas as pd 
from sklearn.model_selection import train_test_split
test_series = pd.Series(range(100))
size30split = train_test_split(test_series,random_state = 1,test_size = .3)
size25split = train_test_split(test_series,random_state = 1,test_size = .25)
common = [element for element in size25split[0] if element in size30split[0]]
print(len(common))

Doesn't matter how many times you run the code, the output will be 70.

70

Try to remove the random_state and run the code.

import pandas as pd 
from sklearn.model_selection import train_test_split
test_series = pd.Series(range(100))
size30split = train_test_split(test_series,test_size = .3)
size25split = train_test_split(test_series,test_size = .25)
common = [element for element in size25split[0] if element in size30split[0]]
print(len(common))

Now here output will be different each time you execute the code.


If there is no randomstate provided the system will use a randomstate that is generated internally. So, when you run the program multiple times you might see different train/test data points and the behavior will be unpredictable. In case, you have an issue with your model you will not be able to recreate it as you do not know the random number that was generated when you ran the program.

If you see the Tree Classifiers - either DT or RF, they try to build a try using an optimal plan. Though most of the times this plan might be the same there could be instances where the tree might be different and so the predictions. When you try to debug your model you may not be able to recreate the same instance for which a Tree was built. So, to avoid all this hassle we use a random_state while building a DecisionTreeClassifier or RandomForestClassifier.

PS: You can go a bit in depth on how the Tree is built in DecisionTree to understand this better.

randomstate is basically used for reproducing your problem the same every time it is run. If you do not use a randomstate in traintestsplit, every time you make the split you might get a different set of train and test data points and will not help you in debugging in case you get an issue.

From Doc:

If int, randomstate is the seed used by the random number generator; If RandomState instance, randomstate is the random number generator; If None, the random number generator is the RandomState instance used by np.random.


sklearn.model_selection.train_test_split(*arrays, **options)[source]

Split arrays or matrices into random train and test subsets

Parameters: ... 
    random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. source: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

'''Regarding the random state, it is used in many randomized algorithms in sklearn to determine the random seed passed to the pseudo-random number generator. Therefore, it does not govern any aspect of the algorithm's behavior. As a consequence, random state values which performed well in the validation set do not correspond to those which would perform well in a new, unseen test set. Indeed, depending on the algorithm, you might see completely different results by just changing the ordering of training samples.''' source: https://stats.stackexchange.com/questions/263999/is-random-state-a-parameter-to-tune


random_state number splits the test and training datasets with a random manner. In addition to what is explained here, it is important to remember that random_state value can have significant effect on the quality of your model (by quality I essentially mean accuracy to predict). For instance, If you take a certain dataset and train a regression model with it, without specifying the random_state value, there is the potential that everytime, you will get a different accuracy result for your trained model on the test data. So it is important to find the best random_state value to provide you with the most accurate model. And then, that number will be used to reproduce your model in another occasion such as another research experiment. To do so, it is possible to split and train the model in a for-loop by assigning random numbers to random_state parameter:

for j in range(1000):

            X_train, X_test, y_train, y_test = train_test_split(X, y , random_state =j,     test_size=0.35)
            lr = LarsCV().fit(X_train, y_train)

            tr_score.append(lr.score(X_train, y_train))
            ts_score.append(lr.score(X_test, y_test))

        J = ts_score.index(np.max(ts_score))

        X_train, X_test, y_train, y_test = train_test_split(X, y , random_state =J, test_size=0.35)
        M = LarsCV().fit(X_train, y_train)
        y_pred = M.predict(X_test)`


Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to random

How can I get a random number in Kotlin? scikit-learn random state in splitting dataset Random number between 0 and 1 in python In python, what is the difference between random.uniform() and random.random()? Generate random colors (RGB) Random state (Pseudo-random number) in Scikit learn How does one generate a random number in Apple's Swift language? How to generate a random string of a fixed length in Go? Generate 'n' unique random numbers within a range What does random.sample() method in python do?

Examples related to scikit-learn

LabelEncoder: TypeError: '>' not supported between instances of 'float' and 'str' UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples scikit-learn random state in splitting dataset LogisticRegression: Unknown label type: 'continuous' using sklearn in python Can anyone explain me StandardScaler? ImportError: No module named model_selection How to split data into 3 sets (train, validation and test)? How to convert a Scikit-learn dataset to a Pandas dataset? Accuracy Score ValueError: Can't Handle mix of binary and continuous target How can I plot a confusion matrix?