Random state (Pseudo-random number) in Scikit learn

Question

I want to implement a machine learning algorithm in scikit-learn, but I don't understand what the parameter random_state does. Why should I use it?

I also could not understand what a pseudo-random number is.

User · Answer

If you don't specify random_state in your code, then every time you run (execute) your code a new random value is generated, and the train and test datasets will have different values each time.

However, if a fixed value is assigned, like random_state = 42, then no matter how many times you execute your code the result will be the same, i.e. the same values in the train and test datasets.
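A minimal sketch of the behaviour described above, assuming a toy array of ten integers (the data and the variable names are illustrative and not part of the original answer):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, purely illustrative.
data = np.arange(10)

# Same integer seed -> same shuffle -> same split on every call and every run.
train_a, test_a = train_test_split(data, test_size=0.3, random_state=42)
train_b, test_b = train_test_split(data, test_size=0.3, random_state=42)

print(np.array_equal(train_a, train_b))  # True
print(np.array_equal(test_a, test_b))    # True
```

Dropping random_state from both calls would normally make the two splits differ from each other and from one run to the next.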

User · Answer

If you don't mention random_state in the code, then whenever you execute your code a new random value is generated, and the train and test datasets will have different values each time.

However, if you use a particular value for random_state (random_state = 1 or any other value), the result will be the same every time, i.e. the same values in the train and test datasets. Refer to the code below:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    test_series = pd.Series(range(100))

    # Same seed, two different test sizes
    size30split = train_test_split(test_series, random_state=1, test_size=.3)
    size25split = train_test_split(test_series, random_state=1, test_size=.25)

    # Elements shared by both training sets
    # (.values because `in` on a Series tests the index, not the values)
    common = [element for element in size25split[0] if element in size30split[0].values]
    print(len(common))

It doesn't matter how many times you run the code, the output will be 70.

    70

Try removing random_state and running the code again:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    test_series = pd.Series(range(100))

    # No seed: each call shuffles differently
    size30split = train_test_split(test_series, test_size=.3)
    size25split = train_test_split(test_series, test_size=.25)

    common = [element for element in size25split[0] if element in size30split[0].values]
    print(len(common))

Now the output will be different each time you execute the code.

User · Answer

The random_state number controls how the data is randomly split into test and training datasets. In addition to what is explained here, it is important to remember that the random_state value can have a significant effect on the quality of your model (by quality I essentially mean prediction accuracy). For instance, if you take a certain dataset and train a regression model on it without specifying the random_state value, you may get a different accuracy result for your trained model on the test data every time.

So it is important to find the best random_state value to provide you with the most accurate model. That number can then be used to reproduce your model on another occasion, such as another research experiment. To do so, it is possible to split and train the model in a for-loop, assigning a different number to the random_state parameter on each iteration:

    import numpy as np
    from sklearn.linear_model import LarsCV
    from sklearn.model_selection import train_test_split

    # X and y are assumed to be defined already (features and target)
    tr_score, ts_score = [], []

    for j in range(1000):
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=j, test_size=0.35)
        lr = LarsCV().fit(X_train, y_train)
        tr_score.append(lr.score(X_train, y_train))
        ts_score.append(lr.score(X_test, y_test))

    # Seed whose split gave the highest test score
    J = int(np.argmax(ts_score))

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=J, test_size=0.35)
    M = LarsCV().fit(X_train, y_train)
    y_pred = M.predict(X_test)

User · Answer

train_test_split splits arrays or matrices into random train and test subsets. That means that every time you run it without specifying random_state, you will get a different result. This is expected behavior. For example:

Run 1:

    >>> a, b = np.arange(10).reshape((5, 2)), range(5)
    >>> train_test_split(a, b)
    [array([[6, 7],
            [8, 9],
            [4, 5]]),
     array([[2, 3],
            [0, 1]]), [3, 4, 2], [1, 0]]

Run 2:

    >>> train_test_split(a, b)
    [array([[8, 9],
            [4, 5],
            [0, 1]]),
     array([[6, 7],
            [2, 3]]), [4, 2, 0], [3, 1]]

It changes. On the other hand, if you use random_state=some_number, then you can guarantee that the output of Run 1 will be equal to the output of Run 2, i.e. your split will always be the same. It doesn't matter what the actual random_state number is: 42, 0, 21, ... The important thing is that every time you use 42, you will always get the same output the first time you make the split. This is useful if you want reproducible results, for example in the documentation, so that everybody can consistently see the same numbers when they run the examples. In practice I would say you should set random_state to some fixed number while you test stuff, but then remove it in production if you really need a random (and not a fixed) split.

Regarding your second question, a pseudo-random number generator is a number generator that generates almost truly random numbers. Why they are not truly random is out of the scope of this question and probably won't matter in your case; you can take a look here for more details.
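To make the pseudo-random part concrete, here is a small sketch using NumPy's RandomState directly, which is the generator type scikit-learn builds from an integer seed. The seeds and the variable names are arbitrary choices for illustration:

```python
import numpy as np

# Two generators seeded identically produce the same "random" sequence,
# because the numbers are computed deterministically from the seed.
rng_a = np.random.RandomState(42)
rng_b = np.random.RandomState(42)
seq_a = rng_a.randint(0, 100, size=5)
seq_b = rng_b.randint(0, 100, size=5)
print(np.array_equal(seq_a, seq_b))  # True

# A different seed starts a different, but equally reproducible, stream.
rng_c = np.random.RandomState(0)
seq_c = rng_c.randint(0, 100, size=5)
print(seq_a, seq_c)
```

The sequence looks random, yet anyone who knows the seed can regenerate it exactly, which is what makes a seeded split repeatable.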

User · Answer

If no random_state is provided, the system will use a random state that is generated internally. So, when you run the program multiple times, you might see different train/test data points and the behavior will be unpredictable. If you have an issue with your model, you will not be able to recreate it, as you do not know the random number that was generated when you ran the program.

If you look at the tree classifiers, either DT or RF, they try to build a tree using an optimal plan. Though most of the time this plan might be the same, there could be instances where the tree is different, and so are the predictions. When you try to debug your model, you may not be able to recreate the same instance for which a tree was built. So, to avoid all this hassle, we use a random_state while building a DecisionTreeClassifier or RandomForestClassifier.

PS: You can go a bit more in depth on how the tree is built in DecisionTree to understand this better.

random_state is basically used for reproducing your problem the same way every time it is run. If you do not use a random_state in train_test_split, every time you make the split you might get a different set of train and test data points, which will not help you in debugging if you run into an issue.

From the docs:

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
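The three options in the quoted documentation can be tried side by side. This is a rough sketch on a toy array; the data, the seed 0, and all variable names are assumptions made for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20)  # toy data

# int: a fresh generator is seeded on every call, so repeated calls agree.
_, test_int_1 = train_test_split(data, random_state=0)
_, test_int_2 = train_test_split(data, random_state=0)
print(np.array_equal(test_int_1, test_int_2))  # True

# RandomState instance: the generator's state advances between calls,
# so consecutive calls usually give different splits, but the whole
# sequence of calls can be replayed from the original seed.
rng = np.random.RandomState(0)
_, test_rng_1 = train_test_split(data, random_state=rng)
_, test_rng_2 = train_test_split(data, random_state=rng)

rng_again = np.random.RandomState(0)
_, test_rng_1_again = train_test_split(data, random_state=rng_again)
print(np.array_equal(test_rng_1, test_rng_1_again))  # True

# None (the default): NumPy's global RandomState is used, so the split
# changes from run to run unless np.random.seed(...) was called first.
_, test_none = train_test_split(data, random_state=None)
```

Passing an int is usually the simplest choice when a single split has to be reproduced; a shared RandomState instance is more natural when several randomized steps should draw from one stream.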

User · Answer

sklearn.model_selection.train_test_split(*arrays, **options) [source]

Split arrays or matrices into random train and test subsets.

Parameters:

    random_state : int, RandomState instance or None, optional (default=None)
        If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

source: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

"Regarding the random state, it is used in many randomized algorithms in sklearn to determine the random seed passed to the pseudo-random number generator. Therefore, it does not govern any aspect of the algorithm's behavior. As a consequence, random state values which performed well in the validation set do not correspond to those which would perform well in a new, unseen test set. Indeed, depending on the algorithm, you might see completely different results by just changing the ordering of training samples."

source: https://stats.stackexchange.com/questions/263999/is-random-state-a-parameter-to-tune
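One way to act on the quoted point is to evaluate over many random splits and report the mean and spread, rather than keeping the seed with the best test score. The following is only a sketch under stated assumptions: the synthetic data from make_regression and the plain LinearRegression model are stand-ins that do not appear in the original answer.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic data, purely illustrative.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# 20 different random 65/35 splits; the seed here only makes the
# experiment repeatable, it is not tuned.
cv = ShuffleSplit(n_splits=20, test_size=0.35, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv)  # R^2 per split

print("mean R^2: %.3f, std: %.3f" % (scores.mean(), scores.std()))
print("best single split: %.3f" % scores.max())
```

The best single split is at least as high as the mean by construction, which is why a score obtained from a hand-picked seed tends to overstate how the model would do on new data.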
