[python] RandomForestClassifier.fit(): ValueError: could not convert string to float

Given is a simple CSV file:

A,B,C
Hello,Hi,0
Hola,Bueno,1

Obviously the real dataset is far more complex than this, but this one reproduces the error. I'm attempting to build a random forest classifier for it, like so:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

cols = ['A','B','C']
col_types = {'A': str, 'B': str, 'C': int}
test = pd.read_csv('test.csv', dtype=col_types)

train_y = test['C'] == 1
train_x = test[cols]

clf_rf = RandomForestClassifier(n_estimators=50)
clf_rf.fit(train_x, train_y)

But I just get this traceback when invoking fit():

ValueError: could not convert string to float: 'Bueno'

scikit-learn version is 0.16.1.

Tags: python, scikit-learn, random-forest

Answers:


You are getting the ValueError because your input is made of strings. Use CountVectorizer: it converts the text into a sparse matrix, which you can then use to train your ML algorithm.
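
A minimal sketch of that idea on the question's CSV (combining the two text columns into one string per row is just one possible choice, not something this answer specifies):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

test = pd.read_csv('test.csv')

# bag-of-words features from the text columns -> sparse matrix
vec = CountVectorizer()
train_x = vec.fit_transform(test['A'] + ' ' + test['B'])
train_y = test['C'] == 1

clf_rf = RandomForestClassifier(n_estimators=50)
clf_rf.fit(train_x, train_y)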


LabelEncoding worked for me (basically you have to encode your data feature-wise). Here myData is a 2D array of string dtype:

import numpy as np
from sklearn import preprocessing

# read every column as a fixed-width byte string, skipping the header row
myData = np.genfromtxt(filecsv, delimiter=",", dtype="|S20", skip_header=1)

le = preprocessing.LabelEncoder()
for i in range(myData.shape[1]):   # encode each feature column separately
    myData[:, i] = le.fit_transform(myData[:, i])

You can't pass str to your model's fit() method, as mentioned here:

The training input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csc_matrix.

Try transforming your data to float, and give LabelEncoder a try.
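
For example, a minimal sketch of that on the question's DataFrame (column names taken from the example CSV):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
train_x = test[cols].copy()
for col in ['A', 'B']:                 # the string columns
    train_x[col] = le.fit_transform(train_x[col])

clf_rf.fit(train_x, train_y)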


Indeed, a one-hot encoder will work just fine here: convert any string (and numerical categorical) variables you want into 1's and 0's this way, and the random forest should not complain.
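
As a sketch, in newer scikit-learn versions (0.20+) OneHotEncoder accepts string columns directly, so something like the following should work on the question's data:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

test = pd.read_csv('test.csv')

enc = OneHotEncoder(handle_unknown='ignore')
train_x = enc.fit_transform(test[['A', 'B']])   # sparse matrix of 0/1 columns
train_y = test['C'] == 1

clf_rf = RandomForestClassifier(n_estimators=50)
clf_rf.fit(train_x, train_y)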


You cannot pass str values to fit() for this kind of classifier.

For example, if you have a feature column named 'grade' with 3 different grades:

A, B and C,

you have to transform those strings "A", "B", "C" into a matrix with an encoder, like the following:

A = [1,0,0]

B = [0,1,0]

C = [0,0,1]

because strings have no numerical meaning for the classifier.

In scikit-learn, OneHotEncoder and LabelEncoder are available in the preprocessing module. However, OneHotEncoder does not support fit_transform() on strings; "ValueError: could not convert string to float" may happen during the transform.

You can use LabelEncoder to convert the strings to numerical values first. Then you can transform with OneHotEncoder as you wish.
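
A minimal sketch of that two-step chain (written against older scikit-learn; newer versions can one-hot encode strings directly):

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

grades = np.array(['A', 'B', 'C', 'A'])

# step 1: strings -> integers
le = LabelEncoder()
grades_int = le.fit_transform(grades)                 # [0, 1, 2, 0]

# step 2: integers -> one-hot rows
ohe = OneHotEncoder(sparse=False)
grades_onehot = ohe.fit_transform(grades_int.reshape(-1, 1))
# [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]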

In a Pandas DataFrame, I have to encode all the columns of dtype: object. The following code works for me, and I hope this helps you.

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
for column_name in train_data.columns:
    # only the string (object-dtype) columns need encoding
    if train_data[column_name].dtype == object:
        train_data[column_name] = le.fit_transform(train_data[column_name])

Well, there are important differences between how OneHot Encoding and Label Encoding work:

  • Label Encoding will basically switch your String variables to int. In this case, the 1st class found will be coded as 1, the 2nd as 2, ... But this encoding creates an issue.

Let's take the example of a variable Animal = ["Dog", "Cat", "Turtle"].

If you use Label Encoder on it, Animal will be [1, 2, 3]. If you pass it to your machine learning model, it will interpret Dog as being closer to Cat than to Turtle (because the distance between 1 and 2 is smaller than the distance between 1 and 3).

Label encoding is actually excellent when you have an ordinal variable.

For example, if you have a variable Age = ["Child", "Teenager", "Young Adult", "Adult", "Old"],

then using Label Encoding is perfect: Child is closer to Teenager than it is to Young Adult. You have a natural order on your variables.

  • OneHot Encoding (also done by pd.get_dummies) is the best solution when you have no natural order between your variables.

Let's take back the previous example of Animal = ["Dog", "Cat", "Turtle"].

It will create as many variables as there are classes. In my example, it will create 3 binary variables: Dog, Cat and Turtle. Then if you have Animal = "Dog", encoding will make it Dog = 1, Cat = 0, Turtle = 0.

Then you can give this to your model, and it will never interpret Dog as being closer to Cat than to Turtle.

But there are also cons to OneHotEncoding. If you have a categorical variable with 50 different classes,

e.g.: Dog, Cat, Turtle, Fish, Monkey, ...

then it will create 50 binary variables, which can cause complexity issues. In this case, you can create your own classes and manually regroup values,

e.g.: regroup Turtle, Fish, Dolphin and Shark into a single class called Sea Animals, and then apply OneHotEncoding.
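
A minimal sketch contrasting the two encodings on a hypothetical Animal column (using scikit-learn's LabelEncoder and pandas.get_dummies):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

animals = pd.Series(["Dog", "Cat", "Turtle", "Dog"])

# label encoding: one integer per class (classes sorted alphabetically,
# so Cat=0, Dog=1, Turtle=2), which implies an artificial order
print(LabelEncoder().fit_transform(animals))      # [1 0 2 1]

# one-hot encoding: one binary column per class, no implied order
print(pd.get_dummies(animals))                    # columns Cat, Dog, Turtle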


I had a similar issue and found that pandas.get_dummies() solved the problem. Specifically, it splits out columns of categorical data into sets of boolean columns, one new column for each unique value in each input column. In your case, you would replace train_x = test[cols] with:

train_x = pandas.get_dummies(test[cols])

This transforms the train_x DataFrame into the following form, which RandomForestClassifier can accept:

   C  A_Hello  A_Hola  B_Bueno  B_Hi
0  0        1       0        0     1
1  1        0       1        1     0
