[python] How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn?

I'm working on a sentiment analysis problem; the data looks like this:

label instances
    5    1190
    4     838
    3     239
    1     204
    2     127

So my data is unbalanced, since 1190 instances are labeled with 5. For the classification I'm using scikit-learn's SVC. The problem is that I do not know how to balance my data in the right way in order to accurately compute the precision, recall, accuracy and f1-score for the multiclass case. So I tried the following approaches:

First:

wclf = SVC(kernel='linear', C=1, class_weight={1: 10})
wclf.fit(X, y)
weighted_prediction = wclf.predict(X_test)

print 'Accuracy:', accuracy_score(y_test, weighted_prediction)
print 'F1 score:', f1_score(y_test, weighted_prediction, average='weighted')
print 'Recall:', recall_score(y_test, weighted_prediction, average='weighted')
print 'Precision:', precision_score(y_test, weighted_prediction, average='weighted')
print '\n classification report:\n', classification_report(y_test, weighted_prediction)
print '\n confusion matrix:\n', confusion_matrix(y_test, weighted_prediction)

Second:

auto_wclf = SVC(kernel='linear', C= 1, class_weight='auto')
auto_wclf.fit(X, y)
auto_weighted_prediction = auto_wclf.predict(X_test)

print 'Accuracy:', accuracy_score(y_test, auto_weighted_prediction)

print 'F1 score:', f1_score(y_test, auto_weighted_prediction,
                            average='weighted')

print 'Recall:', recall_score(y_test, auto_weighted_prediction,
                              average='weighted')

print 'Precision:', precision_score(y_test, auto_weighted_prediction,
                                    average='weighted')

print '\n classification report:\n', classification_report(y_test, auto_weighted_prediction)

print '\n confusion matrix:\n', confusion_matrix(y_test, auto_weighted_prediction)

Third:

clf = SVC(kernel='linear', C= 1)
clf.fit(X, y)
prediction = clf.predict(X_test)


from sklearn.metrics import precision_score, \
    recall_score, confusion_matrix, classification_report, \
    accuracy_score, f1_score

print 'Accuracy:', accuracy_score(y_test, prediction)
print 'F1 score:', f1_score(y_test, prediction)
print 'Recall:', recall_score(y_test, prediction)
print 'Precision:', precision_score(y_test, prediction)
print '\n classification report:\n', classification_report(y_test, prediction)
print '\n confusion matrix:\n', confusion_matrix(y_test, prediction)


F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
  sample_weight=sample_weight)
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
  sample_weight=sample_weight)
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1082: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
  sample_weight=sample_weight)
 0.930416613529

However, I'm getting warnings like this:

/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172:
DeprecationWarning: The default `weighted` averaging is deprecated,
and from version 0.18, use of precision, recall or F-score with 
multiclass or multilabel data or pos_label=None will result in an 
exception. Please set an explicit value for `average`, one of (None, 
'micro', 'macro', 'weighted', 'samples'). In cross validation use, for 
instance, scoring="f1_weighted" instead of scoring="f1"

How can I deal correctly with my unbalanced data in order to compute the classifier's metrics in the right way?



I think there is a lot of confusion about which weights are used for what. I am not sure I know precisely what bothers you so I am going to cover different topics, bear with me ;).

Class weights

The weights from the class_weight parameter are used to train the classifier. They are not used in the calculation of any of the metrics you are using: with different class weights, the numbers will be different simply because the classifier is different.

Basically, in every scikit-learn classifier, the class weights are used to tell your model how important a class is. That means that during training, the classifier will make an extra effort to properly classify the classes with high weights.
How they do that is algorithm-specific. If you want details about how it works for SVC and the doc does not make sense to you, feel free to mention it.
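
For instance, here is a minimal sketch of passing class weights to SVC; the dictionary values below are made-up illustrations (not recommendations), and 'balanced' (which replaces the older 'auto' in newer scikit-learn versions) lets the library derive weights from the class frequencies:

from sklearn.svm import SVC

# Hypothetical hand-picked weights: make the rare classes 1 and 2 count more during training.
wclf = SVC(kernel='linear', C=1, class_weight={1: 10, 2: 5, 3: 1, 4: 1, 5: 1})

# Or let scikit-learn derive weights inversely proportional to the class frequencies.
balanced_clf = SVC(kernel='linear', C=1, class_weight='balanced')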

The metrics

Once you have a classifier, you want to know how well it is performing. Here you can use the metrics you mentioned: accuracy, recall_score, f1_score...

Usually when the class distribution is unbalanced, accuracy is considered a poor choice as it gives high scores to models which just predict the most frequent class.

I will not detail all these metrics, but note that, with the exception of accuracy, they are naturally applied at the class level: as you can see in this classification report, they are defined for each class. They rely on concepts such as true positives or false negatives that require defining which class is the positive one.

             precision    recall  f1-score   support

          0       0.65      1.00      0.79        17
          1       0.57      0.75      0.65        16
          2       0.33      0.06      0.10        17
avg / total       0.52      0.60      0.51        50

The warning

F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The 
default `weighted` averaging is deprecated, and from version 0.18, 
use of precision, recall or F-score with multiclass or multilabel data  
or pos_label=None will result in an exception. Please set an explicit 
value for `average`, one of (None, 'micro', 'macro', 'weighted', 
'samples'). In cross validation use, for instance, 
scoring="f1_weighted" instead of scoring="f1".

You get this warning because you are using the f1-score, recall and precision without defining how they should be computed! The question could be rephrased: from the above classification report, how do you output one global number for the f1-score? You could:

  1. Take the plain, unweighted average of the per-class f1-scores. This is called macro averaging.
  2. Compute the f1-score from the global counts of true positives / false negatives, etc. (you sum the numbers of true positives / false negatives over all classes). This is called micro averaging.
  3. Compute a weighted average of the per-class f1-scores. Using 'weighted' in scikit-learn weighs the f1-score of each class by its support: the more elements a class has, the more important its f1-score is in the computation. This is what the avg / total row of the report above shows.

These are 3 of the options available in scikit-learn; the warning is there to say you have to pick one. So you have to specify an average argument for the scoring function.

Which one you choose depends on how you want to measure the performance of the classifier: for instance, macro-averaging does not take class imbalance into account, so the f1-score of class 1 will be just as important as the f1-score of class 5. If you use weighted averaging, however, class 5 gets more importance.
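
As a small sketch, assuming the y_test and prediction arrays from the question's code, the three strategies are just different values of the average argument:

from sklearn.metrics import f1_score

# Same predictions, three different ways of reducing the per-class f1-scores to one number.
print(f1_score(y_test, prediction, average='macro'))     # unweighted mean over classes
print(f1_score(y_test, prediction, average='micro'))     # global counts of TP / FP / FN
print(f1_score(y_test, prediction, average='weighted'))  # mean weighted by class support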

The whole argument specification for these metrics is not super-clear in scikit-learn right now; it will get better in version 0.18 according to the docs. They are removing some non-obvious default behavior and issuing warnings so that developers notice it.

Computing scores

Last thing I want to mention (feel free to skip it if you're aware of it) is that scores are only meaningful if they are computed on data that the classifier has never seen. This is extremely important as any score you get on data that was used in fitting the classifier is completely irrelevant.

Here's a way to do it using StratifiedShuffleSplit, which gives you random splits of your data (after shuffling) that preserve the label distribution.

from sklearn.datasets import make_classification
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn.svm import SVC

# We use a utility to generate artificial classification data.
X, y = make_classification(n_samples=100, n_informative=10, n_classes=3)

# One shuffled, stratified 50/50 split that preserves the label distribution.
sss = StratifiedShuffleSplit(y, n_iter=1, test_size=0.5, random_state=0)
svc = SVC(kernel='linear', C=1)
for train_idx, test_idx in sss:
    X_train, X_test, y_train, y_test = X[train_idx], X[test_idx], y[train_idx], y[test_idx]
    svc.fit(X_train, y_train)
    y_pred = svc.predict(X_test)
    print(f1_score(y_test, y_pred, average="macro"))
    print(precision_score(y_test, y_pred, average="macro"))
    print(recall_score(y_test, y_pred, average="macro"))

Hope this helps.


There are a lot of very detailed answers here, but I don't think you are answering the right questions. As I understand the question, there are two concerns:

  1. How do I score a multiclass problem?
  2. How do I deal with unbalanced data?

1.

You can use most of the scoring functions in scikit-learn with multiclass problems just as with binary problems. For example:

from sklearn.metrics import precision_recall_fscore_support as score

predicted = [1,2,3,4,5,1,2,1,1,4,5] 
y_test = [1,2,3,4,5,1,2,1,1,4,1]

precision, recall, fscore, support = score(y_test, predicted)

print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))

This way you end up with tangible and interpretable numbers for each of the classes.

| Label | Precision | Recall | FScore | Support |
|-------|-----------|--------|--------|---------|
| 1     | 94%       | 83%    | 0.88   | 204     |
| 2     | 71%       | 50%    | 0.54   | 127     |
| ...   | ...       | ...    | ...    | ...     |
| 4     | 80%       | 98%    | 0.89   | 838     |
| 5     | 93%       | 81%    | 0.91   | 1190    |

Then...

2.

... you can tell whether the unbalanced data is even a problem. If the scores for the less represented classes (classes 1 and 2) are lower than for the classes with more training samples (classes 4 and 5), then you know that the unbalanced data is in fact a problem, and you can act accordingly, as described in some of the other answers in this thread. However, if the same class distribution is present in the data you want to predict on, your unbalanced training data is a good representative of the data, and hence the imbalance is a good thing.
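
As a rough sketch of that check, assuming the precision, recall, fscore and support arrays from the snippet above and that the labels are 1-5 (the label order is an assumption; pass labels=[1, 2, 3, 4, 5] to score() to make it explicit):

# Are the minority classes (1 and 2) scored worse than the majority classes (4 and 5)?
for label, p, r, f, s in zip([1, 2, 3, 4, 5], precision, recall, fscore, support):
    print('class {}: precision={:.2f} recall={:.2f} fscore={:.2f} support={}'.format(label, p, r, f, s))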


Posed question

Responding to the question 'what metric should be used for multi-class classification with imbalanced data': the macro-F1-measure. Macro precision and macro recall can also be used, but they are not as easily interpretable as in binary classification, they are already incorporated into the F-measure, and excess metrics complicate method comparison, parameter tuning, and so on.

Micro averaging is sensitive to class imbalance: if your method, for example, works well for the most common labels and totally messes up the others, the micro-averaged metrics will still show good results.

Weighted averaging isn't well suited for imbalanced data either, because it weights by label counts. Moreover, it is hard to interpret and unpopular: for instance, there is no mention of such an averaging in the following very detailed survey, which I strongly recommend looking through:

Sokolova, Marina, and Guy Lapalme. "A systematic analysis of performance measures for classification tasks." Information Processing & Management 45.4 (2009): 427-437.

Application-specific question

However, returning to your task, I'd research 2 topics:

  1. the metrics commonly used for your specific task - this lets you (a) compare your method with others and understand whether you are doing something wrong, and (b) reuse someone else's findings instead of exploring this by yourself;
  2. the cost of the different errors of your method - for example, the use case of your application may rely on 4- and 5-star reviews only; in this case, a good metric should count only these 2 labels.

Commonly used metrics. As far as I can infer from the literature, there are 2 main evaluation metrics:

  1. Accuracy, which is used, e.g. in

Yu, April, and Daryl Chang. "Multiclass Sentiment Prediction using Yelp Business."

(link) - note that the authors work with almost the same distribution of ratings, see Figure 5.

Pang, Bo, and Lillian Lee. "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales." Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2005.

(link)

  2. MSE (or, less often, Mean Absolute Error - MAE) - see, for example,

Lee, Moontae, and R. Grafe. "Multiclass sentiment analysis with restaurant reviews." Final Projects from CS 224N (2010).

(link) - they explore both accuracy and MSE, considering the latter to be better

Pappas, Nikolaos, and Andrei Popescu-Belis. "Explaining the Stars: Weighted Multiple-Instance Learning for Aspect-Based Sentiment Analysis." Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. No. EPFL-CONF-200899. 2014.

(link) - they utilize scikit-learn for evaluation and baseline approaches and state that their code is available; however, I can't find it, so if you need it, write to the authors - the work is pretty new and seems to be written in Python.

Cost of different errors. If you care more about avoiding gross blunders, e.g. assigning 1 star to a 5-star review or something like that, look at MSE; if smaller differences matter, but not as much, try MAE, since it doesn't square the difference; otherwise stay with accuracy.
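
A minimal sketch of those ordinal metrics, assuming y_test and prediction hold the integer star ratings (these variable names come from the question's code, not from the cited papers):

from sklearn.metrics import mean_squared_error, mean_absolute_error

# Treat the 1-5 stars as numbers: confusing 1 with 5 then costs far more than confusing 4 with 5.
print('MSE:', mean_squared_error(y_test, prediction))
print('MAE:', mean_absolute_error(y_test, prediction))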

About approaches, not metrics

Try regression approaches, e.g. SVR, since they generally outperform multiclass classifiers like SVC or OVA SVM.
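
A hedged sketch of that regression framing, assuming X_train, y_train and X_test from an earlier split and integer 1-5 ratings as the target; the rounding step is one possible way to map the real-valued output back to ratings:

import numpy as np
from sklearn.svm import SVR

svr = SVR(kernel='linear', C=1)
svr.fit(X_train, y_train)                                # y_train holds the ratings as numbers

raw = svr.predict(X_test)                                # real-valued scores
rating_pred = np.clip(np.rint(raw), 1, 5).astype(int)    # round and clip back to valid ratings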


First of all, it's a little bit harder to tell from counts alone whether your data is unbalanced or not. For example: is 1 positive observation in 1000 just noise, an error, or a breakthrough in science? You never know.
So it's always better to use all your available knowledge when deciding how to treat the imbalance.

Okay, what if it really is unbalanced?
Once again, look at your data. Sometimes you will find one or two observations multiplied hundreds of times; sometimes it's useful to create such duplicated minority-class observations yourself.
If all the data is clean, the next step is to use class weights in the prediction model.

So what about multiclass metrics?
In my experience, hardly any of your listed metrics is commonly used. There are two main reasons.
First: it's always better to work with probabilities than with hard predictions (because how else could you separate models that predict the same class with confidence 0.9 and 0.6?).
And second: it's much easier to compare your prediction models and build new ones when you rely on only one good metric.
From my experience I could recommend logloss or MSE (mean squared error).
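
A minimal sketch of scoring with probabilities, assuming an SVC as in the question; probability=True is needed so that SVC exposes predict_proba, and log_loss comes from sklearn.metrics:

from sklearn.svm import SVC
from sklearn.metrics import log_loss

clf = SVC(kernel='linear', C=1, probability=True)  # probability=True enables predict_proba
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)                  # one probability per class, per sample
print('log loss:', log_loss(y_test, proba))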

How to fix the sklearn warnings?
Just simply (as yangjie noticed) set the average parameter to one of these values: 'micro' (calculate metrics globally), 'macro' (calculate metrics for each label and take their unweighted mean) or 'weighted' (same as macro, but weighted by each label's support).

f1_score(y_test, prediction, average='weighted')

All your warnings come from calling the metric functions without an explicit average value, which is inappropriate for multiclass prediction.
Good luck and have fun with machine learning!

Edit:
I found another answerer's recommendation to switch to regression approaches (e.g. SVR), with which I cannot agree. As far as I remember, there is not even such a thing as multiclass regression. Yes, there is multilabel regression, which is far different, and yes, in some cases it's possible to switch between regression and classification (if the classes are somehow ordered), but it's pretty rare.

What I would recommend (in the scope of scikit-learn) is to try other very powerful classification tools: gradient boosting, random forest (my favorite), KNeighbors and many more.

After that you can calculate the arithmetic or geometric mean of their predictions, and most of the time you'll get an even better result.

final_prediction = (KNNprediction * RFprediction) ** 0.5
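
As a hedged sketch of that idea, the combination is usually done on predicted probabilities rather than on class labels; the models and parameters below are assumptions for illustration:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

rf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Geometric mean of the two models' class probabilities, then take the class with
# the highest combined score; predict_proba columns follow the sorted class labels.
combined = np.sqrt(rf.predict_proba(X_test) * knn.predict_proba(X_test))
final_prediction = rf.classes_[np.argmax(combined, axis=1)]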
