How does the class weight parameter in scikit-learn work

Question

I am having a lot of trouble understanding how the class weight parameter in scikit-learn s Logistic Regression operates   The Situation  I want to use logistic regression to do binary classification on a very unbalanced data set  The classes are labelled 0  negative  and 1  positive  and the observed data is in a ratio of about 19 1 with the majority of samples having negative outcome   First Attempt  Manually Preparing Training Data  I split the data I had into disjoint sets for training and testing  about 80 20   Then I randomly sampled the training data by hand to get training data in different proportions than 19 1  from 2 1 -  16 1   I then trained logistic regression on these different training data subsets and plotted recall    TP  TP FN   as a function of the different training proportions  Of course  the recall was computed on the disjoint TEST samples which had the observed proportions of 19 1  Note  although I trained the different models on different training data  I computed recall for all of them on the same  disjoint  test data   The results were as expected  the recall was about 60  at 2 1 training proportions and fell off rather fast by the time it got to 16 1  There were several proportions 2 1 -  6 1 where the recall was decently above 5    Second Attempt  Grid Search  Next  I wanted to test different regularization parameters and so I used GridSearchCV and made a grid of several values of the C parameter as well as the class weight parameter  To translate my n m proportions of negative positive training samples into the dictionary language of class weight I thought that I just specify several dictionaries as follows     0 0 67  1 0 33    expected 2 1   0 0 75  1 0 25    expected 3 1   0 0 8  1 0 2      expected 4 1   and I also included None and auto   This time the results were totally wacked  All my recalls came out tiny   lt  0 05  for every value of class weight except auto  So I can only assume that my understanding of how to set the class weight dictionary is wrong  Interestingly  the class weight value of  auto  in the grid search was around 59  for all values of C  and I guessed it balances to 1 1   My Questions   How do you properly use class weight to achieve different balances in training data from what you actually give it  Specifically  what dictionary do I pass to class weight to use n m proportions of negative positive training samples  If you pass various class weight dictionaries to GridSearchCV  during cross-validation will it rebalance the training fold data according to the dictionary but use the true given sample proportions for computing my scoring function on the test fold  This is critical since any metric is only useful to me if it comes from data in the observed proportions  What does the auto value of class weight do as far as proportions  I read the documentation and I assume  balances the data inversely proportional to their frequency  just means it makes it 1 1  Is this correct  If not  can someone clarify

User · Answer

The first answer is good for understanding how it works. But I wanted to understand how I should be using it in practice.

SUMMARY

for moderately imbalanced data WITHOUT noise, there is not much of a difference in applying class weights
for moderately imbalanced data WITH noise and strongly imbalanced, it is better to apply class weights
param class_weight="balanced" works decent in the absence of you wanting to optimize manually
with class_weight="balanced" you capture more true events (higher TRUE recall) but also you are more likely to get false alerts (lower TRUE precision)
- as a result, the total % TRUE might be higher than actual because of all the false positives
- AUC might misguide you here if the false alarms are an issue
no need to change decision threshold to the imbalance %, even for strong imbalance, ok to keep 0.5 (or somewhere around that depending on what you need)

NB

The result might differ when using RF or GBM. sklearn does not have class_weight="balanced" for GBM but lightgbm has LGBMClassifier(is_unbalance=False)

CODE

# scikit-learn==0.21.3
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, classification_report
import numpy as np
import pandas as pd

# case: moderate imbalance
X, y = datasets.make_classification(n_samples=50*15, n_features=5, n_informative=2, n_redundant=0, random_state=1, weights=[0.8]) #,flip_y=0.1,class_sep=0.5)
np.mean(y) # 0.2

LogisticRegression(C=1e9).fit(X,y).predict(X).mean() # 0.184
(LogisticRegression(C=1e9).fit(X,y).predict_proba(X)[:,1]>0.5).mean() # 0.184 => same as first
LogisticRegression(C=1e9,class_weight={0:0.5,1:0.5}).fit(X,y).predict(X).mean() # 0.184 => same as first
LogisticRegression(C=1e9,class_weight={0:2,1:8}).fit(X,y).predict(X).mean() # 0.296 => seems to make things worse?
LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict(X).mean() # 0.292 => seems to make things worse?

roc_auc_score(y,LogisticRegression(C=1e9).fit(X,y).predict(X)) # 0.83
roc_auc_score(y,LogisticRegression(C=1e9,class_weight={0:2,1:8}).fit(X,y).predict(X)) # 0.86 => about the same
roc_auc_score(y,LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict(X)) # 0.86 => about the same

# case: strong imbalance
X, y = datasets.make_classification(n_samples=50*15, n_features=5, n_informative=2, n_redundant=0, random_state=1, weights=[0.95])
np.mean(y) # 0.06

LogisticRegression(C=1e9).fit(X,y).predict(X).mean() # 0.02
(LogisticRegression(C=1e9).fit(X,y).predict_proba(X)[:,1]>0.5).mean() # 0.02 => same as first
LogisticRegression(C=1e9,class_weight={0:0.5,1:0.5}).fit(X,y).predict(X).mean() # 0.02 => same as first
LogisticRegression(C=1e9,class_weight={0:1,1:20}).fit(X,y).predict(X).mean() # 0.25 => huh??
LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict(X).mean() # 0.22 => huh??
(LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict_proba(X)[:,1]>0.5).mean() # same as last

roc_auc_score(y,LogisticRegression(C=1e9).fit(X,y).predict(X)) # 0.64
roc_auc_score(y,LogisticRegression(C=1e9,class_weight={0:1,1:20}).fit(X,y).predict(X)) # 0.84 => much better
roc_auc_score(y,LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict(X)) # 0.85 => similar to manual
roc_auc_score(y,(LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict_proba(X)[:,1]>0.5).astype(int)) # same as last

print(classification_report(y,LogisticRegression(C=1e9).fit(X,y).predict(X)))
pd.crosstab(y,LogisticRegression(C=1e9).fit(X,y).predict(X),margins=True)
pd.crosstab(y,LogisticRegression(C=1e9).fit(X,y).predict(X),margins=True,normalize='index') # few prediced TRUE with only 28% TRUE recall and 86% TRUE precision so 6%*28%~=2%

print(classification_report(y,LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict(X)))
pd.crosstab(y,LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict(X),margins=True)
pd.crosstab(y,LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict(X),margins=True,normalize='index') # 88% TRUE recall but also lot of false positives with only 23% TRUE precision, making total predicted % TRUE > actual % TRUE

User · Answer

First off  it might not be good to just go by recall alone  You can simply achieve a recall of 100  by classifying everything as the positive class  I usually suggest using AUC for selecting parameters  and then finding a threshold for the operating point  say a given precision level  that you are interested in   For how class weight works  It penalizes mistakes in samples of class i  with class weight i  instead of 1  So higher class-weight means you want to put more emphasis on a class  From what you say it seems class 0 is 19 times more frequent than class 1  So you should increase the class weight of class 1 relative to class 0  say  0  1  1  9   If the class weight doesn t sum to 1  it will basically change the regularization parameter   For how class weight  auto  works  you can have a look at this discussion  In the dev version you can use class weight  balanced   which is easier to understand  it basically means replicating the smaller class until you have as many samples as in the larger one  but in an implicit way

[python] How does the class_weight parameter in scikit-learn work?

Examples related to python

Examples related to scikit-learn