Save classifier to disk in scikit-learn

Question

How do I save a trained Naive Bayes classifier to disk and use it to predict data?

I have the following sample program from the scikit-learn website:

    from sklearn import datasets
    iris = datasets.load_iris()
    from sklearn.naive_bayes import GaussianNB
    gnb = GaussianNB()
    y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
    print("Number of mislabeled points : %d" % (iris.target != y_pred).sum())

User · Answer

sklearn.externals.joblib has been deprecated since 0.21 and will be removed in v0.23:

    /usr/local/lib/python3.7/site-packages/sklearn/externals/joblib/__init__.py:15:
    FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will
    be removed in 0.23. Please import this functionality directly from
    joblib, which can be installed with: pip install joblib. If this
    warning is raised when loading pickled models, you may need to
    re-serialize those models with scikit-learn 0.21+.
      warnings.warn(msg, category=FutureWarning)

Therefore, you need to install joblib:

    pip install joblib

and finally write the model to disk:

    import joblib
    from sklearn.datasets import load_digits
    from sklearn.linear_model import SGDClassifier

    digits = load_digits()
    clf = SGDClassifier().fit(digits.data, digits.target)

    with open('myClassifier.joblib.pkl', 'wb') as f:
        joblib.dump(clf, f, compress=9)

Now, in order to read the dumped file, all you need to run is:

    with open('myClassifier.joblib.pkl', 'rb') as f:
        my_clf = joblib.load(f)

User · Answer

sklearn estimators implement methods to make it easy for you to save the relevant trained properties of an estimator. Some estimators implement __getstate__ methods themselves, but others, like the GMM, just use the base implementation, which simply saves the object's inner dictionary:

    def __getstate__(self):
        try:
            state = super(BaseEstimator, self).__getstate__()
        except AttributeError:
            state = self.__dict__.copy()

        if type(self).__module__.startswith('sklearn.'):
            return dict(state.items(), _sklearn_version=__version__)
        else:
            return state

The recommended method to save your model to disk is to use the pickle module:

    import pickle
    from sklearn import datasets
    from sklearn.svm import SVC

    iris = datasets.load_iris()
    X = iris.data[:100, :2]
    y = iris.target[:100]
    model = SVC()
    model.fit(X, y)

    with open('mymodel', 'wb') as f:
        pickle.dump(model, f)

However, you should save additional data so you can retrain your model in the future, or suffer dire consequences (such as being locked into an old version of sklearn).

From the documentation:

> In order to rebuild a similar model with future versions of
> scikit-learn, additional metadata should be saved along the pickled
> model:
>
> - The training data, e.g. a reference to an immutable snapshot
> - The python source code used to generate the model
> - The versions of scikit-learn and its dependencies
> - The cross validation score obtained on the training data

This is especially true for Ensemble estimators that rely on the tree.pyx module written in Cython (such as IsolationForest), since it creates a coupling to the implementation, which is not guaranteed to be stable between versions of sklearn. It has seen backwards-incompatible changes in the past.

If your models become very large and loading becomes a nuisance, you can also use the more efficient joblib. From the documentation:

> In the specific case of the scikit, it may be more interesting to use
> joblib's replacement of pickle (joblib.dump & joblib.load), which is
> more efficient on objects that carry large numpy arrays internally as
> is often the case for fitted scikit-learn estimators, but can only
> pickle to the disk and not to a string.
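The metadata listed in the documentation quote can be kept in a small file next to the pickle. The following is only a sketch of one way to do that: the helper name save_with_metadata, the file names, the JSON layout, and the training-data description string are all assumptions made for illustration, not something scikit-learn prescribes.

```python
import json
import pickle
import platform
import tempfile
from pathlib import Path

import sklearn
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


def save_with_metadata(model, X, y, directory):
    """Pickle `model` and write a JSON sidecar describing how it was built.

    Hypothetical helper: the file names and metadata keys are illustrative.
    """
    directory = Path(directory)
    model_path = directory / "mymodel.pkl"
    meta_path = directory / "mymodel.meta.json"

    # Cross-validate on clones first, then fit the model that gets saved.
    cv_scores = cross_val_score(model, X, y, cv=5)
    model.fit(X, y)
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    metadata = {
        "sklearn_version": sklearn.__version__,
        "python_version": platform.python_version(),
        # Placeholder description; in practice, point at an immutable snapshot.
        "training_data": "iris, first 100 rows, first 2 features",
        "cv_mean_accuracy": float(cv_scores.mean()),
    }
    with open(meta_path, "w") as f:
        json.dump(metadata, f, indent=2)
    return model_path, meta_path


iris = datasets.load_iris()
X, y = iris.data[:100, :2], iris.target[:100]
out_dir = tempfile.mkdtemp()
model_path, meta_path = save_with_metadata(SVC(), X, y, out_dir)

# Later: read the sidecar before unpickling to check version compatibility.
with open(meta_path) as f:
    saved_meta = json.load(f)
```

Comparing saved_meta["sklearn_version"] with the installed version before loading gives an early signal that the model may need to be retrained rather than unpickled.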

User · Answer

In many cases, particularly with text classification, it is not enough just to store the classifier: you'll need to store the vectorizer as well so that you can vectorize your input in the future.

    import pickle
    with open('model.pkl', 'wb') as fout:
        pickle.dump((vectorizer, clf), fout)

Future use case:

    with open('model.pkl', 'rb') as fin:
        vectorizer, clf = pickle.load(fin)

    X_new = vectorizer.transform(new_samples)
    X_new_preds = clf.predict(X_new)

Before dumping the vectorizer, one can delete the stop_words_ property of the vectorizer with:

    vectorizer.stop_words_ = None

to make dumping more efficient. Also, if your classifier's parameters are sparse (as in most text classification examples), you can convert the parameters from dense to sparse, which will make a huge difference in terms of memory consumption, loading and dumping. Sparsify the model with:

    clf.sparsify()

This works automatically for SGDClassifier, but if you know your model is sparse (lots of zeros in clf.coef_), you can manually convert clf.coef_ into a CSR scipy sparse matrix with:

    clf.coef_ = scipy.sparse.csr_matrix(clf.coef_)

and then you can store it more efficiently.
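Put together, the round trip might look like the sketch below. The toy corpus, the labels, the choice of TfidfVectorizer, and the temporary file location are assumptions made up for illustration; the answer itself does not say which vectorizer or data it had in mind.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

# Toy corpus, invented for this example (1 = spam, 0 = not spam).
train_texts = [
    "cheap pills buy now",
    "limited offer buy cheap",
    "meeting agenda for monday",
    "project status and agenda",
]
train_labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
clf = SGDClassifier(random_state=0).fit(X_train, train_labels)

# Store coef_ as a scipy sparse matrix before dumping.
clf.sparsify()

# Save the vectorizer and the classifier together as one tuple.
model_file = Path(tempfile.mkdtemp()) / "model.pkl"
with open(model_file, "wb") as fout:
    pickle.dump((vectorizer, clf), fout)

# Later: load both and apply them to unseen text.
with open(model_file, "rb") as fin:
    vectorizer2, clf2 = pickle.load(fin)

new_samples = ["buy cheap pills", "agenda for the meeting"]
X_new = vectorizer2.transform(new_samples)
new_preds = clf2.predict(X_new)
```

Saving both objects in a single file keeps the vocabulary and the coefficients in sync, so the loaded classifier always sees features in the order it was trained on.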

User · Answer

What you are looking for is called model persistence in sklearn's terminology, and it is documented in the introduction and model persistence sections of the documentation.

So you have initialized your classifier and trained it for a long time with:

    clf = some.classifier()
    clf.fit(X, y)

After this you have two options:

1) Using pickle

    import pickle
    # now you can save it to a file
    with open('filename.pkl', 'wb') as f:
        pickle.dump(clf, f)

    # and later you can load it
    with open('filename.pkl', 'rb') as f:
        clf = pickle.load(f)

2) Using joblib

    # older scikit-learn versions used: from sklearn.externals import joblib
    import joblib
    # now you can save it to a file
    joblib.dump(clf, 'filename.pkl')
    # and later you can load it
    clf = joblib.load('filename.pkl')

Once again, it is helpful to read the documentation sections mentioned above.

User · Answer

Classifiers are just objects that can be pickled and dumped like any other. To continue your example:

    import pickle  # on Python 2, use cPickle instead

    # save the classifier
    with open('my_dumped_classifier.pkl', 'wb') as fid:
        pickle.dump(gnb, fid)

    # load it again
    with open('my_dumped_classifier.pkl', 'rb') as fid:
        gnb_loaded = pickle.load(fid)

Edit: if you are using a sklearn Pipeline with custom transformers that cannot be serialized by pickle (nor by joblib), then Neuraxle's custom ML Pipeline saving is a solution: you can define your own custom step savers on a per-step basis. When saving, the saver is called for each step that defines one, and joblib is used by default for steps without a saver.

User · Answer

You can also use joblib.dump and joblib.load, which are much more efficient at handling numerical arrays than the default Python pickler.

joblib is installed as a dependency of scikit-learn:

    >>> import joblib
    >>> from sklearn.datasets import load_digits
    >>> from sklearn.linear_model import SGDClassifier

    >>> digits = load_digits()
    >>> clf = SGDClassifier().fit(digits.data, digits.target)
    >>> clf.score(digits.data, digits.target)  # evaluate training error
    0.9526989426822482

    >>> filename = '/tmp/digits_classifier.joblib.pkl'
    >>> _ = joblib.dump(clf, filename, compress=9)

    >>> clf2 = joblib.load(filename)
    >>> clf2
    SGDClassifier(alpha=0.0001, class_weight=None, epsilon=0.1, eta0=0.0,
           fit_intercept=True, learning_rate='optimal', loss='hinge', n_iter=5,
           n_jobs=1, penalty='l2', power_t=0.5, rho=0.85, seed=0,
           shuffle=False, verbose=0, warm_start=False)
    >>> clf2.score(digits.data, digits.target)
    0.9526989426822482

Edit: in Python 3.8+ it's now possible to use pickle for efficient pickling of objects with large numerical arrays as attributes if you use pickle protocol 5 (which is not the default).
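As a rough sketch of the protocol 5 route mentioned in the edit, the standard pickle module accepts a protocol argument. The file location and the random_state value below are assumptions chosen for illustration, and this requires Python 3.8 or newer.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier

digits = load_digits()
clf = SGDClassifier(random_state=0).fit(digits.data, digits.target)

# Hypothetical location; any writable path works.
filename = Path(tempfile.mkdtemp()) / "digits_classifier.pkl"

# Request protocol 5 explicitly.
with open(filename, "wb") as f:
    pickle.dump(clf, f, protocol=5)

# Loading needs no protocol argument; it is detected from the file.
with open(filename, "rb") as f:
    clf2 = pickle.load(f)

# The restored model should make the same predictions as the original.
same_predictions = (clf.predict(digits.data) == clf2.predict(digits.data)).all()
```

Unlike joblib.dump with compress=9, this writes an uncompressed file, so it trades disk space for not needing joblib at load time.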
