Label encoding across multiple columns in scikit-learn

Question

I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the dataframe has many (50+) columns, I want to avoid creating a LabelEncoder object for each column; I'd rather just have one big LabelEncoder object that works across all my columns of data.

Throwing the entire DataFrame into LabelEncoder creates the below error. Please bear in mind that I'm using dummy data here; in actuality I'm dealing with about 50 columns of string-labeled data, so I need a solution that doesn't reference any columns by name.

    import pandas
    from sklearn import preprocessing

    df = pandas.DataFrame({
        'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
        'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
        'location': ['San Diego', 'New York', 'New York', 'San Diego', 'San Diego',
                     'New York']
    })

    le = preprocessing.LabelEncoder()

    le.fit(df)

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 103, in fit
        y = column_or_1d(y, warn=True)
      File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 306, in column_or_1d
        raise ValueError("bad input shape {0}".format(shape))
    ValueError: bad input shape (6, 3)

Any thoughts on how to get around this problem?

User · Answer

Using Neuraxle

TLDR: You can use the FlattenForEach wrapper class to transform your df like this: FlattenForEach(LabelEncoder(), then_unflatten=True).fit_transform(df).

With this method, your label encoder will be able to fit and transform within a regular scikit-learn Pipeline. Let's simply import:

    import numpy as np
    from sklearn.preprocessing import LabelEncoder
    from neuraxle.steps.column_transformer import ColumnTransformer
    from neuraxle.steps.loop import FlattenForEach

Same shared encoder for columns:

Here is how one shared LabelEncoder will be applied on all the data to encode it:

    p = FlattenForEach(LabelEncoder(), then_unflatten=True)

Result:

    p, predicted_output = p.fit_transform(df.values)
    expected_output = np.array([
        [6, 7, 6, 8, 7, 7],
        [1, 3, 0, 1, 5, 3],
        [4, 2, 2, 4, 4, 2]
    ]).transpose()
    assert np.array_equal(predicted_output, expected_output)

Different encoders per column:

And here is how a first standalone LabelEncoder will be applied on the pets column, and a second will be shared for the columns owner and location. So to be precise, we here have a mix of different and shared label encoders:

    p = ColumnTransformer([
        # A different encoder will be used for column 0 with name "pets":
        (0, FlattenForEach(LabelEncoder(), then_unflatten=True)),
        # A shared encoder will be used for column 1 and 2, "owner" and "location":
        ([1, 2], FlattenForEach(LabelEncoder(), then_unflatten=True)),
    ], n_dimension=2)

Result:

    p, predicted_output = p.fit_transform(df.values)
    expected_output = np.array([
        [0, 1, 0, 2, 1, 1],
        [1, 3, 0, 1, 5, 3],
        [4, 2, 2, 4, 4, 2]
    ]).transpose()
    assert np.array_equal(predicted_output, expected_output)

User · Answer

Following up on the comments raised on the solution of @PriceHardman, I would propose the following version of the class:

    # Note: `pdu` is a helper module that this answer does not define;
    # its two calls only validate the column names.
    class LabelEncodingColoumns(BaseEstimator, TransformerMixin):
        def __init__(self, cols=None):
            pdu._is_cols_input_valid(cols)
            self.cols = cols
            self.les = {col: LabelEncoder() for col in cols}
            self._is_fitted = False

        def transform(self, df, **transform_params):
            """
            Scaling ``cols`` of ``df`` using the fitting

            Parameters
            ----------
            df : DataFrame
                DataFrame to be preprocessed
            """
            if not self._is_fitted:
                raise NotFittedError("Fitting was not performed")
            pdu._is_cols_subset_of_df_cols(self.cols, df)

            df = df.copy()

            label_enc_dict = {}
            for col in self.cols:
                label_enc_dict[col] = self.les[col].transform(df[col])

            labelenc_cols = pd.DataFrame(
                label_enc_dict,
                # The index of the resulting DataFrame should be assigned and
                # equal to the one of the original DataFrame. Otherwise, upon
                # concatenation NaNs will be introduced.
                index=df.index
            )

            for col in self.cols:
                df[col] = labelenc_cols[col]
            return df

        def fit(self, df, y=None, **fit_params):
            """
            Fitting the preprocessing

            Parameters
            ----------
            df : DataFrame
                Data to use for fitting.
                In many cases, should be ``X_train``.
            """
            pdu._is_cols_subset_of_df_cols(self.cols, df)
            for col in self.cols:
                self.les[col].fit(df[col])
            self._is_fitted = True
            return self

This class fits the encoder on the training set and uses the fitted version when transforming. (The original answer linked to the initial version of the code.)

User · Answer

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    train = pd.read_csv('train.csv')

    # X = train.loc[:, ['waterpoint_type_group', 'status', 'waterpoint_type', 'source_class']].values
    # Create a label encoder object
    def MultiLabelEncoder(columnlist, dataframe):
        for i in columnlist:
            labelencoder_X = LabelEncoder()
            dataframe[i] = labelencoder_X.fit_transform(dataframe[i])

    columnlist = ['waterpoint_type_group', 'status', 'waterpoint_type', 'source_class', 'source_type']
    MultiLabelEncoder(columnlist, train)

Here I am reading a CSV from a location, and I pass the function the list of columns I want to label-encode along with the dataframe I want to apply this to.

User · Answer

After lots of searching and experimentation with some answers here and elsewhere, I think your answer is here:

    pd.DataFrame(columns=df.columns,
                 data=LabelEncoder().fit_transform(df.values.flatten()).reshape(df.shape))

This will preserve category names across columns:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    df = pd.DataFrame([['A', 'B', 'C', 'D', 'E', 'F', 'G', 'I', 'K', 'H'],
                       ['A', 'E', 'H', 'F', 'G', 'I', 'K', '', '', ''],
                       ['A', 'C', 'I', 'F', 'H', 'G', '', '', '', '']],
                      columns=['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10'])

    pd.DataFrame(columns=df.columns,
                 data=LabelEncoder().fit_transform(df.values.flatten()).reshape(df.shape))

       A1  A2  A3  A4  A5  A6  A7  A8  A9  A10
    0   1   2   3   4   5   6   7   9  10    8
    1   1   5   8   6   7   9  10   0   0    0
    2   1   3   9   6   8   7   0   0   0    0
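To make the flatten-and-reshape idea concrete on the question's data, here is a minimal sketch. It assumes the question's three-column df (with 'San Diego' and 'New York' written with spaces); the names `codes`, `encoded` and `decoded` are illustrative and not from the answer. It also keeps the fitted encoder around so the codes can be mapped back, which the one-liner above does not do.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "pets": ["cat", "dog", "cat", "monkey", "dog", "dog"],
    "owner": ["Champ", "Ron", "Brick", "Champ", "Veronica", "Ron"],
    "location": ["San Diego", "New York", "New York",
                 "San Diego", "San Diego", "New York"],
})

le = LabelEncoder()

# ravel() flattens the 2-D values to 1-D, the only shape LabelEncoder accepts;
# reshape() then restores the original (rows, columns) layout.
codes = le.fit_transform(df.to_numpy().ravel()).reshape(df.shape)
encoded = pd.DataFrame(codes, columns=df.columns, index=df.index)

# Because a single vocabulary is shared by all columns, one encoder is
# enough to decode the whole frame again.
decoded = pd.DataFrame(
    le.inverse_transform(encoded.to_numpy().ravel()).reshape(df.shape),
    columns=df.columns,
    index=df.index,
)
```

One consequence of the shared vocabulary is that codes are not contiguous within a column: here 'cat' is numbered after all the owner and location names, because the classes are sorted across the whole frame.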

User · Answer

It is possible to do this all in pandas directly, and it is well suited to a unique ability of the replace method.

First, let's make a dictionary of dictionaries mapping the columns and their values to their new replacement values.

    transform_dict = {}
    for col in df.columns:
        cats = pd.Categorical(df[col]).categories
        d = {}
        for i, cat in enumerate(cats):
            d[cat] = i
        transform_dict[col] = d

    transform_dict

    {'location': {'New York': 0, 'San Diego': 1},
     'owner': {'Brick': 0, 'Champ': 1, 'Ron': 2, 'Veronica': 3},
     'pets': {'cat': 0, 'dog': 1, 'monkey': 2}}

Since this will always be a one-to-one mapping, we can invert the inner dictionary to get a mapping of the new values back to the original.

    inverse_transform_dict = {}
    for col, d in transform_dict.items():
        inverse_transform_dict[col] = {v: k for k, v in d.items()}

    inverse_transform_dict

    {'location': {0: 'New York', 1: 'San Diego'},
     'owner': {0: 'Brick', 1: 'Champ', 2: 'Ron', 3: 'Veronica'},
     'pets': {0: 'cat', 1: 'dog', 2: 'monkey'}}

Now, we can use the unique ability of the replace method to take a nested dictionary and use the outer keys as the columns, and the inner keys as the values we would like to replace.

    df.replace(transform_dict)

       location  owner  pets
    0         1      1     0
    1         0      2     1
    2         0      0     0
    3         1      1     2
    4         1      3     1
    5         0      2     1

We can easily go back to the original by again chaining the replace method:

    df.replace(transform_dict).replace(inverse_transform_dict)

        location     owner    pets
    0  San Diego     Champ     cat
    1   New York       Ron     dog
    2   New York     Brick     cat
    3  San Diego     Champ  monkey
    4  San Diego  Veronica     dog
    5   New York       Ron     dog

User · Answer

I checked the source code (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/preprocessing/label.py) of LabelEncoder. It is based on a set of numpy transformations, one of which is np.unique(), and this function only takes 1-d array input. (Correct me if I am wrong.)

Very rough idea: first, identify which columns need a LabelEncoder, then loop through each column.

    def cat_var(df):
        """Identify categorical features.

        Parameters
        ----------
        df: original df after missing operations

        Returns
        -------
        cat_var_df: summary df with col index and col name for all categorical vars
        """
        col_type = df.dtypes
        col_names = list(df)

        cat_var_index = [i for i, x in enumerate(col_type) if x == 'object']
        cat_var_name = [x for i, x in enumerate(col_names) if i in cat_var_index]

        cat_var_df = pd.DataFrame({'cat_ind': cat_var_index,
                                   'cat_name': cat_var_name})

        return cat_var_df


    from sklearn.preprocessing import LabelEncoder

    def column_encoder(df, cat_var_list):
        """Encoding categorical feature in the dataframe

        Parameters
        ----------
        df: input dataframe
        cat_var_list: categorical feature index and name, from cat_var function

        Return
        ------
        df: new dataframe where categorical features are encoded
        label_list: classes_ attribute for all encoded features
        """

        label_list = []
        cat_var_df = cat_var(df)
        cat_list = cat_var_df.loc[:, 'cat_name']

        for index, cat_feature in enumerate(cat_list):

            le = LabelEncoder()

            le.fit(df.loc[:, cat_feature])
            label_list.append(list(le.classes_))

            df.loc[:, cat_feature] = le.transform(df.loc[:, cat_feature])

        return df, label_list

The returned df is the one after encoding, and label_list shows what all those values mean in the corresponding column. This is a snippet from a data processing script I wrote for work. Let me know if you think there could be any further improvement.

EDIT: Just want to mention here that the methods above work best with data frames that have no missing values. I am not sure how they behave on data frames that contain missing data. (I ran a missing-data procedure before executing the methods above.)

User · Answer

This is a year and a half after the fact, but I too needed to be able to .transform() multiple pandas dataframe columns at once (and be able to .inverse_transform() them as well). This expands upon the excellent suggestion of @PriceHardman above:

    import numpy as np
    from sklearn.preprocessing import LabelEncoder

    class MultiColumnLabelEncoder(LabelEncoder):
        """
        Wraps sklearn LabelEncoder functionality for use on multiple columns of a
        pandas dataframe.
        """
        def __init__(self, columns=None):
            self.columns = columns

        def fit(self, dframe):
            """
            Fit label encoder to pandas columns.

            Access individual column classes via indexing `self.all_classes_`

            Access individual column encoders via indexing
            `self.all_encoders_`
            """
            # if columns are provided, iterate through and get `classes_`
            if self.columns is not None:
                # ndarray to hold LabelEncoder().classes_ for each
                # column; should match the shape of specified `columns`
                self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                               dtype=object)
                self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                                dtype=object)
                for idx, column in enumerate(self.columns):
                    # fit LabelEncoder to get `classes_` for the column
                    le = LabelEncoder()
                    le.fit(dframe.loc[:, column].values)
                    # append the `classes_` to our ndarray container
                    self.all_classes_[idx] = (column,
                                              np.array(le.classes_.tolist(),
                                                       dtype=object))
                    # append this column's encoder
                    self.all_encoders_[idx] = le
            else:
                # no columns specified; assume all are to be encoded
                self.columns = dframe.iloc[:, :].columns
                self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                               dtype=object)
                # the encoder container must exist before it is indexed below
                self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                                dtype=object)
                for idx, column in enumerate(self.columns):
                    le = LabelEncoder()
                    le.fit(dframe.loc[:, column].values)
                    self.all_classes_[idx] = (column,
                                              np.array(le.classes_.tolist(),
                                                       dtype=object))
                    self.all_encoders_[idx] = le
            return self

        def fit_transform(self, dframe):
            """
            Fit label encoder and return encoded labels.

            Access individual column classes via indexing
            `self.all_classes_`

            Access individual column encoders via indexing
            `self.all_encoders_`

            Access individual column encoded labels via indexing
            `self.all_labels_`
            """
            # if columns are provided, iterate through and get `classes_`
            if self.columns is not None:
                # ndarray to hold LabelEncoder().classes_ for each
                # column; should match the shape of specified `columns`
                self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                               dtype=object)
                self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                                dtype=object)
                self.all_labels_ = np.ndarray(shape=self.columns.shape,
                                              dtype=object)
                for idx, column in enumerate(self.columns):
                    # instantiate LabelEncoder
                    le = LabelEncoder()
                    # fit and transform labels in the column
                    dframe.loc[:, column] = \
                        le.fit_transform(dframe.loc[:, column].values)
                    # append the `classes_` to our ndarray container
                    self.all_classes_[idx] = (column,
                                              np.array(le.classes_.tolist(),
                                                       dtype=object))
                    self.all_encoders_[idx] = le
                    self.all_labels_[idx] = le
            else:
                # no columns specified; assume all are to be encoded
                self.columns = dframe.iloc[:, :].columns
                self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                               dtype=object)
                # the encoder container must exist before it is indexed below
                self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                                dtype=object)
                for idx, column in enumerate(self.columns):
                    le = LabelEncoder()
                    dframe.loc[:, column] = le.fit_transform(
                        dframe.loc[:, column].values)
                    self.all_classes_[idx] = (column,
                                              np.array(le.classes_.tolist(),
                                                       dtype=object))
                    self.all_encoders_[idx] = le
            return dframe.loc[:, self.columns].values

        def transform(self, dframe):
            """
            Transform labels to normalized encoding.
            """
            if self.columns is not None:
                for idx, column in enumerate(self.columns):
                    dframe.loc[:, column] = self.all_encoders_[
                        idx].transform(dframe.loc[:, column].values)
            else:
                self.columns = dframe.iloc[:, :].columns
                for idx, column in enumerate(self.columns):
                    dframe.loc[:, column] = self.all_encoders_[idx]\
                        .transform(dframe.loc[:, column].values)
            return dframe.loc[:, self.columns].values

        def inverse_transform(self, dframe):
            """
            Transform labels back to original encoding.
            """
            if self.columns is not None:
                for idx, column in enumerate(self.columns):
                    dframe.loc[:, column] = self.all_encoders_[idx]\
                        .inverse_transform(dframe.loc[:, column].values)
            else:
                self.columns = dframe.iloc[:, :].columns
                for idx, column in enumerate(self.columns):
                    dframe.loc[:, column] = self.all_encoders_[idx]\
                        .inverse_transform(dframe.loc[:, column].values)
            return dframe.loc[:, self.columns].values

Example:

If df and df_copy are mixed-type pandas dataframes, you can apply the MultiColumnLabelEncoder() to the dtype=object columns in the following way:

    # get `object` columns
    df_object_columns = df.iloc[:, :].select_dtypes(include=['object']).columns
    df_copy_object_columns = df_copy.iloc[:, :].select_dtypes(include=['object']).columns

    # instantiate `MultiColumnLabelEncoder`
    mcle = MultiColumnLabelEncoder(columns=df_object_columns)

    # fit to `df` data
    mcle.fit(df)

    # transform the `df` data
    mcle.transform(df)

    # returns output like below
    array([[1, 0, 0, ..., 1, 1, 0],
           [0, 5, 1, ..., 1, 1, 2],
           [1, 1, 1, ..., 1, 1, 2],
           ...,
           [3, 5, 1, ..., 1, 1, 2]])

    # transform `df_copy` data
    mcle.transform(df_copy)

    # returns output like below (assuming the respective columns
    # of `df_copy` contain the same unique values as that particular
    # column in `df`)
    array([[1, 0, 0, ..., 1, 1, 0],
           [0, 5, 1, ..., 1, 1, 2],
           [1, 1, 1, ..., 1, 1, 2],
           ...,
           [3, 5, 1, ..., 1, 1, 2]])

    # inverse `df` data
    mcle.inverse_transform(df)

    # outputs data like below
    array([['August', 'Friday', '2013', ..., 'N', 'N', 'CA'],
           ['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
           ['August', 'Monday', '2014', ..., 'N', 'N', 'NJ'],
           ...,
           ['February', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
           ['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
           ['March', 'Tuesday', '2013', ..., 'N', 'N', 'NJ']], dtype=object)

    # inverse `df_copy` data
    mcle.inverse_transform(df_copy)

    # outputs data like below
    array([['August', 'Friday', '2013', ..., 'N', 'N', 'CA'],
           ['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
           ['August', 'Monday', '2014', ..., 'N', 'N', 'NJ'],
           ...,
           ['February', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
           ['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
           ['March', 'Tuesday', '2013', ..., 'N', 'N', 'NJ']], dtype=object)

You can access individual column classes, column labels, and column encoders used to fit each column via indexing:

    mcle.all_classes_
    mcle.all_encoders_
    mcle.all_labels_

User · Answer

Instead of LabelEncoder we can use OrdinalEncoder from scikit-learn, which allows multi-column encoding.

    Encode categorical features as an integer array. The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are converted to ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature.

    >>> from sklearn.preprocessing import OrdinalEncoder
    >>> enc = OrdinalEncoder()
    >>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
    >>> enc.fit(X)
    OrdinalEncoder()
    >>> enc.categories_
    [array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
    >>> enc.transform([['Female', 3], ['Male', 1]])
    array([[0., 2.],
           [1., 0.]])

Both the description and the example were copied from its documentation page, which you can find here: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder
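As a rough illustration of how this applies to the question, the sketch below runs OrdinalEncoder over the question's whole DataFrame in one call. The frame is the question's dummy data; the variable names and the wrapping back into a DataFrame are assumptions added for the example, not part of the documentation excerpt.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "pets": ["cat", "dog", "cat", "monkey", "dog", "dog"],
    "owner": ["Champ", "Ron", "Brick", "Champ", "Veronica", "Ron"],
    "location": ["San Diego", "New York", "New York",
                 "San Diego", "San Diego", "New York"],
})

enc = OrdinalEncoder()

# fit_transform accepts the 2-D frame directly and learns one category list
# per column; the result is a float array, wrapped back into a DataFrame here.
encoded = pd.DataFrame(enc.fit_transform(df), columns=df.columns, index=df.index)

# The per-column categories are kept on the encoder, so decoding is one call.
decoded = pd.DataFrame(
    enc.inverse_transform(encoded.to_numpy()),
    columns=df.columns,
    index=df.index,
)
```

Unlike a single shared LabelEncoder, each column gets its own numbering starting at 0, and no column has to be referenced by name.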

User · Answer

You can easily do this though:

    df.apply(LabelEncoder().fit_transform)

EDIT2: In scikit-learn 0.20, the recommended way is

    OneHotEncoder().fit_transform(df)

as the OneHotEncoder now supports string input. Applying OneHotEncoder only to certain columns is possible with the ColumnTransformer.

EDIT3: Since this answer is over a year old and generated many upvotes (including a bounty), I should probably extend this further.

For inverse_transform and transform, you have to do a little bit of a hack.

    from collections import defaultdict
    d = defaultdict(LabelEncoder)

With this, you now retain every column's LabelEncoder in a dictionary.

    # Encoding the variable
    fit = df.apply(lambda x: d[x.name].fit_transform(x))

    # Inverse the encoded
    fit.apply(lambda x: d[x.name].inverse_transform(x))

    # Using the dictionary to label future data
    df.apply(lambda x: d[x.name].transform(x))

MOAR EDIT: Using Neuraxle's FlattenForEach step, it's possible to use the same LabelEncoder on all the flattened data at once:

    FlattenForEach(LabelEncoder(), then_unflatten=True).fit_transform(df)

For using separate LabelEncoders depending on your columns of data, or if only some of your columns need to be label-encoded and not others, a ColumnTransformer allows more control over your column selection and your LabelEncoder instances.
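The defaultdict trick above can be checked end to end with a small sketch. It assumes the question's dummy data; `encoders`, `encoded`, `decoded` and `new_rows` are names invented for this example. The defaultdict creates a fresh LabelEncoder the first time a column name is looked up, and hands back the same fitted encoder on later lookups.

```python
from collections import defaultdict

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "pets": ["cat", "dog", "cat", "monkey", "dog", "dog"],
    "owner": ["Champ", "Ron", "Brick", "Champ", "Veronica", "Ron"],
    "location": ["San Diego", "New York", "New York",
                 "San Diego", "San Diego", "New York"],
})

# One LabelEncoder per column name, created lazily on first access.
encoders = defaultdict(LabelEncoder)

# Fit and encode every column with its own encoder.
encoded = df.apply(lambda col: encoders[col.name].fit_transform(col))

# Decode with the same per-column encoders.
decoded = encoded.apply(lambda col: encoders[col.name].inverse_transform(col))

# Later data is encoded with transform(), reusing the fitted vocabularies.
new_rows = pd.DataFrame({
    "pets": ["monkey", "cat"],
    "owner": ["Brick", "Veronica"],
    "location": ["New York", "San Diego"],
})
new_encoded = new_rows.apply(lambda col: encoders[col.name].transform(col))
```

Note that transform() raises an error for a label the encoder never saw during fitting, so new data must stay within the fitted vocabulary of each column.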

User · Answer

How about this?

    from sklearn.preprocessing import LabelEncoder

    # Kept at module level so that 'decode' can reuse the encoders fitted by 'encode'
    LabelEncoders = {}

    def MultiColumnLabelEncode(choice, columns, X):
        if choice == 'encode':
            for col in columns:
                LabelEncoders[col] = LabelEncoder()
                X[col] = LabelEncoders[col].fit_transform(X[col])
        elif choice == 'decode':
            for col in columns:
                X[col] = LabelEncoders[col].inverse_transform(X[col])
        else:
            print('Please select correct parameter "choice". Available parameters: encode/decode')

It is not the most efficient, but it works and it is super simple.

User · Answer

If we have a single column, label encoding and its inverse transform are easy. Here is how to do it when there are multiple columns, in Python:

    from sklearn import preprocessing

    def stringtocategory(dataset):
        '''
        @author puja.sharma
        @see The function label encodes the object type columns and gives label
             encoded and inverse transform of the label encoded data
        @param dataset dataframe on whose columns the label encoding has to be done
        @return label encoded and inverse transform of the label encoded data
        '''
        # work on copies so the input dataframe is left untouched
        data_original = dataset.copy()
        data_tranformed = dataset.copy()
        for y in dataset.columns:
            # check the dtype of the column; object type contains strings or chars
            if dataset[y].dtype == object:
                print("The string type features are : " + y)
                le = preprocessing.LabelEncoder()
                le.fit(dataset[y].unique())
                # label encoded data
                data_tranformed[y] = le.transform(dataset[y])
                # inverse label transform data
                data_original[y] = le.inverse_transform(data_tranformed[y])
        return data_tranformed, data_original

User · Answer

Assuming you are simply trying to get a sklearn.preprocessing.LabelEncoder() object that can be used to represent your columns, all you have to do is:

    le.fit(df.columns)

In the above code you will have a unique number corresponding to each column. More precisely, you will have a 1:1 mapping of df.columns to le.transform(df.columns.to_numpy()). To get a column's encoding, simply pass it to le.transform(...). As an example, the following will get the encoding for each column:

    le.transform(df.columns.to_numpy())

Assuming you want to create a sklearn.preprocessing.LabelEncoder() object for all of your row labels, you can do the following:

    le.fit([y for x in df.to_numpy() for y in x])

In this case, you most likely have non-unique row labels (as shown in your question). To see what classes the encoder created you can do le.classes_. You'll note that this should have the same elements as set(y for x in df.to_numpy() for y in x). Once again, to convert a row label to an encoded label use le.transform(...). As an example, if you want to retrieve the label for the first column in the df.columns array and the first row, you could do this:

    le.transform([df.at[0, df.columns[0]]])

The question you had in your comment is a bit more complicated, but can still be accomplished:

    le.fit([str(z) for z in set((x[0], y) for x in df.items() for y in x[1])])

The above code does the following:

1. Makes a unique combination of all of the pairs of (column, row).
2. Represents each pair as a string version of the tuple. This is a workaround for the LabelEncoder class not supporting tuples as a class name.
3. Fits the new items to the LabelEncoder.

Now, using this new model is a bit more complicated. Assuming we want to extract the representation for the same item we looked up in the previous example (the first column in df.columns and the first row), we can do this:

    le.transform([str((df.columns[0], df.at[0, df.columns[0]]))])

Remember that each lookup is now a string representation of a tuple that contains the (column, row).

User · Answer

Mainly used @Alexander's answer but had to make some changes:

    cols_need_mapped = ['col1', 'col2']

    mapper = {col: {cat: n for n, cat in enumerate(df[col].astype('category').cat.categories)}
              for col in df[cols_need_mapped]}

    for c in cols_need_mapped:
        df[c] = df[c].map(mapper[c])

Then, to re-use it in the future, you can just save the output to a JSON document, and when you need it you read it in and use the .map() function like I did above.
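The save-and-reload step described above might look like the following sketch. The toy frame, the column names col1/col2, and the use of an in-memory JSON string instead of a file are all assumptions made to keep the example self-contained.

```python
import json

import pandas as pd

# Hypothetical toy data standing in for the real frame.
df = pd.DataFrame({
    "col1": ["b", "a", "c", "a"],
    "col2": ["x", "y", "x", "x"],
    "value": [1, 2, 3, 4],
})
cols_need_mapped = ["col1", "col2"]

# Build {column: {category: code}} from the sorted categories of each column.
mapper = {
    col: {cat: n for n, cat in enumerate(df[col].astype("category").cat.categories)}
    for col in cols_need_mapped
}

# Serialise the mapping; json.dump(mapper, file) would write it to disk instead.
payload = json.dumps(mapper)

# Later: reload the mapping and apply it to the data with .map().
restored = json.loads(payload)
encoded = df.copy()
for c in cols_need_mapped:
    encoded[c] = encoded[c].map(restored[c])
```

Two caveats follow from the way JSON and .map() work: JSON object keys are always strings, so this round trip is only lossless when the categories are strings, and any value missing from the saved mapping becomes NaN rather than raising an error.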

User · Answer

If all your features are of type object, then the first answer written above works well: https://stackoverflow.com/a/31939145/5840973.

But suppose we have mixed-type columns. Then we can fetch the list of feature names of type object programmatically and label-encode them.

    # Fetch features of type object
    objFeatures = dataframe.select_dtypes(include="object").columns

    # Iterate a loop for features of type object
    from sklearn import preprocessing
    le = preprocessing.LabelEncoder()

    for feat in objFeatures:
        dataframe[feat] = le.fit_transform(dataframe[feat].astype(str))

    dataframe.info()

User · Answer

No, LabelEncoder does not do this. It takes 1-d arrays of class labels and produces 1-d arrays. It's designed to handle class labels in classification problems, not arbitrary data, and any attempt to force it into other uses will require code to transform the actual problem to the problem it solves (and the solution back to the original space).

User · Answer

Since scikit-learn 0.20 you can use sklearn.compose.ColumnTransformer and sklearn.preprocessing.OneHotEncoder.

If you only have categorical variables, use OneHotEncoder directly:

    from sklearn.preprocessing import OneHotEncoder

    OneHotEncoder(handle_unknown='ignore').fit_transform(df)

If you have heterogeneously typed features (note that each tuple is given as (transformer, columns), the order current scikit-learn releases expect):

    from sklearn.compose import make_column_transformer
    from sklearn.preprocessing import RobustScaler
    from sklearn.preprocessing import OneHotEncoder

    categorical_columns = ['pets', 'owner', 'location']
    numerical_columns = ['age', 'weight', 'height']
    column_trans = make_column_transformer(
        (OneHotEncoder(handle_unknown='ignore'), categorical_columns),
        (RobustScaler(), numerical_columns),
    )
    column_trans.fit_transform(df)

More options in the documentation: http://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data
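For readers who want to run the heterogeneous case, here is a small sketch. The numeric `age` column and its values are invented for illustration (the question's data has no numeric columns), and `sparse_threshold=0` is set only so the result comes back as a dense array that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Hypothetical mixed-type frame: two string columns plus one numeric column.
df = pd.DataFrame({
    "pets": ["cat", "dog", "cat", "monkey"],
    "owner": ["Champ", "Ron", "Brick", "Champ"],
    "age": [3.0, 5.0, 2.0, 7.0],
})

column_trans = make_column_transformer(
    # Each tuple is (transformer, columns).
    (OneHotEncoder(handle_unknown="ignore"), ["pets", "owner"]),
    (RobustScaler(), ["age"]),
    # Always return a dense ndarray instead of a sparse matrix.
    sparse_threshold=0,
)

features = column_trans.fit_transform(df)
```

The output has one indicator column per category (three for pets, three for owner) followed by the scaled age column, so one-hot encoding widens the data, whereas label encoding keeps one column per feature.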

User · Answer

This does not directly answer your question (for which Naputipulu Jon and PriceHardman have fantastic replies).

However, for the purpose of a few classification tasks etc., you could use

    pandas.get_dummies(input_df)

This takes a dataframe with categorical data and returns a dataframe with binary values. Variable values are encoded into column names in the resulting dataframe.
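A minimal sketch of what get_dummies produces, using a cut-down version of the question's data (the three-row frame is an assumption made for brevity):

```python
import pandas as pd

df = pd.DataFrame({
    "pets": ["cat", "dog", "monkey"],
    "location": ["San Diego", "New York", "New York"],
})

# Every string column is expanded into one indicator column per distinct
# value, named "<column>_<value>".
dummies = pd.get_dummies(df)
```

Here `dummies` has the columns pets_cat, pets_dog, pets_monkey, location_New York and location_San Diego, which is the "values encoded into column names" behaviour described above.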

User · Answer

The problem is the shape of the data (a pandas DataFrame) you are passing to the fit function. You have to pass a 1-d list.

User · Answer

A short way to LabelEncoder() multiple columns with a dict():

    from sklearn.preprocessing import LabelEncoder

    le_dict = {col: LabelEncoder() for col in columns}
    for col in columns:
        le_dict[col].fit_transform(df[col])

and you can use this le_dict to label-encode any other column:

    le_dict[col].transform(df_another[col])

User · Answer

We don't need a LabelEncoder.

You can convert the columns to categoricals and then get their codes. I used a dictionary comprehension below to apply this process to every column and wrap the result back into a dataframe of the same shape with identical indices and column names.

    >>> pd.DataFrame({col: df[col].astype('category').cat.codes for col in df}, index=df.index)
       location  owner  pets
    0         1      1     0
    1         0      2     1
    2         0      0     0
    3         1      1     2
    4         1      3     1
    5         0      2     1

To create a mapping dictionary, you can just enumerate the categories using a dictionary comprehension:

    >>> {col: {n: cat for n, cat in enumerate(df[col].astype('category').cat.categories)}
         for col in df}

    {'location': {0: 'New York', 1: 'San Diego'},
     'owner': {0: 'Brick', 1: 'Champ', 2: 'Ron', 3: 'Veronica'},
     'pets': {0: 'cat', 1: 'dog', 2: 'monkey'}}
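The categorical approach can also be reversed without scikit-learn. The sketch below assumes the question's dummy data; keeping the `categories` dictionary and indexing it with the codes is one possible way to decode, not something the answer itself spells out.

```python
import pandas as pd

df = pd.DataFrame({
    "pets": ["cat", "dog", "cat", "monkey", "dog", "dog"],
    "owner": ["Champ", "Ron", "Brick", "Champ", "Veronica", "Ron"],
    "location": ["San Diego", "New York", "New York",
                 "San Diego", "San Diego", "New York"],
})

# Remember the sorted categories of each column so the codes can be reversed.
categories = {col: df[col].astype("category").cat.categories for col in df}

# cat.codes gives the integer position of each value in its column's categories.
encoded = pd.DataFrame(
    {col: df[col].astype("category").cat.codes for col in df},
    index=df.index,
)

# Decoding is a positional lookup of each code in the saved categories.
decoded = pd.DataFrame(
    {col: categories[col].to_numpy()[encoded[col].to_numpy()] for col in df},
    index=df.index,
)
```

One difference from LabelEncoder worth knowing: missing values are not an error here, they simply receive the code -1, so they need separate handling before the positional lookup.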

User · Answer

As mentioned by larsmans, LabelEncoder() only takes a 1-d array as an argument. That said, it is quite easy to roll your own label encoder that operates on multiple columns of your choosing and returns a transformed dataframe. My code here is based in part on Zac Stewart's excellent blog post.

Creating a custom encoder involves simply creating a class that responds to the fit(), transform(), and fit_transform() methods. In your case, a good start might be something like this:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    from sklearn.pipeline import Pipeline

    # Create some toy data in a Pandas dataframe
    fruit_data = pd.DataFrame({
        'fruit':  ['apple', 'orange', 'pear', 'orange'],
        'color':  ['red', 'orange', 'green', 'green'],
        'weight': [5, 6, 3, 4]
    })

    class MultiColumnLabelEncoder:
        def __init__(self, columns=None):
            self.columns = columns  # array of column names to encode

        def fit(self, X, y=None):
            return self  # not relevant here

        def transform(self, X):
            '''
            Transforms columns of X specified in self.columns using
            LabelEncoder(). If no columns specified, transforms all
            columns in X.
            '''
            output = X.copy()
            if self.columns is not None:
                for col in self.columns:
                    output[col] = LabelEncoder().fit_transform(output[col])
            else:
                for colname, col in output.items():
                    output[colname] = LabelEncoder().fit_transform(col)
            return output

        def fit_transform(self, X, y=None):
            return self.fit(X, y).transform(X)

Suppose we want to encode our two categorical attributes (fruit and color), while leaving the numeric attribute weight alone. We could do this as follows:

    MultiColumnLabelEncoder(columns=['fruit', 'color']).fit_transform(fruit_data)

(The original answer showed the fruit_data dataset before and after this transformation as images.)

Passing it a dataframe consisting entirely of categorical variables and omitting the columns parameter will result in every column being encoded (which I believe is what you were originally looking for):

    MultiColumnLabelEncoder().fit_transform(fruit_data.drop('weight', axis=1))

(Again, the original answer showed the before-and-after frames as images.)

Note that it'll probably choke when it tries to encode attributes that are already numeric (add some code to handle this if you like).

Another nice feature about this is that we can use this custom transformer in a pipeline:

    encoding_pipeline = Pipeline([
        ('encoding', MultiColumnLabelEncoder(columns=['fruit', 'color']))
        # add more pipeline steps as needed
    ])
    encoding_pipeline.fit_transform(fruit_data)

User · Answer

If you have both numerical and categorical data in the dataframe, you can use the following. Here X is my dataframe, which has both categorical and numerical variables.

    from sklearn import preprocessing
    le = preprocessing.LabelEncoder()

    for i in range(0, X.shape[1]):
        if X.dtypes.iloc[i] == 'object':
            X[X.columns[i]] = le.fit_transform(X[X.columns[i]])

Note: this technique is good if you are not interested in converting the values back, because the single encoder is refitted for every column and only remembers the last one.
