How can I one-hot encode in Python?

Question

I have a machine learning classification problem with 80% categorical variables. Must I use one-hot encoding if I want to use some classifier for the classification? Can I pass the data to a classifier without the encoding?

I am trying to do the following for feature selection:

1. I read the train file:

    num_rows_to_read = 10000
    train_small = pd.read_csv("../../dataset/train.csv", nrows=num_rows_to_read)

2. I change the type of the categorical features to 'category':

    non_categorial_features = ['orig_destination_distance',
                               'srch_adults_cnt',
                               'srch_children_cnt',
                               'srch_rm_cnt',
                               'cnt']

    for categorical_feature in list(train_small.columns):
        if categorical_feature not in non_categorial_features:
            train_small[categorical_feature] = train_small[categorical_feature].astype('category')

3. I use one-hot encoding:

    train_small_with_dummies = pd.get_dummies(train_small, sparse=True)

The problem is that the 3rd part often gets stuck, although I am using a strong machine.

Thus, without the one-hot encoding I can't do any feature selection (for determining the importance of the features).

What do you recommend?

User · Answer

Try this:

    pip install category_encoders

    import category_encoders as ce

    categorical_columns = [...]  # the list of names of the columns you want to one-hot-encode

    encoder = ce.OneHotEncoder(cols=categorical_columns, use_cat_names=True)
    df_train_encoded = encoder.fit_transform(df_train_small)

    df_train_encoded.head()

The resulting dataframe df_train_encoded is the same as the original, but the categorical features are now replaced with their one-hot-encoded versions.

More information is available in the category_encoders documentation.

User · Answer

To add to the other answers, here is how I did it with a Python 2.0 function using NumPy:

    import numpy as np

    def one_hot(y_):
        # Function to encode output labels from number indexes
        # e.g.: [[5], [0], [3]] --> [[0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0]]

        y_ = y_.reshape(len(y_))
        n_values = int(np.max(y_)) + 1  # number of classes, assuming labels run from 0 to max
        return np.eye(n_values)[np.array(y_, dtype=np.int32)]  # Returns FLOATS

The line n_values = int(np.max(y_)) + 1 could be hard-coded to the number of output neurons, for example when you use mini-batches that may not contain every class.

Demo project/tutorial where this function has been used: https://github.com/guillaume-chevalier/LSTM-Human-Activity-Recognition

User · Answer

Much easier to use Pandas for basic one-hot encoding. If you're looking for more options you can use scikit-learn.

For basic one-hot encoding with Pandas you pass your data frame into the get_dummies function.

For example, if I have a dataframe called imdb_movies and I want to one-hot encode the Rated column, I do this:

    pd.get_dummies(imdb_movies.Rated)

This returns a new dataframe with a column for every "level" of rating that exists, along with either a 1 or 0 (True or False in recent pandas versions) specifying the presence of that rating for a given observation.

Usually, we want this to be part of the original dataframe. In this case, we attach our new dummy-coded frame onto the original frame using "column-binding".

We can column-bind by using the Pandas concat function:

    rated_dummies = pd.get_dummies(imdb_movies.Rated)
    pd.concat([imdb_movies, rated_dummies], axis=1)

We can now run an analysis on our full dataframe.

SIMPLE UTILITY FUNCTION

I would recommend making yourself a utility function to do this quickly:

    def encode_and_bind(original_dataframe, feature_to_encode):
        dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
        res = pd.concat([original_dataframe, dummies], axis=1)
        return res

Usage:

    encode_and_bind(imdb_movies, 'Rated')

Also, as per @pmalbu's comment, if you would like the function to remove the original feature_to_encode then use this version:

    def encode_and_bind(original_dataframe, feature_to_encode):
        dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
        res = pd.concat([original_dataframe, dummies], axis=1)
        res = res.drop([feature_to_encode], axis=1)
        return res

You can encode multiple features at the same time as follows:

    features_to_encode = ['feature_1', 'feature_2', 'feature_3',
                          'feature_4']
    res = train_set
    for feature in features_to_encode:
        # feed the previous result back in so earlier encodings are kept
        res = encode_and_bind(res, feature)
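The per-feature loop can also be collapsed into one get_dummies call by passing the columns argument. The following is a minimal sketch on a made-up frame; the frame and its column names are assumptions for illustration, not the imdb_movies data from the answer.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
movies = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Rated": ["PG", "R", "PG"],
    "Genre": ["Drama", "Comedy", "Drama"],
})

# Encode both columns in one call. The original columns are dropped and
# the new ones are named "<column>_<level>"; dtype=int forces 0/1 output.
encoded = pd.get_dummies(movies, columns=["Rated", "Genre"], dtype=int)

print(list(encoded.columns))
```

Columns not listed in columns (here Title) pass through unchanged, so no separate concat or drop step is needed.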

User · Answer

You can use the numpy.eye function.

    import numpy as np

    def one_hot_encode(x, n_classes):
        """
        One hot encode a list of sample labels. Return a one-hot encoded vector for each label.
        : x: List of sample Labels
        : return: Numpy array of one-hot encoded labels
        """
        return np.eye(n_classes)[x]

    def main():
        labels = [0, 1, 2, 3, 4, 3, 2, 1, 0]
        n_classes = 5
        one_hot_list = one_hot_encode(labels, n_classes)
        print(one_hot_list)

    if __name__ == "__main__":
        main()

Result:

    D:\Desktop>python test.py
    [[ 1.  0.  0.  0.  0.]
     [ 0.  1.  0.  0.  0.]
     [ 0.  0.  1.  0.  0.]
     [ 0.  0.  0.  1.  0.]
     [ 0.  0.  0.  0.  1.]
     [ 0.  0.  0.  1.  0.]
     [ 0.  0.  1.  0.  0.]
     [ 0.  1.  0.  0.  0.]
     [ 1.  0.  0.  0.  0.]]

User · Answer

Approach 1: You can use pandas' pd.get_dummies.

Example 1:

    import pandas as pd
    s = pd.Series(list('abca'))
    pd.get_dummies(s)
    Out[]:
         a    b    c
    0  1.0  0.0  0.0
    1  0.0  1.0  0.0
    2  0.0  0.0  1.0
    3  1.0  0.0  0.0

Example 2:

The following will transform a given column into one-hot columns. Use prefix to have multiple dummies.

    import pandas as pd

    df = pd.DataFrame({
        'A': ['a', 'b', 'a'],
        'B': ['b', 'a', 'c']
    })
    df
    Out[]:
       A  B
    0  a  b
    1  b  a
    2  a  c

    # Get one hot encoding of column B
    one_hot = pd.get_dummies(df['B'])
    # Drop column B as it is now encoded
    df = df.drop('B', axis=1)
    # Join the encoded df
    df = df.join(one_hot)
    df
    Out[]:
       A  a  b  c
    0  a  0  1  0
    1  b  1  0  0
    2  a  0  0  1

Approach 2: Use Scikit-learn

Using a OneHotEncoder has the advantage of being able to fit on some training data and then transform some other data using the same instance. We also have handle_unknown to further control what the encoder does with unseen data.

Given a dataset with three features and four samples, we let the encoder find the maximum value per feature and transform the data to a binary one-hot encoding. (The session below comes from an older scikit-learn release; the n_values_ and feature_indices_ attributes were removed in later versions, which expose categories_ instead.)

    >>> from sklearn.preprocessing import OneHotEncoder
    >>> enc = OneHotEncoder()
    >>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
    OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True)
    >>> enc.n_values_
    array([2, 3, 4])
    >>> enc.feature_indices_
    array([0, 2, 5, 9], dtype=int32)
    >>> enc.transform([[0, 1, 1]]).toarray()
    array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])

Here is the link for this example: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
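To illustrate the fit-once, transform-later point and the handle_unknown option, here is a minimal sketch using only calls available in current scikit-learn releases. The data is made up for illustration, and handle_unknown="ignore" is one possible setting, not necessarily what the answer's author used.

```python
from sklearn.preprocessing import OneHotEncoder

# Toy data, made up for illustration: one categorical feature per column.
X_train = [["red", "S"], ["blue", "M"], ["red", "L"]]
X_new = [["blue", "S"], ["green", "M"]]  # "green" was never seen in training

# handle_unknown="ignore" encodes unseen categories as all zeros
# instead of raising an error at transform time.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(X_train)

# The default output is a sparse matrix, so convert it for display.
encoded = enc.transform(X_new).toarray()

print(enc.categories_)
print(encoded)
```

The first two output columns belong to the colour feature and the last three to the size feature. The row containing "green" has zeros in both colour columns, because that level was not present when the encoder was fitted.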

User · Answer

It can and it should be as easy as:

    class OneHotEncoder:
        def __init__(self, optionKeys):
            length = len(optionKeys)
            self.__dict__ = {optionKeys[j]: [0 if i != j else 1 for i in range(length)] for j in range(length)}

Usage:

    ohe = OneHotEncoder(["A", "B", "C", "D"])
    print(ohe.A)
    print(ohe.D)

User · Answer

I know I'm late to this party, but the simplest way to one-hot encode a dataframe in an automated way is to use this function:

    def hot_encode(df):
        obj_df = df.select_dtypes(include=['object'])
        return pd.get_dummies(df, columns=obj_df.columns).values

User · Answer

One-hot encoding with pandas is very easy:

    def one_hot(df, cols):
        """
        @param df pandas DataFrame
        @param cols a list of columns to encode
        @return a DataFrame with one-hot encoding
        """
        for each in cols:
            dummies = pd.get_dummies(df[each], prefix=each, drop_first=False)
            df = pd.concat([df, dummies], axis=1)
        return df

EDIT:

Another way to one-hot encode, using sklearn's LabelBinarizer:

    from sklearn.preprocessing import LabelBinarizer
    label_binarizer = LabelBinarizer()
    label_binarizer.fit(all_your_labels_list)  # need to be global or remembered to use it later

    def one_hot_encode(x):
        """
        One hot encode a list of sample labels. Return a one-hot encoded vector for each label.
        : x: List of sample Labels
        : return: Numpy array of one-hot encoded labels
        """
        return label_binarizer.transform(x)

User · Answer

Expanding @Martin Thoma's answer:

    def one_hot_encode(y):
        """Convert an iterable of indices to one-hot encoded labels."""
        y = y.flatten()  # Sometimes a non-flattened vector is passed, e.g. (118, 1); in these cases
        # the function ends up creating a tensor, e.g. (118, 2, 1). flatten removes this issue.
        nb_classes = len(np.unique(y))  # get the number of unique classes
        standardised_labels = dict(zip(np.unique(y), np.arange(nb_classes)))  # get the class labels as a dictionary
        # which then is standardised. E.g. imagine the class labels are (4, 7, 9): if a vector y containing 4, 7 and 9 is
        # directly passed, then np.eye(nb_classes)[4] (or 7, 9) throws an out-of-index error.
        # standardised_labels fixes this issue by returning a dictionary:
        # standardised_labels = {4: 0, 7: 1, 9: 2}. The values of the dictionary are mapped to keys in the y array.
        # standardised_labels also removes the error that is raised if the labels are floats, e.g. 1.0; an element
        # cannot be selected by a float index, e.g. y[1.0] throws an index error.
        targets = np.vectorize(standardised_labels.get)(y)  # map the dictionary values to the array
        return np.eye(nb_classes)[targets]
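The label standardisation described in the comments above can also be obtained from np.unique itself, which returns the position of each label among the sorted unique values when return_inverse=True. A minimal sketch follows; the helper name one_hot_from_labels is hypothetical and not part of the answer.

```python
import numpy as np

def one_hot_from_labels(y):
    """One-hot encode arbitrary labels (hypothetical helper name)."""
    y = np.asarray(y).ravel()  # flatten, as in the answer above
    # classes: sorted unique labels; codes: index of each label in classes
    classes, codes = np.unique(y, return_inverse=True)
    return classes, np.eye(len(classes), dtype=int)[codes]

# Float labels that are not consecutive integers still work.
classes, encoded = one_hot_from_labels([4.0, 7.0, 9.0, 7.0])
print(classes)
print(encoded)
```

Returning classes alongside the matrix keeps the column-to-label mapping available for decoding later.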

User · Answer

You can pass the data to the CatBoost classifier without encoding. CatBoost handles categorical variables itself by performing one-hot and target-expanding mean encoding.

User · Answer

pandas has an inbuilt function, get_dummies, to get the one-hot encoding of particular column(s).

One-line code for one-hot encoding:

    df = pd.concat([df, pd.get_dummies(df['column name'], prefix='column name')], axis=1).drop(['column name'], axis=1)

User · Answer

Here I tried this approach:

    import numpy as np

    # converting to one hot
    def one_hot_encoder(value, datal):
        datal[value] = 1
        return datal

    def _one_hot_values(labels_data):
        encoded = [0] * len(labels_data)

        for j, i in enumerate(labels_data):
            max_value = [0] * (np.max(labels_data) + 1)
            encoded[j] = one_hot_encoder(i, max_value)

        return np.array(encoded)

User · Answer

You can do it with numpy.eye and the array element selection mechanism:

    import numpy as np
    nb_classes = 6
    data = [[2, 3, 4, 0]]

    def indices_to_one_hot(data, nb_classes):
        """Convert an iterable of indices to one-hot encoded labels."""
        targets = np.array(data).reshape(-1)
        return np.eye(nb_classes)[targets]

The return value of indices_to_one_hot(data, nb_classes) is now

    array([[ 0.,  0.,  1.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  1.,  0.,  0.],
           [ 0.,  0.,  0.,  0.,  1.,  0.],
           [ 1.,  0.,  0.,  0.,  0.,  0.]])

The .reshape(-1) is there to make sure you have the right labels format (you might also have [[2], [3], [4], [0]]).

User · Answer

Firstly, the easiest way to one-hot encode: use Sklearn.

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

Secondly, I don't think using pandas to one-hot encode is that simple (unconfirmed though). See "Creating dummy variables in pandas for python".

Lastly, is it necessary for you to one-hot encode? One-hot encoding adds a column for every level of every categorical feature, which can drastically increase the number of features and with it the run time of any classifier or anything else you are going to run, especially when each categorical feature has many levels. Instead you can do what I call dummy coding here: replacing each level with an integer code, as LabelEncoder does.

Using this encoding usually works well, for much less run time and complexity. A wise prof once told me, "Less is More".

Here's the code for my custom encoding function if you want:

    from sklearn.preprocessing import LabelEncoder

    # Auto encodes any dataframe column of type category or object.
    def dummyEncode(df):
        columnsToEncode = list(df.select_dtypes(include=['category', 'object']))
        le = LabelEncoder()
        for feature in columnsToEncode:
            try:
                df[feature] = le.fit_transform(df[feature])
            except Exception:
                print('Error encoding ' + feature)
        return df

EDIT: Comparison to be clearer:

One-hot encoding: convert n levels to n-1 columns (one level, here dog, is the all-zeros reference; keeping a column for every level gives n columns).

    Index  Animal         Index  cat  mouse
      1     dog             1     0     0
      2     cat       -->   2     1     0
      3    mouse            3     0     1

You can see how this will explode your memory if you have many different types (or levels) in your categorical feature. Keep in mind, this is just ONE column.

Dummy Coding:

    Index  Animal         Index  Animal
      1     dog             1      0
      2     cat       -->   2      1
      3    mouse            3      2

Convert to numerical representations instead. This greatly saves feature space, at the cost of a bit of accuracy, since the integer codes impose an arbitrary ordering on the levels.
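The difference in feature-space size between the encodings compared above can be checked directly. This is a small sketch using pandas on made-up data; it uses pd.factorize for the integer codes rather than the answer's LabelEncoder, which is a substitution for brevity.

```python
import pandas as pd

animals = pd.Series(["dog", "cat", "mouse", "cat"], name="Animal")

# One-hot: one column per level (3 levels -> 3 columns).
one_hot = pd.get_dummies(animals, dtype=int)

# Reference-level variant: drop the first level (3 levels -> 2 columns).
reference = pd.get_dummies(animals, drop_first=True, dtype=int)

# Integer codes: a single column, whatever the number of levels.
codes, levels = pd.factorize(animals, sort=True)

print(one_hot.shape, reference.shape, codes)
```

The column count grows with the number of levels for both one-hot variants, while the integer codes always occupy one column. Tree-based models often tolerate integer codes, whereas linear models may read the codes as an ordered quantity.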

User · Answer

Short Answer

Here is a function to do one-hot encoding without using numpy, pandas, or other packages. It takes a list of integers, booleans, or strings (and perhaps other types too).

    import typing

    def one_hot_encode(items: list) -> typing.List[list]:
        results = []
        # find the unique items (we want unique items b/c duplicate items will have the same encoding)
        unique_items = list(set(items))
        # sort the unique items
        sorted_items = sorted(unique_items)
        # find how long the list of each item should be
        max_index = len(unique_items)

        for item in items:
            # create a list of zeros the appropriate length
            one_hot_encoded_result = [0 for i in range(0, max_index)]
            # find the index of the item
            one_hot_index = sorted_items.index(item)
            # change the zero at the index from the previous line to a one
            one_hot_encoded_result[one_hot_index] = 1
            # add the result
            results.append(one_hot_encoded_result)

        return results

Example:

    one_hot_encode([2, 1, 1, 2, 5, 3])

    # [[0, 1, 0, 0],
    #  [1, 0, 0, 0],
    #  [1, 0, 0, 0],
    #  [0, 1, 0, 0],
    #  [0, 0, 0, 1],
    #  [0, 0, 1, 0]]

    one_hot_encode([True, False, True])

    # [[0, 1], [1, 0], [0, 1]]

    one_hot_encode(['a', 'b', 'c', 'a', 'e'])

    # [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1]]

Long(er) Answer

I know there are already a lot of answers to this question, but I noticed two things. First, most of the answers use packages like numpy and/or pandas. And this is a good thing. If you are writing production code, you should probably be using robust, fast algorithms like those provided in the numpy/pandas packages. But, for the sake of education, I think someone should provide an answer which has a transparent algorithm and not just an implementation of someone else's algorithm. Second, I noticed that many of the answers do not provide a robust implementation of one-hot encoding because they do not meet one of the requirements below. Below are some of the requirements (as I see them) for a useful, accurate, and robust one-hot encoding function.

A one-hot encoding function must:

- handle a list of various types (e.g. integers, strings, floats, etc.) as input
- handle an input list with duplicates
- return a list of lists corresponding (in the same order as) to the inputs
- return a list of lists where each list is as short as possible

I tested many of the answers to this question and most of them fail on one of the requirements above.

User · Answer

One-hot encoding requires a bit more than converting the values to indicator variables. Typically the ML process requires you to apply this coding several times, to validation or test data sets, and to apply the model you construct to real-time observed data. You should store the mapping (transform) that was used to construct the model. A good solution would use the DictVectorizer or LabelEncoder (followed by get_dummies). Here is a function that you can use:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    def oneHotEncode2(df, le_dict=None):
        # A None default avoids sharing one mutable dict across calls.
        if not le_dict:
            le_dict = {}
            columnsToEncode = list(df.select_dtypes(include=['category', 'object']))
            train = True
        else:
            columnsToEncode = list(le_dict.keys())
            train = False

        for feature in columnsToEncode:
            if train:
                le_dict[feature] = LabelEncoder()
            try:
                if train:
                    df[feature] = le_dict[feature].fit_transform(df[feature])
                else:
                    df[feature] = le_dict[feature].transform(df[feature])

                df = pd.concat([df,
                                pd.get_dummies(df[feature]).rename(columns=lambda x: feature + '_' + str(x))], axis=1)
                df = df.drop(feature, axis=1)
            except Exception:
                print('Error encoding ' + feature)
                df[feature] = pd.to_numeric(df[feature], errors='coerce')
        return (df, le_dict)

This works on a pandas dataframe, and for each column of the dataframe it creates and returns a mapping. So you would call it like this:

    train_data, le_dict = oneHotEncode2(train_data)

Then on the test data, the call is made by passing the dictionary returned from training:

    test_data, _ = oneHotEncode2(test_data, le_dict)

An equivalent method is to use DictVectorizer. A related post on the same topic is on my blog. I mention it here since it provides some reasoning behind this approach over simply using get_dummies (disclosure: this is my own blog).
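A lighter-weight way to reuse the training-time mapping with plain pandas is to remember the encoded training columns and reindex later data onto them. This is a sketch of a different technique from the function above, shown on made-up frames; the column name "city" is hypothetical.

```python
import pandas as pd

# Made-up train/test frames; "city" is a hypothetical categorical column.
train = pd.DataFrame({"city": ["Oslo", "Rome", "Oslo"], "x": [1, 2, 3]})
test = pd.DataFrame({"city": ["Rome", "Lima"], "x": [4, 5]})  # "Lima" is unseen

train_encoded = pd.get_dummies(train, columns=["city"], dtype=int)

# The stored "mapping" here is simply the training column layout.
train_columns = train_encoded.columns

# Encode the test set, then force it onto the training layout:
# levels missing from the test set become all-zero columns,
# and levels unseen during training are dropped.
test_encoded = (
    pd.get_dummies(test, columns=["city"], dtype=int)
    .reindex(columns=train_columns, fill_value=0)
)

print(test_encoded)
```

With this layout the model always receives the same columns in the same order, which is the property the stored mapping is meant to guarantee.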

User · Answer

I used this in my acoustic model; it will probably help in your model too.

    def one_hot_encoding(x, n_out):
        x = x.astype(int)
        shape = x.shape
        x = x.flatten()
        N = len(x)
        x_categ = np.zeros((N, n_out))
        x_categ[np.arange(N), x] = 1
        return x_categ.reshape((shape) + (n_out,))

User · Answer

Here is a solution using DictVectorizer and the Pandas DataFrame.to_dict('records') method.

    >>> import pandas as pd
    >>> X = pd.DataFrame({'income': [100000, 110000, 90000, 30000, 14000, 50000],
    ...                   'country': ['US', 'CAN', 'US', 'CAN', 'MEX', 'US'],
    ...                   'race': ['White', 'Black', 'Latino', 'White', 'White', 'Black']
    ...                   })

    >>> from sklearn.feature_extraction import DictVectorizer
    >>> v = DictVectorizer()
    >>> qualitative_features = ['country', 'race']
    >>> X_qual = v.fit_transform(X[qualitative_features].to_dict('records'))
    >>> v.vocabulary_
    {'country=CAN': 0,
     'country=MEX': 1,
     'country=US': 2,
     'race=Black': 3,
     'race=Latino': 4,
     'race=White': 5}

    >>> X_qual.toarray()
    array([[ 0.,  0.,  1.,  0.,  0.,  1.],
           [ 1.,  0.,  0.,  1.,  0.,  0.],
           [ 0.,  0.,  1.,  0.,  1.,  0.],
           [ 1.,  0.,  0.,  0.,  0.,  1.],
           [ 0.,  1.,  0.,  0.,  0.,  1.],
           [ 0.,  0.,  1.,  1.,  0.,  0.]])

User · Answer

You can do the following as well. Note that for the below you don't have to use pd.concat.

    import pandas as pd
    # initialise data of lists.
    data = {'Color': ['Red', 'Yellow', 'Red', 'Yellow'], 'Length': [20.1, 21.1, 19.1, 18.1],
            'Group': [1, 2, 1, 2]}

    # Create DataFrame
    df = pd.DataFrame(data)

    for _c in df.select_dtypes(include=['object']).columns:
        print(_c)
        df[_c] = pd.Categorical(df[_c])
    df_transformed = pd.get_dummies(df)
    df_transformed

You can also change explicit columns to categorical. For example, here I am changing Color and Group:

    import pandas as pd
    # initialise data of lists.
    data = {'Color': ['Red', 'Yellow', 'Red', 'Yellow'], 'Length': [20.1, 21.1, 19.1, 18.1],
            'Group': [1, 2, 1, 2]}

    # Create DataFrame
    df = pd.DataFrame(data)
    columns_to_change = list(df.select_dtypes(include=['object']).columns)
    columns_to_change.append('Group')
    for _c in columns_to_change:
        print(_c)
        df[_c] = pd.Categorical(df[_c])
    df_transformed = pd.get_dummies(df)
    df_transformed

User · Answer

This works for me:

    pandas.factorize(['B', 'C', 'D', 'B'])[0]

Output:

    [0, 1, 2, 0]
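Note that factorize returns integer codes rather than one-hot vectors. If one-hot rows are needed, the codes can index into an identity matrix, as other answers in this thread do with np.eye. A minimal sketch, with variable names chosen here for illustration:

```python
import numpy as np
import pandas as pd

labels = ["B", "C", "D", "B"]

# codes: integer index per label; uniques: labels in order of first appearance
codes, uniques = pd.factorize(labels)

# Row i of the identity matrix is the one-hot vector for code i.
one_hot = np.eye(len(uniques), dtype=int)[codes]

print(codes)
print(one_hot)
```

Keeping uniques around lets you map a column position back to its original label.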
