Creating dummy variables in pandas for python

Question

I'm trying to create a series of dummy variables from a categorical variable using pandas in Python. I've come across the get_dummies function, but whenever I try to call it I receive an error that the name is not defined.

Any thoughts or other ways to create the dummy variables would be appreciated.

EDIT: Since others seem to be coming across this, the get_dummies function in pandas now works perfectly fine. This means the following should work:

import pandas as pd
dummies = pd.get_dummies(df['Category'])

See http://blog.yhathq.com/posts/logistic-regression-and-python.html for further information.

User · Answer

Handling categorical features

scikit-learn expects all features to be numeric. So how do we include a categorical feature in our model?

Ordered categories: transform them to sensible numeric values (example: small=1, medium=2, large=3)
Unordered categories: use dummy encoding (0/1)

What are the categorical features in our dataset?

Ordered categories: weather (already encoded with sensible numeric values)
Unordered categories: season (needs dummy encoding), holiday (already dummy encoded), workingday (already dummy encoded)

For season, we can't simply leave the encoding as 1 = spring, 2 = summer, 3 = fall, and 4 = winter, because that would imply an ordered relationship. Instead, we create multiple dummy variables:

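import pandas as pd

# A utility function to create dummy variables for one column
def create_dummies(df, colname):
    # One 0/1 column per category value, named <colname>_<value>
    col_dummies = pd.get_dummies(df[colname], prefix=colname)
    # Drop the first dummy column; it becomes the baseline category
    col_dummies.drop(col_dummies.columns[0], axis=1, inplace=True)
    df = pd.concat([df, col_dummies], axis=1)
    # Remove the original categorical column
    df.drop(colname, axis=1, inplace=True)
    return df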
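A short usage sketch of the two cases described above, on a made-up DataFrame (the `size` column and all values are assumptions for illustration, not the dataset from the answer). The ordered category is mapped to integers with `Series.map`; the unordered one goes through `pd.get_dummies` with `drop_first=True`, which drops the first dummy column much as the helper above does by hand.

```python
import pandas as pd

# Toy data; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "season": [1, 2, 3, 4],  # 1 = spring, 2 = summer, 3 = fall, 4 = winter
})

# Ordered category: map to integers that preserve the ordering.
size_order = {"small": 1, "medium": 2, "large": 3}
df["size"] = df["size"].map(size_order)

# Unordered category: dummy-encode and drop the first level as the baseline.
season_dummies = pd.get_dummies(df["season"], prefix="season", drop_first=True)
encoded = pd.concat([df.drop(columns="season"), season_dummies], axis=1)

print(encoded.columns.tolist())
# ['size', 'season_2', 'season_3', 'season_4']
```

Spring (season 1) has no column of its own here: a row with zeros in all three season columns is a spring row.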

User · Answer

So I was actually needing an answer to this question today (7/25/2013), so I wrote this earlier. I've tested it with some toy examples; hopefully you get some mileage out of it.

def categorize_dict(x, y=0):
    # x requires string or numerical input
    # y is a boolean that specifies whether to return category names along with the dict
    # default is no
    cats = list(set(x))
    n = len(cats)
    m = len(x)
    outs = {}
    for i in cats:
        outs[i] = [0] * m
    for i in range(len(x)):
        outs[x[i]][i] = 1
    if y:
        return outs, cats
    return outs
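A quick check of the function on a toy list (the input values are made up). The function is repeated here so that the snippet runs by itself. Since `set` ordering is not guaranteed, the category list is sorted before printing.

```python
def categorize_dict(x, y=0):
    """Return {category: 0/1 indicator list}; also the categories if y is truthy."""
    cats = list(set(x))
    m = len(x)
    outs = {}
    for i in cats:
        outs[i] = [0] * m
    for i in range(len(x)):
        outs[x[i]][i] = 1
    if y:
        return outs, cats
    return outs


colors = ["red", "green", "red", "blue"]  # made-up input
indicators, categories = categorize_dict(colors, y=1)

print(sorted(categories))  # ['blue', 'green', 'red']
print(indicators["red"])   # [1, 0, 1, 0]
```

Passing the returned dict to `pd.DataFrame` gives one indicator column per category.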

User · Answer

You can create dummy variables to handle the categorical data:

# Creating dummy variables for categorical datatypes
trainDfDummies = pd.get_dummies(trainDf, columns=['Col1', 'Col2', 'Col3', 'Col4'])

This returns a new dataframe, trainDfDummies, in which the original columns are dropped and the dummy columns are appended at the end. trainDf itself is not modified.

The dummy column names are created automatically by appending each category value to the original column name, separated by an underscore (Col1_<value>, and so on).
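A minimal sketch of that behaviour on a made-up frame (the `Id` column and the values are assumptions; only two of the four columns are used to keep it short):

```python
import pandas as pd

# Made-up frame; 'Col1' and 'Col2' stand in for the categorical columns.
trainDf = pd.DataFrame({
    "Id": [1, 2, 3],
    "Col1": ["a", "b", "a"],
    "Col2": ["x", "x", "y"],
})

trainDfDummies = pd.get_dummies(trainDf, columns=["Col1", "Col2"])

print(trainDfDummies.columns.tolist())
# ['Id', 'Col1_a', 'Col1_b', 'Col2_x', 'Col2_y']

print(trainDf.columns.tolist())
# ['Id', 'Col1', 'Col2']  (the input frame is left as it was)
```

Columns not listed in `columns=` (here `Id`) pass through untouched and keep their position at the front.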

User · Answer

For my case, dmatrices in patsy solved my problem. Actually, this function is designed for the generation of dependent and independent variables from a given DataFrame with an R-style formula string. But it can be used for the generation of dummy features from the categorical features. All you need to do is drop the column 'Intercept' that is generated by dmatrices automatically, regardless of your original DataFrame.

import pandas as pd
from patsy import dmatrices

df_original = pd.DataFrame({
    'A': ['red', 'green', 'red', 'green'],
    'B': ['car', 'car', 'truck', 'truck'],
    'C': [10, 11, 12, 13],
    'D': ['alice', 'bob', 'charlie', 'alice']},
    index=[0, 1, 2, 3])

_, df_dummyfied = dmatrices('A ~ A + B + C + D', data=df_original, return_type='dataframe')
df_dummyfied = df_dummyfied.drop('Intercept', axis=1)

df_dummyfied.columns
Index([u'A[T.red]', u'B[T.truck]', u'D[T.bob]', u'D[T.charlie]', u'C'], dtype='object')

df_dummyfied
   A[T.red]  B[T.truck]  D[T.bob]  D[T.charlie]     C
0       1.0         0.0       0.0           0.0  10.0
1       0.0         0.0       1.0           0.0  11.0
2       1.0         1.0       0.0           1.0  12.0
3       0.0         1.0       0.0           0.0  13.0

User · Answer

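The following code returns a dataframe with the 'Category' column replaced by dummy (indicator) columns, one per category value. The default separator already adds an underscore after the prefix, so the new columns are named Category_<value>:

df_with_dummies = pd.get_dummies(df, prefix='Category', columns=['Category'])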

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html

User · Answer

I created a dummy variable for every state using this code.

def create_dummy_column(series, f):
    return series.apply(f)

for el in df.area_title.unique():
    col_name = el.split()[0] + "_dummy"
    f = lambda x: int(x==el)
    df[col_name] = create_dummy_column(df.area_title, f)
df.head()

More generally, I would just use .apply and pass it an anonymous function with the condition that defines your category.
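As a sketch of that more general idea, the example below builds a dummy from an arbitrary condition rather than from equality with a category value. The `population` column, its values, and the threshold are assumptions made up for illustration. The vectorized comparison at the end gives the same result as `.apply` and is usually faster on large frames.

```python
import pandas as pd

# Made-up data; only 'area_title' is a name taken from the answer.
df = pd.DataFrame({
    "area_title": ["Alabama", "Alaska", "Arizona"],
    "population": [5_000_000, 730_000, 7_200_000],
})

threshold = 1_000_000  # hypothetical cutoff

# Dummy via .apply with an anonymous function.
df["large_dummy"] = df["population"].apply(lambda p: int(p > threshold))

# Equivalent vectorized form: compare the whole column, then cast to int.
df["large_dummy_vec"] = (df["population"] > threshold).astype(int)

print(df[["area_title", "large_dummy", "large_dummy_vec"]])
```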

(Thank you to @prpl.mnky.dshwshr for the .unique() insight)

User · Answer

When I think of dummy variables I think of using them in the context of OLS regression, and I would do something like this:

import numpy as np
import pandas as pd
import statsmodels.api as sm

my_data = np.array([[5, 'a', 1],
                    [3, 'b', 3],
                    [1, 'b', 2],
                    [3, 'a', 1],
                    [4, 'b', 2],
                    [7, 'c', 1],
                    [7, 'c', 1]])

df = pd.DataFrame(data=my_data, columns=['y', 'dummy', 'x'])
just_dummies = pd.get_dummies(df['dummy'])

step_1 = pd.concat([df, just_dummies], axis=1)
step_1.drop(['dummy', 'c'], inplace=True, axis=1)
# to run the regression we want to get rid of the strings 'a', 'b', 'c' (obviously)
# and we want to get rid of one dummy variable to avoid the dummy variable trap
# arbitrarily chose 'c'; coefficients on 'a' and 'b' show the effect of 'a' and 'b'
# relative to 'c'

# np.array stored every value as a string, so convert all columns to integers
# (astype(int) replaces the removed np.int alias and the deprecated applymap)
step_1 = step_1.astype(int)

result = sm.OLS(step_1['y'], sm.add_constant(step_1[['x', 'a', 'b']])).fit()
print(result.summary())
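The dummy variable trap mentioned in this answer can be seen directly: once a constant column is present, the full set of dummies is linearly dependent, so the design matrix loses rank. A minimal sketch using only NumPy and pandas (the values mirror the `dummy` column of this answer; the variable names `X_full` and `X_reduced` are mine):

```python
import numpy as np
import pandas as pd

dummy = pd.Series(["a", "b", "b", "a", "b", "c", "c"], name="dummy")
all_dummies = pd.get_dummies(dummy).astype(float)

# Constant plus ALL three dummies: columns a + b + c add up to the constant.
X_full = np.column_stack([np.ones(len(dummy)), all_dummies.to_numpy()])

# Constant plus dummies with 'c' dropped as the reference category.
X_reduced = np.column_stack(
    [np.ones(len(dummy)), all_dummies.drop(columns="c").to_numpy()]
)

print(X_full.shape[1], np.linalg.matrix_rank(X_full))        # 4 columns, rank 3
print(X_reduced.shape[1], np.linalg.matrix_rank(X_reduced))  # 3 columns, rank 3
```

With four columns but rank three, the coefficients of the full design are not uniquely determined; dropping one category restores full column rank.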

User · Answer

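Based on the official documentation:

dummies = pd.get_dummies(df['Category']).rename(columns=lambda x: 'Category_' + str(x))
df = pd.concat([df, dummies], axis=1)
# drop returns a new DataFrame; combining the assignment with inplace=True would set df to None
df = df.drop(['Category'], axis=1)

There is also a nice post in the FastML blog.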

User · Answer

It's hard to infer what you're looking for from the question, but my best guess is as follows.

If we assume you have a DataFrame where some column is 'Category' and contains integers (or otherwise unique identifiers) for categories, then we can do the following.

Call the DataFrame dfrm, and assume that for each row, dfrm['Category'] is some value in the set of integers from 1 to N. Then,

for elem in dfrm['Category'].unique():
    dfrm[str(elem)] = dfrm['Category'] == elem

Now there will be a new indicator column for each category that is True/False depending on whether the data in that row are in that category.

If you want to control the category names, you could make a dictionary, such as

cat_names = {1: 'Some_Treatment', 2: 'Full_Treatment', 3: 'Control'}
for elem in dfrm['Category'].unique():
    dfrm[cat_names[elem]] = dfrm['Category'] == elem

to result in having columns with specified names, rather than just string conversion of the category values. In fact, for some types, str() may not produce anything useful for you.
