[python] pandas create new column based on values from other columns / apply a function of multiple columns, row-wise

I want to apply my custom function (it uses an if-else ladder) to these six columns (ERI_Hispanic, ERI_AmerInd_AKNatv, ERI_Asian, ERI_Black_Afr.Amer, ERI_HI_PacIsl, ERI_White) in each row of my dataframe.

I've tried different methods from other questions but still can't seem to find the right answer for my problem. The critical piece of this is that if the person is counted as Hispanic they can't be counted as anything else. Even if they have a "1" in another ethnicity column they still are counted as Hispanic not two or more races. Similarly, if the sum of all the ERI columns is greater than 1 they are counted as two or more races and can't be counted as a unique ethnicity(except for Hispanic). Hopefully this makes sense. Any help will be greatly appreciated.

Its almost like doing a for loop through each row and if each record meets a criterion they are added to one list and eliminated from the original.

From the dataframe below I need to calculate a new column based on the following spec in SQL:

========================= CRITERIA ===============================

IF [ERI_Hispanic] = 1 THEN RETURN “Hispanic”
ELSE IF SUM([ERI_AmerInd_AKNatv] + [ERI_Asian] + [ERI_Black_Afr.Amer] + [ERI_HI_PacIsl] + [ERI_White]) > 1 THEN RETURN “Two or More”
ELSE IF [ERI_AmerInd_AKNatv] = 1 THEN RETURN “A/I AK Native”
ELSE IF [ERI_Asian] = 1 THEN RETURN “Asian”
ELSE IF [ERI_Black_Afr.Amer] = 1 THEN RETURN “Black/AA”
ELSE IF [ERI_HI_PacIsl] = 1 THEN RETURN “Haw/Pac Isl.”
ELSE IF [ERI_White] = 1 THEN RETURN “White”

Comment: If the ERI Flag for Hispanic is True (1), the employee is classified as “Hispanic”

Comment: If more than 1 non-Hispanic ERI Flag is true, return “Two or More”

====================== DATAFRAME ===========================

     lname          fname       rno_cd  eri_afr_amer    eri_asian   eri_hawaiian    eri_hispanic    eri_nat_amer    eri_white   rno_defined
0    MOST           JEFF        E       0               0           0               0               0               1           White
1    CRUISE         TOM         E       0               0           0               1               0               0           White
2    DEPP           JOHNNY              0               0           0               0               0               1           Unknown
3    DICAP          LEO                 0               0           0               0               0               1           Unknown
4    BRANDO         MARLON      E       0               0           0               0               0               0           White
5    HANKS          TOM         0                       0           0               0               0               1           Unknown
6    DENIRO         ROBERT      E       0               1           0               0               0               1           White
7    PACINO         AL          E       0               0           0               0               0               1           White
8    WILLIAMS       ROBIN       E       0               0           1               0               0               0           White
9    EASTWOOD       CLINT       E       0               0           0               0               0               1           White

This question is related to python pandas numpy apply

The answer is


The answers above are perfectly valid, but a vectorized solution exists, in the form of numpy.select. This allows you to define conditions, then define outputs for those conditions, much more efficiently than using apply:


First, define conditions:

conditions = [
    df['eri_hispanic'] == 1,
    df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
    df['eri_nat_amer'] == 1,
    df['eri_asian'] == 1,
    df['eri_afr_amer'] == 1,
    df['eri_hawaiian'] == 1,
    df['eri_white'] == 1,
]

Now, define the corresponding outputs:

outputs = [
    'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
]

Finally, using numpy.select:

res = np.select(conditions, outputs, 'Other')
pd.Series(res)

0           White
1        Hispanic
2           White
3           White
4           Other
5           White
6     Two Or More
7           White
8    Haw/Pac Isl.
9           White
dtype: object

Why should numpy.select be used over apply? Here are some performance checks:

df = pd.concat([df]*1000)

In [42]: %timeit df.apply(lambda row: label_race(row), axis=1)
1.07 s ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [44]: %%timeit
    ...: conditions = [
    ...:     df['eri_hispanic'] == 1,
    ...:     df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
    ...:     df['eri_nat_amer'] == 1,
    ...:     df['eri_asian'] == 1,
    ...:     df['eri_afr_amer'] == 1,
    ...:     df['eri_hawaiian'] == 1,
    ...:     df['eri_white'] == 1,
    ...: ]
    ...:
    ...: outputs = [
    ...:     'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
    ...: ]
    ...:
    ...: np.select(conditions, outputs, 'Other')
    ...:
    ...:
3.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using numpy.select gives us vastly improved performance, and the discrepancy will only increase as the data grows.


try this,

df.loc[df['eri_white']==1,'race_label'] = 'White'
df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
df['race_label'].fillna('Other', inplace=True)

O/P:

     lname   fname rno_cd  eri_afr_amer  eri_asian  eri_hawaiian  \
0      MOST    JEFF      E             0          0             0   
1    CRUISE     TOM      E             0          0             0   
2      DEPP  JOHNNY    NaN             0          0             0   
3     DICAP     LEO    NaN             0          0             0   
4    BRANDO  MARLON      E             0          0             0   
5     HANKS     TOM    NaN             0          0             0   
6    DENIRO  ROBERT      E             0          1             0   
7    PACINO      AL      E             0          0             0   
8  WILLIAMS   ROBIN      E             0          0             1   
9  EASTWOOD   CLINT      E             0          0             0   

   eri_hispanic  eri_nat_amer  eri_white rno_defined    race_label  
0             0             0          1       White         White  
1             1             0          0       White      Hispanic  
2             0             0          1     Unknown         White  
3             0             0          1     Unknown         White  
4             0             0          0       White         Other  
5             0             0          1     Unknown         White  
6             0             0          1       White   Two Or More  
7             0             0          1       White         White  
8             0             0          0       White  Haw/Pac Isl.  
9             0             0          1       White         White 

use .loc instead of apply.

it improves vectorization.

.loc works in simple manner, mask rows based on the condition, apply values to the freeze rows.

for more details visit, .loc docs

Performance metrics:

Accepted Answer:

def label_race (row):
   if row['eri_hispanic'] == 1 :
      return 'Hispanic'
   if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
      return 'Two Or More'
   if row['eri_nat_amer'] == 1 :
      return 'A/I AK Native'
   if row['eri_asian'] == 1:
      return 'Asian'
   if row['eri_afr_amer']  == 1:
      return 'Black/AA'
   if row['eri_hawaiian'] == 1:
      return 'Haw/Pac Isl.'
   if row['eri_white'] == 1:
      return 'White'
   return 'Other'

df=pd.read_csv('dataser.csv')
df = pd.concat([df]*1000)

%timeit df.apply(lambda row: label_race(row), axis=1)

1.15 s ± 46.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

My Proposed Answer:

def label_race(df):
    df.loc[df['eri_white']==1,'race_label'] = 'White'
    df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
    df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
    df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
    df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
    df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
    df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
    df['race_label'].fillna('Other', inplace=True)
df=pd.read_csv('s22.csv')
df = pd.concat([df]*1000)

%timeit label_race(df)

24.7 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


.apply() takes in a function as the first parameter; pass in the label_race function as so:

df['race_label'] = df.apply(label_race, axis=1)

You don't need to make a lambda function to pass in a function.


As @user3483203 pointed out, numpy.select is the best approach

Store your conditional statements and the corresponding actions in two lists

conds = [(df['eri_hispanic'] == 1),(df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1)),(df['eri_nat_amer'] == 1),(df['eri_asian'] == 1),(df['eri_afr_amer'] == 1),(df['eri_hawaiian'] == 1),(df['eri_white'] == 1,])

actions = ['Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White']

You can now use np.select using these lists as its arguments

df['label_race'] = np.select(conds,actions,default='Other')

Reference: https://numpy.org/doc/stable/reference/generated/numpy.select.html


Since this is the first Google result for 'pandas new column from others', here's a simple example:

import pandas as pd

# make a simple dataframe
df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
df
#    a  b
# 0  1  3
# 1  2  4

# create an unattached column with an index
df.apply(lambda row: row.a + row.b, axis=1)
# 0    4
# 1    6

# do same but attach it to the dataframe
df['c'] = df.apply(lambda row: row.a + row.b, axis=1)
df
#    a  b  c
# 0  1  3  4
# 1  2  4  6

If you get the SettingWithCopyWarning you can do it this way also:

fn = lambda row: row.a + row.b # define a function for the new column
col = df.apply(fn, axis=1) # get column data with an index
df = df.assign(c=col.values) # assign values to column 'c'

Source: https://stackoverflow.com/a/12555510/243392

And if your column name includes spaces you can use syntax like this:

df = df.assign(**{'some column name': col.values})

And here's the documentation for apply, and assign.


Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to pandas

xlrd.biffh.XLRDError: Excel xlsx file; not supported Pandas Merging 101 How to increase image size of pandas.DataFrame.plot in jupyter notebook? Trying to merge 2 dataframes but get ValueError Python Pandas User Warning: Sorting because non-concatenation axis is not aligned How to show all of columns name on pandas dataframe? Pandas/Python: Set value of one column based on value in another column Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Python convert object to float

Examples related to numpy

Unable to allocate array with shape and data type How to fix 'Object arrays cannot be loaded when allow_pickle=False' for imdb.load_data() function? Numpy, multiply array with scalar TypeError: only integer scalar arrays can be converted to a scalar index with 1D numpy indices array Could not install packages due to a "Environment error :[error 13]: permission denied : 'usr/local/bin/f2py'" Pytorch tensor to numpy array Numpy Resize/Rescale Image what does numpy ndarray shape do? How to round a numpy array? numpy array TypeError: only integer scalar arrays can be converted to a scalar index

Examples related to apply

pandas create new column based on values from other columns / apply a function of multiple columns, row-wise R Apply() function on specific dataframe columns Why isn't my Pandas 'apply' function referencing multiple columns working? python pandas: apply a function with arguments to a series Apply function to each column in a data frame observing each columns existing data type Apply a function to every row of a matrix or a data frame Remove columns from dataframe where ALL values are NA