[python] String concatenation of two pandas columns

I have a following DataFrame:

from pandas import *
df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3]})

It looks like this:

    bar foo
0    1   a
1    2   b
2    3   c

Now I want to have something like:

     bar
0    1 is a
1    2 is b
2    3 is c

How can I achieve this? I tried the following:

df['foo'] = '%s is %s' % (df['bar'], df['foo'])

but it gives me a wrong result:

>>>print df.ix[0]

bar                                                    a
foo    0    a
1    b
2    c
Name: bar is 0    1
1    2
2
Name: 0

Sorry for a dumb question, but this one pandas: combine two columns in a DataFrame wasn't helpful for me.

This question is related to python string pandas numpy dataframe

The answer is


You could also use

df['bar'] = df['bar'].str.cat(df['foo'].values.astype(str), sep=' is ')

series.str.cat is the most flexible way to approach this problem:

For df = pd.DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3]})

df.foo.str.cat(df.bar.astype(str), sep=' is ')

>>>  0    a is 1
     1    b is 2
     2    c is 3
     Name: foo, dtype: object

OR

df.bar.astype(str).str.cat(df.foo, sep=' is ')

>>>  0    1 is a
     1    2 is b
     2    3 is c
     Name: bar, dtype: object

Unlike .join() (which is for joining list contained in a single Series), this method is for joining 2 Series together. It also allows you to ignore or replace NaN values as desired.


@DanielVelkov answer is the proper one BUT using string literals is faster:

# Daniel's
%timeit df.apply(lambda x:'%s is %s' % (x['bar'],x['foo']),axis=1)
## 963 µs ± 157 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# String literals - python 3
%timeit df.apply(lambda x: f"{x['bar']} is {x['foo']}", axis=1)
## 849 µs ± 4.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I have encountered a specific case from my side with 10^11 rows in my dataframe, and in this case none of the proposed solution is appropriate. I have used categories, and this should work fine in all cases when the number of unique string is not too large. This is easily done in the R software with XxY with factors but I could not find any other way to do it in python (I'm new to python). If anyone knows a place where this is implemented I'd be glad to know.

def Create_Interaction_var(df,Varnames):
    '''
    :df data frame
    :list of 2 column names, say "X" and "Y". 
    The two columns should be strings or categories
    convert strings columns to categories
    Add a column with the "interaction of X and Y" : X x Y, with name 
    "Interaction-X_Y"
    '''
    df.loc[:, Varnames[0]] = df.loc[:, Varnames[0]].astype("category")
    df.loc[:, Varnames[1]] = df.loc[:, Varnames[1]].astype("category")
    CatVar = "Interaction-" + "-".join(Varnames)
    Var0Levels = pd.DataFrame(enumerate(df.loc[:,Varnames[0]].cat.categories)).rename(columns={0 : "code0",1 : "name0"})
    Var1Levels = pd.DataFrame(enumerate(df.loc[:,Varnames[1]].cat.categories)).rename(columns={0 : "code1",1 : "name1"})
    NbLevels=len(Var0Levels)

    names = pd.DataFrame(list(itertools.product(dict(enumerate(df.loc[:,Varnames[0]].cat.categories)),
                                                dict(enumerate(df.loc[:,Varnames[1]].cat.categories)))),
                         columns=['code0', 'code1']).merge(Var0Levels,on="code0").merge(Var1Levels,on="code1")
    names=names.assign(Interaction=[str(x) + '_' + y for x, y in zip(names["name0"], names["name1"])])
    names["code01"]=names["code0"] + NbLevels*names["code1"]
    df.loc[:,CatVar]=df.loc[:,Varnames[0]].cat.codes+NbLevels*df.loc[:,Varnames[1]].cat.codes
    df.loc[:, CatVar]=  df[[CatVar]].replace(names.set_index("code01")[["Interaction"]].to_dict()['Interaction'])[CatVar]
    df.loc[:, CatVar] = df.loc[:, CatVar].astype("category")
    return df

The problem in your code is that you want to apply the operation on every row. The way you've written it though takes the whole 'bar' and 'foo' columns, converts them to strings and gives you back one big string. You can write it like:

df.apply(lambda x:'%s is %s' % (x['bar'],x['foo']),axis=1)

It's longer than the other answer but is more generic (can be used with values that are not strings).


This question has already been answered, but I believe it would be good to throw some useful methods not previously discussed into the mix, and compare all methods proposed thus far in terms of performance.

Here are some useful solutions to this problem, in increasing order of performance.


DataFrame.agg

This is a simple str.format-based approach.

df['baz'] = df.agg('{0[bar]} is {0[foo]}'.format, axis=1)
df
  foo  bar     baz
0   a    1  1 is a
1   b    2  2 is b
2   c    3  3 is c

You can also use f-string formatting here:

df['baz'] = df.agg(lambda x: f"{x['bar']} is {x['foo']}", axis=1)
df
  foo  bar     baz
0   a    1  1 is a
1   b    2  2 is b
2   c    3  3 is c

char.array-based Concatenation

Convert the columns to concatenate as chararrays, then add them together.

a = np.char.array(df['bar'].values)
b = np.char.array(df['foo'].values)

df['baz'] = (a + b' is ' + b).astype(str)
df
  foo  bar     baz
0   a    1  1 is a
1   b    2  2 is b
2   c    3  3 is c

List Comprehension with zip

I cannot overstate how underrated list comprehensions are in pandas.

df['baz'] = [str(x) + ' is ' + y for x, y in zip(df['bar'], df['foo'])]

Alternatively, using str.join to concat (will also scale better):

df['baz'] = [
    ' '.join([str(x), 'is', y]) for x, y in zip(df['bar'], df['foo'])]

df
  foo  bar     baz
0   a    1  1 is a
1   b    2  2 is b
2   c    3  3 is c

List comprehensions excel in string manipulation, because string operations are inherently hard to vectorize, and most pandas "vectorised" functions are basically wrappers around loops. I have written extensively about this topic in For loops with pandas - When should I care?. In general, if you don't have to worry about index alignment, use a list comprehension when dealing with string and regex operations.

The list comp above by default does not handle NaNs. However, you could always write a function wrapping a try-except if you needed to handle it.

def try_concat(x, y):
    try:
        return str(x) + ' is ' + y
    except (ValueError, TypeError):
        return np.nan


df['baz'] = [try_concat(x, y) for x, y in zip(df['bar'], df['foo'])]

perfplot Performance Measurements

enter image description here

Graph generated using perfplot. Here's the complete code listing.

Functions

def brenbarn(df):
    return df.assign(baz=df.bar.map(str) + " is " + df.foo)

def danielvelkov(df):
    return df.assign(baz=df.apply(
        lambda x:'%s is %s' % (x['bar'],x['foo']),axis=1))

def chrimuelle(df):
    return df.assign(
        baz=df['bar'].astype(str).str.cat(df['foo'].values, sep=' is '))

def vladimiryashin(df):
    return df.assign(baz=df.astype(str).apply(lambda x: ' is '.join(x), axis=1))

def erickfis(df):
    return df.assign(
        baz=df.apply(lambda x: f"{x['bar']} is {x['foo']}", axis=1))

def cs1_format(df):
    return df.assign(baz=df.agg('{0[bar]} is {0[foo]}'.format, axis=1))

def cs1_fstrings(df):
    return df.assign(baz=df.agg(lambda x: f"{x['bar']} is {x['foo']}", axis=1))

def cs2(df):
    a = np.char.array(df['bar'].values)
    b = np.char.array(df['foo'].values)

    return df.assign(baz=(a + b' is ' + b).astype(str))

def cs3(df):
    return df.assign(
        baz=[str(x) + ' is ' + y for x, y in zip(df['bar'], df['foo'])])

df['bar'] = df.bar.map(str) + " is " + df.foo.


df.astype(str).apply(lambda x: ' is '.join(x), axis=1)

0    1 is a
1    2 is b
2    3 is c
dtype: object

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to string

How to split a string in two and store it in a field String method cannot be found in a main class method Kotlin - How to correctly concatenate a String Replacing a character from a certain index Remove quotes from String in Python Detect whether a Python string is a number or a letter How does String substring work in Swift How does String.Index work in Swift swift 3.0 Data to String? How to parse JSON string in Typescript

Examples related to pandas

xlrd.biffh.XLRDError: Excel xlsx file; not supported Pandas Merging 101 How to increase image size of pandas.DataFrame.plot in jupyter notebook? Trying to merge 2 dataframes but get ValueError Python Pandas User Warning: Sorting because non-concatenation axis is not aligned How to show all of columns name on pandas dataframe? Pandas/Python: Set value of one column based on value in another column Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Python convert object to float

Examples related to numpy

Unable to allocate array with shape and data type How to fix 'Object arrays cannot be loaded when allow_pickle=False' for imdb.load_data() function? Numpy, multiply array with scalar TypeError: only integer scalar arrays can be converted to a scalar index with 1D numpy indices array Could not install packages due to a "Environment error :[error 13]: permission denied : 'usr/local/bin/f2py'" Pytorch tensor to numpy array Numpy Resize/Rescale Image what does numpy ndarray shape do? How to round a numpy array? numpy array TypeError: only integer scalar arrays can be converted to a scalar index

Examples related to dataframe

Trying to merge 2 dataframes but get ValueError How to show all of columns name on pandas dataframe? Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Display all dataframe columns in a Jupyter Python Notebook How to convert column with string type to int form in pyspark data frame? Display/Print one column from a DataFrame of Series in Pandas Binning column with python pandas Selection with .loc in python Set value to an entire column of a pandas dataframe