Combine two columns of text in pandas dataframe

Question

I have a 20 x 4000 dataframe in Python using pandas. Two of these columns are named Year and quarter. I'd like to create a variable called period that makes Year = 2000 and quarter = q2 into 2000q2. Can anyone help with that?

User · Answer

One can use the assign method of DataFrame:

    df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']}).assign(period=lambda x: x.Year + x.quarter)
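If Year is stored as integers rather than strings, the same assign call needs a conversion first. A minimal sketch, assuming hypothetical sample data and the illustrative variable name result:

```python
import pandas as pd

# Hypothetical sample data: Year is stored as integers here.
df = pd.DataFrame({"Year": [2014, 2015], "quarter": ["q1", "q2"]})

# assign returns a new DataFrame and leaves df untouched; the lambda
# receives the frame and converts Year to str before concatenating.
result = df.assign(period=lambda x: x["Year"].astype(str) + x["quarter"])

print(result)
```

Because assign does not modify df in place, the result has to be bound to a name (or reassigned to df) to keep the new column.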

User · Answer

Here is my summary of the above solutions to concatenate / combine two columns with int and str values into a new column, using a separator between the values of the columns. Three solutions work for this purpose.

Be cautious about the separator: some symbols may cause "SyntaxError: EOL while scanning string literal". For example, a lone backslash ("\") as separator raises this SyntaxError.

    separator = "&&"

The pd.Series.str.cat() method does not work directly to concatenate / combine a column of int values with a column of str values, because the .str accessor requires string values; convert the int column with astype(str) first. (Using .cat instead of .str.cat raises "AttributeError: Can only use .cat accessor with a 'category' dtype".)

    # Solution 1
    df["period"] = df["Year"].map(str) + separator + df["quarter"]

    # Solution 2
    df["period"] = df[['Year', 'quarter']].apply(lambda x: '{}&&{}'.format(x[0], x[1]), axis=1)

    # Solution 3
    df["period"] = df.apply(lambda x: f'{x["Year"]}&&{x["quarter"]}', axis=1)
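The separator caveat can be illustrated with a short sketch. The helper name combine_with_sep and the sample data are hypothetical; the point is that a backslash separator has to be written as an escaped literal:

```python
import pandas as pd

def combine_with_sep(df, left, right, sep=""):
    """Hypothetical helper: join two columns as strings with a separator."""
    return df[left].astype(str) + sep + df[right].astype(str)

# Hypothetical sample data with an int column and a str column.
df = pd.DataFrame({"Year": [2014, 2015], "quarter": ["q1", "q2"]})

# "\\" is a single backslash character; writing "\" alone is a SyntaxError
# because the backslash escapes the closing quote.
sep = "\\"
df["period"] = combine_with_sep(df, "Year", "quarter", sep=sep)

print(df)
```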

User · Answer

This solution uses an intermediate step, compressing two columns of the DataFrame into a single column containing a list of the values. The list step works for any column dtype, but ''.join requires string elements, so non-string columns must be converted with astype(str) first.

    import pandas as pd

    df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
    df['list'] = df[['Year', 'quarter']].values.tolist()
    df['period'] = df['list'].apply(''.join)
    print(df)

Result:

       Year quarter        list  period
    0  2014      q1  [2014, q1]  2014q1
    1  2015      q2  [2015, q2]  2015q2
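When one of the columns is not a string column, each element can be converted while joining. A minimal sketch, assuming hypothetical sample data with an integer Year and the illustrative name periods:

```python
import pandas as pd

# Hypothetical sample data: Year is an int column, quarter a str column.
df = pd.DataFrame({"Year": [2014, 2015], "quarter": ["q1", "q2"]})

# Build one list per row, then convert every element to str before joining,
# so the join no longer depends on the column dtypes.
rows = df[["Year", "quarter"]].values.tolist()
periods = ["".join(map(str, row)) for row in rows]

df["period"] = periods
print(df)
```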

User · Answer

Although @silvado's answer is good, if you change df.map(str) to df.astype(str) it will be faster:

    import pandas as pd
    df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})

    In [131]: %timeit df["Year"].map(str)
    10000 loops, best of 3: 132 us per loop

    In [132]: %timeit df["Year"].astype(str)
    10000 loops, best of 3: 82.2 us per loop
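Outside IPython, a comparison like this can be reproduced with the standard timeit module. This is only a sketch: the frame size, the function names with_map and with_astype, and the repeat counts are arbitrary assumptions, and the measured times will vary with the machine and the pandas version.

```python
import timeit

import pandas as pd

# Hypothetical test frame of 10,000 rows; the size is an arbitrary choice.
df = pd.DataFrame({"Year": [2014, 2015] * 5000, "quarter": ["q1", "q2"] * 5000})

def with_map():
    return df["Year"].map(str) + df["quarter"]

def with_astype():
    return df["Year"].astype(str) + df["quarter"]

# Take the best of 3 repeats of 10 calls each, as %timeit does in spirit.
timings = {
    func.__name__: min(timeit.repeat(func, number=10, repeat=3))
    for func in (with_map, with_astype)
}

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s for 10 calls")
```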

User · Answer

As many have mentioned previously, you must convert each column to string and then use the plus operator to combine two string columns. You can get a large performance improvement by using NumPy.

    %timeit df['Year'].values.astype(str) + df.quarter
    71.1 ms ± 3.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %timeit df['Year'].astype(str) + df['quarter']
    565 ms ± 22.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

User · Answer

Here is an implementation that I find very versatile:

    In [1]: import pandas as pd

    In [2]: df = pd.DataFrame([[0, 'the', 'quick', 'brown'],
       ...:                    [1, 'fox', 'jumps', 'over'],
       ...:                    [2, 'the', 'lazy', 'dog']],
       ...:                   columns=['c0', 'c1', 'c2', 'c3'])

    In [3]: def str_join(df, sep, *cols):
       ...:     from functools import reduce
       ...:     return reduce(lambda x, y: x.astype(str).str.cat(y.astype(str), sep=sep),
       ...:                   [df[col] for col in cols])
       ...:

    In [4]: df['cat'] = str_join(df, '-', 'c0', 'c1', 'c2', 'c3')

    In [5]: df
    Out[5]:
       c0   c1     c2     c3                cat
    0   0  the  quick  brown  0-the-quick-brown
    1   1  fox  jumps   over   1-fox-jumps-over
    2   2  the   lazy    dog     2-the-lazy-dog

User · Answer

More efficient is:

    def concat_df_str1(df):
        """ run time: 1.3416s """
        return pd.Series([''.join(row.astype(str)) for row in df.values], index=df.index)

And here is a time test:

    import numpy as np
    import pandas as pd

    from time import time


    def concat_df_str1(df):
        """ run time: 1.3416s """
        return pd.Series([''.join(row.astype(str)) for row in df.values], index=df.index)


    def concat_df_str2(df):
        """ run time: 5.2758s """
        return df.astype(str).sum(axis=1)


    def concat_df_str3(df):
        """ run time: 5.0076s """
        df = df.astype(str)
        return df[0] + df[1] + df[2] + df[3] + df[4] + \
               df[5] + df[6] + df[7] + df[8] + df[9]


    def concat_df_str4(df):
        """ run time: 7.8624s """
        return df.astype(str).apply(lambda x: ''.join(x), axis=1)


    def main():
        df = pd.DataFrame(np.zeros(1000000).reshape(100000, 10))
        df = df.astype(int)

        time1 = time()
        df_en = concat_df_str4(df)
        print('run time: %.4fs' % (time() - time1))
        print(df_en.head(10))


    if __name__ == '__main__':
        main()

Finally, note that when sum (concat_df_str2) is used, the result is not a plain concatenation: it is converted to an integer.

User · Answer

Using zip could be even quicker:

    df["period"] = [''.join(i) for i in zip(df["Year"].map(str), df["quarter"])]

Code used to produce the timing graph:

    import pandas as pd
    import numpy as np
    import timeit
    import matplotlib.pyplot as plt
    from collections import defaultdict

    df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})

    myfuncs = {
        "df['Year'].astype(str) + df['quarter']":
            lambda: df['Year'].astype(str) + df['quarter'],
        "df['Year'].map(str) + df['quarter']":
            lambda: df['Year'].map(str) + df['quarter'],
        "df.Year.str.cat(df.quarter)":
            lambda: df.Year.str.cat(df.quarter),
        "df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)":
            lambda: df.loc[:, ['Year', 'quarter']].astype(str).sum(axis=1),
        "df[['Year','quarter']].astype(str).sum(axis=1)":
            lambda: df[['Year', 'quarter']].astype(str).sum(axis=1),
        "df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)":
            lambda: df[['Year', 'quarter']].apply(lambda x: '{}{}'.format(x[0], x[1]), axis=1),
        "[''.join(i) for i in zip(dataframe['Year'].map(str),dataframe['quarter'])]":
            lambda: [''.join(i) for i in zip(df["Year"].map(str), df["quarter"])],
    }

    d = defaultdict(dict)
    step = 10
    cont = True
    while cont:
        lendf = len(df)
        print(lendf)
        for k, v in myfuncs.items():
            iters = 1
            t = 0
            while t < 0.2:
                ts = timeit.repeat(v, number=iters, repeat=3)
                t = min(ts)
                iters *= 10
            d[k][lendf] = t / iters
            if t > 2:
                cont = False
        df = pd.concat([df] * step)

    pd.DataFrame(d).plot().legend(loc='upper center', bbox_to_anchor=(0.5, -0.15))
    plt.yscale('log')
    plt.xscale('log')
    plt.ylabel('seconds')
    plt.xlabel('df rows')
    plt.show()

User · Answer

If both columns are strings, you can concatenate them directly:

    df["period"] = df["Year"] + df["quarter"]

If one (or both) of the columns are not string typed, you should convert it (them) first:

    df["period"] = df["Year"].astype(str) + df["quarter"]

Beware of NaNs when doing this!

If you need to join multiple string columns, you can use agg:

    df['period'] = df[['Year', 'quarter', ...]].agg('-'.join, axis=1)

Where "-" is the separator.
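The NaN warning can be made concrete with a short sketch. The sample data is hypothetical, and replacing missing quarters with an empty string via fillna is just one possible policy, not the only reasonable one:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a missing quarter in the second row.
df = pd.DataFrame({"Year": [2014, 2015], "quarter": ["q1", np.nan]})

# A missing value on either side makes the whole concatenated value missing.
naive = df["Year"].astype(str) + df["quarter"]

# One option: replace missing quarters with an empty string before joining.
filled = df["Year"].astype(str) + df["quarter"].fillna("")

print(naive)
print(filled)
```

Note that converting a column that itself contains NaN with astype(str) may turn the missing value into a literal string, so it is usually safer to handle missing values before the conversion.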

User · Answer

My take:

    listofcols = ['col1', 'col2', 'col3']
    df['combined_cols'] = ''

    for column in listofcols:
        df['combined_cols'] = df['combined_cols'] + ' ' + df[column]
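Starting from an empty string means every combined value begins with a leading space. A possible variation, sketched here with hypothetical sample data, seeds the result with the first column and also converts each column to str so that non-string columns work:

```python
import pandas as pd

# Hypothetical sample data; the column names follow the loop above.
df = pd.DataFrame({"col1": ["a", "b"], "col2": ["c", "d"], "col3": ["e", "f"]})
listofcols = ["col1", "col2", "col3"]

# Seed with the first column so no leading separator is produced.
combined = df[listofcols[0]].astype(str)
for column in listofcols[1:]:
    combined = combined + " " + df[column].astype(str)

df["combined_cols"] = combined
print(df)
```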

User · Answer

    from functools import reduce

    import numpy as np
    import pandas as pd


    def madd(x):
        """Performs element-wise string concatenation with multiple input arrays.

        Args:
            x: iterable of np.array.

        Returns: np.array.
        """
        for i, arr in enumerate(x):
            # Convert any non-string array to str before concatenating.
            if type(arr.item(0)) is not str:
                x[i] = x[i].astype(str)
        return reduce(np.core.defchararray.add, x)

For example:

    data = list(zip([2000] * 4, ['q1', 'q2', 'q3', 'q4']))
    df = pd.DataFrame(data=data, columns=['Year', 'quarter'])
    df['period'] = madd([df[col].values for col in ['Year', 'quarter']])

    df

       Year quarter  period
    0  2000      q1  2000q1
    1  2000      q2  2000q2
    2  2000      q3  2000q3
    3  2000      q4  2000q4

User · Answer

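Use .combine_first:

    df['Period'] = df['Year'].combine_first(df['Quarter'])

Note that combine_first only fills missing values in Year with the corresponding values from Quarter; it does not concatenate the two columns.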
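The following sketch, on hypothetical sample data, compares what combine_first returns with what plain string concatenation returns for the same two columns:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a missing Year in the second row.
df = pd.DataFrame({"Year": ["2014", np.nan], "Quarter": ["q1", "q2"]})

# combine_first keeps Year where it is present and falls back to Quarter
# only where Year is missing.
filled = df["Year"].combine_first(df["Quarter"])

# Concatenation joins both values in every row (missing Year treated as "").
joined = df["Year"].fillna("") + df["Quarter"]

print(filled.tolist())
print(joined.tolist())
```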

User · Answer

The method cat() of the .str accessor works really well for this:

    >>> import pandas as pd
    >>> df = pd.DataFrame([["2014", "q1"],
    ...                    ["2015", "q3"]],
    ...                   columns=('Year', 'Quarter'))
    >>> print(df)
       Year Quarter
    0  2014      q1
    1  2015      q3
    >>> df['Period'] = df.Year.str.cat(df.Quarter)
    >>> print(df)
       Year Quarter  Period
    0  2014      q1  2014q1
    1  2015      q3  2015q3

cat() even allows you to add a separator so, for example, suppose you only have integers for year and period, you can do this:

    >>> import pandas as pd
    >>> df = pd.DataFrame([[2014, 1],
    ...                    [2015, 3]],
    ...                   columns=('Year', 'Quarter'))
    >>> print(df)
       Year  Quarter
    0  2014        1
    1  2015        3
    >>> df['Period'] = df.Year.astype(str).str.cat(df.Quarter.astype(str), sep='q')
    >>> print(df)
       Year  Quarter  Period
    0  2014        1  2014q1
    1  2015        3  2015q3

Joining multiple columns is just a matter of passing either a list of series or a dataframe containing all but the first column as a parameter to str.cat() invoked on the first column (Series):

    >>> df = pd.DataFrame(
    ...     [['USA', 'Nevada', 'Las Vegas'],
    ...      ['Brazil', 'Pernambuco', 'Recife']],
    ...     columns=['Country', 'State', 'City'],
    ... )
    >>> df['AllTogether'] = df['Country'].str.cat(df[['State', 'City']], sep=' - ')
    >>> print(df)
      Country       State       City                   AllTogether
    0     USA      Nevada  Las Vegas      USA - Nevada - Las Vegas
    1  Brazil  Pernambuco     Recife  Brazil - Pernambuco - Recife

Do note that if your pandas dataframe/series has null values, you need to include the parameter na_rep to replace the NaN values with a string, otherwise the combined column will default to NaN.
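The effect of na_rep can be seen in a small sketch. The sample data is hypothetical, and an empty string is only one possible replacement value:

```python
import pandas as pd

# Hypothetical sample data with a missing Quarter in the second row.
df = pd.DataFrame({"Year": ["2014", "2015"], "Quarter": ["q1", None]})

# Without na_rep, a missing value in either column makes the result missing.
without_rep = df["Year"].str.cat(df["Quarter"])

# With na_rep, missing values are replaced by the given string before joining.
with_rep = df["Year"].str.cat(df["Quarter"], na_rep="")

print(without_rep)
print(with_rep)
```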

User · Answer

Small data-sets (< 150 rows):

    [''.join(i) for i in zip(df["Year"].map(str), df["quarter"])]

or slightly slower but more compact:

    df.Year.str.cat(df.quarter)

Larger data sets (> 150 rows):

    df['Year'].astype(str) + df['quarter']

UPDATE: Timing graph (Pandas 0.23.4)

Let's test it on a 200K-row DF:

    In [250]: df
    Out[250]:
       Year quarter
    0  2014      q1
    1  2015      q2

    In [251]: df = pd.concat([df] * 10**5)

    In [252]: df.shape
    Out[252]: (200000, 2)

UPDATE: new timings using Pandas 0.19.0

Timing without CPU/GPU optimization (sorted from fastest to slowest):

    In [107]: %timeit df['Year'].astype(str) + df['quarter']
    10 loops, best of 3: 131 ms per loop

    In [106]: %timeit df['Year'].map(str) + df['quarter']
    10 loops, best of 3: 161 ms per loop

    In [108]: %timeit df.Year.str.cat(df.quarter)
    10 loops, best of 3: 189 ms per loop

    In [109]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
    1 loop, best of 3: 567 ms per loop

    In [110]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
    1 loop, best of 3: 584 ms per loop

    In [111]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
    1 loop, best of 3: 24.7 s per loop

Timing using CPU/GPU optimization:

    In [113]: %timeit df['Year'].astype(str) + df['quarter']
    10 loops, best of 3: 53.3 ms per loop

    In [114]: %timeit df['Year'].map(str) + df['quarter']
    10 loops, best of 3: 65.5 ms per loop

    In [115]: %timeit df.Year.str.cat(df.quarter)
    10 loops, best of 3: 79.9 ms per loop

    In [116]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
    1 loop, best of 3: 230 ms per loop

    In [117]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
    1 loop, best of 3: 230 ms per loop

    In [118]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
    1 loop, best of 3: 9.38 s per loop

Answer contribution by @anton-vbr

User · Answer

    df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
    df['period'] = df[['Year', 'quarter']].apply(lambda x: ''.join(x), axis=1)

Yields this dataframe:

       Year quarter  period
    0  2014      q1  2014q1
    1  2015      q2  2015q2

This method generalizes to an arbitrary number of string columns by replacing df[['Year', 'quarter']] with any column slice of your dataframe, e.g. df.iloc[:, 0:2].apply(lambda x: ''.join(x), axis=1).

You can find more information about the apply() method in the pandas documentation.
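Since ''.join only accepts strings, this approach needs a conversion step when a column holds other types. A minimal sketch, assuming hypothetical sample data with an integer Year:

```python
import pandas as pd

# Hypothetical sample data: Year is an int column here.
df = pd.DataFrame({"Year": [2014, 2015], "quarter": ["q1", "q2"]})

# Convert the selected columns to str first, then join each row.
df["period"] = df[["Year", "quarter"]].astype(str).apply("".join, axis=1)

print(df)
```

Row-wise apply calls a Python function once per row, so on large frames the vectorised df["Year"].astype(str) + df["quarter"] form is usually the faster choice, consistent with the timings reported in the other answers.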

User · Answer

Generalising to multiple columns, why not:

    columns = ['whatever', 'columns', 'you', 'choose']
    df['period'] = df[columns].astype(str).sum(axis=1)

User · Answer

Use of a lambda function, this time with string.format():

    import pandas as pd
    df = pd.DataFrame({'Year': ['2014', '2015'], 'Quarter': ['q1', 'q2']})
    print(df)
    df['YearQuarter'] = df[['Year', 'Quarter']].apply(lambda x: '{}{}'.format(x[0], x[1]), axis=1)
    print(df)

      Quarter  Year
    0      q1  2014
    1      q2  2015

      Quarter  Year YearQuarter
    0      q1  2014      2014q1
    1      q2  2015      2015q2

This allows you to work with non-strings and reformat values as needed:

    import pandas as pd
    df = pd.DataFrame({'Year': ['2014', '2015'], 'Quarter': [1, 2]})
    print(df.dtypes)
    print(df)

    df['YearQuarter'] = df[['Year', 'Quarter']].apply(lambda x: '{}q{}'.format(x[0], x[1]), axis=1)
    print(df)

    Quarter     int64
    Year       object
    dtype: object

       Quarter  Year
    0        1  2014
    1        2  2015

       Quarter  Year YearQuarter
    0        1  2014      2014q1
    1        2  2015      2015q2

User · Answer

Let us suppose your dataframe is df with columns Year and Quarter:

    import pandas as pd
    df = pd.DataFrame({'Quarter': 'q1 q2 q3 q4'.split(), 'Year': '2000'})

Suppose we want to see the dataframe:

    >>> df
      Quarter  Year
    0      q1  2000
    1      q2  2000
    2      q3  2000
    3      q4  2000

Finally, concatenate the Year and the Quarter as follows:

    df['Period'] = df['Year'] + ' ' + df['Quarter']

You can now print df to see the resulting dataframe:

    >>> df
      Quarter  Year   Period
    0      q1  2000  2000 q1
    1      q2  2000  2000 q2
    2      q3  2000  2000 q3
    3      q4  2000  2000 q4

If you do not want the space between the year and quarter, simply remove it by doing:

    df['Period'] = df['Year'] + df['Quarter']
