Apply pandas function to column to create multiple new columns

Question

How to do this in pandas:

I have a function extract_text_features on a single text column, returning multiple output columns. Specifically, the function returns 6 values.

The function works, however there doesn't seem to be any proper return type (pandas DataFrame, numpy array, Python list) such that the output can get correctly assigned:

    df.ix[:, 10:16] = df.textcol.map(extract_text_features)

So I think I need to drop back to iterating with df.iterrows(), as per this?

UPDATE: Iterating with df.iterrows() is at least 20x slower, so I surrendered and split out the function into six distinct .map(lambda ...) calls.

UPDATE 2: This question was asked back around v0.11.0. Hence much of the question and answers are not too relevant.

User · Answer

This is what I've done in the past:

    df = pd.DataFrame({'textcol': np.random.rand(5)})

    df
        textcol
    0  0.626524
    1  0.119967
    2  0.803650
    3  0.100880
    4  0.017859

    df.textcol.apply(lambda s: pd.Series({'feature1': s+1, 'feature2': s-1}))
       feature1  feature2
    0  1.626524 -0.373476
    1  1.119967 -0.880033
    2  1.803650 -0.196350
    3  1.100880 -0.899120
    4  1.017859 -0.982141

Editing for completeness:

    pd.concat([df, df.textcol.apply(lambda s: pd.Series({'feature1': s+1, 'feature2': s-1}))], axis=1)
        textcol  feature1  feature2
    0  0.626524  1.626524 -0.373476
    1  0.119967  1.119967 -0.880033
    2  0.803650  1.803650 -0.196350
    3  0.100880  1.100880 -0.899120
    4  0.017859  1.017859 -0.982141

User · Answer

In 2020, I use apply() with the argument result_type='expand':

    >>> applied_df = df.apply(lambda row: fn(row.text), axis='columns', result_type='expand')
    >>> df = pd.concat([df, applied_df], axis='columns')
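The two lines above leave fn and the input frame undefined, so here is a minimal runnable sketch of the same pattern. The function fn, the column name text, and the three output column names are hypothetical, chosen only for illustration.

```python
import pandas as pd


def fn(text):
    # Hypothetical feature extractor: returns three values per input string.
    return len(text), text.count("e"), text.upper()


df = pd.DataFrame({"text": ["she ran", "he jumped", "they hiked"]})

# result_type="expand" turns each returned tuple into one row of a new DataFrame.
applied_df = df.apply(
    lambda row: fn(row["text"]), axis="columns", result_type="expand"
)

# The expanded frame has integer column labels (0, 1, 2), so name them explicitly.
applied_df.columns = ["length", "count_e", "upper"]

df = pd.concat([df, applied_df], axis="columns")
print(df)
```

Naming the expanded columns before concatenating avoids ending up with columns labelled 0, 1 and 2 in the final frame.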

User · Answer

The accepted solution is going to be extremely slow for lots of data. The solution with the greatest number of upvotes is a little difficult to read and also slow with numeric data. If each new column can be calculated independently of the others, I would just assign each of them directly without using apply.

Example with fake character data

Create 100,000 strings in a DataFrame:

    df = pd.DataFrame(np.random.choice(['he jumped', 'she ran', 'they hiked'],
                                       size=100000, replace=True),
                      columns=['words'])
    df.head()
            words
    0     she ran
    1     she ran
    2  they hiked
    3  they hiked
    4  they hiked

Let's say we wanted to extract some text features as done in the original question. For instance, let's extract the first character, count the occurrences of the letter 'e' and capitalize the phrase.

    df['first'] = df['words'].str[0]
    df['count_e'] = df['words'].str.count('e')
    df['cap'] = df['words'].str.capitalize()
    df.head()
            words first  count_e         cap
    0     she ran     s        1     She ran
    1     she ran     s        1     She ran
    2  they hiked     t        2  They hiked
    3  they hiked     t        2  They hiked
    4  they hiked     t        2  They hiked

Timings

    %%timeit
    df['first'] = df['words'].str[0]
    df['count_e'] = df['words'].str.count('e')
    df['cap'] = df['words'].str.capitalize()
    127 ms ± 585 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    def extract_text_features(x):
        return x[0], x.count('e'), x.capitalize()

    %timeit df['first'], df['count_e'], df['cap'] = zip(*df['words'].apply(extract_text_features))
    101 ms ± 2.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Surprisingly, you can get better performance by looping through each value:

    %%timeit
    a, b, c = [], [], []
    for s in df['words']:
        a.append(s[0])
        b.append(s.count('e'))
        c.append(s.capitalize())

    df['first'] = a
    df['count_e'] = b
    df['cap'] = c
    79.1 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Another example with fake numeric data

Create 1 million random numbers and test the powers function from the zip answer.

    df = pd.DataFrame(np.random.rand(1000000), columns=['num'])

    def powers(x):
        return x, x**2, x**3, x**4, x**5, x**6

    %%timeit
    df['p1'], df['p2'], df['p3'], df['p4'], df['p5'], df['p6'] = \
        zip(*df['num'].map(powers))
    1.35 s ± 83.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Assigning each column is 25x faster and very readable:

    %%timeit
    df['p1'] = df['num'] ** 1
    df['p2'] = df['num'] ** 2
    df['p3'] = df['num'] ** 3
    df['p4'] = df['num'] ** 4
    df['p5'] = df['num'] ** 5
    df['p6'] = df['num'] ** 6
    51.6 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

I made a similar response with more details here on why apply is typically not the way to go.

User · Answer

Building off of user1827356's answer, you can do the assignment in one pass using df.merge:

    df.merge(df.textcol.apply(lambda s: pd.Series({'feature1': s+1, 'feature2': s-1})),
             left_index=True, right_index=True)

        textcol  feature1  feature2
    0  0.772692  1.772692 -0.227308
    1  0.857210  1.857210 -0.142790
    2  0.065639  1.065639 -0.934361
    3  0.819160  1.819160 -0.180840
    4  0.088212  1.088212 -0.911788

EDIT: Please be aware of the huge memory consumption and low speed: https://ys-l.github.io/posts/2015/08/28/how-not-to-use-pandas-apply

User · Answer

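    def extract_text_features(feature):
        ...
        ...
        return pd.Series((feature1, feature2))

    df[['NewFeature1', 'NewFeature2']] = df[['feature']].apply(extract_text_features, axis=1)

Here a dataframe with a single feature is being converted to two new features. Note that because apply is called on df[['feature']] with axis=1, the function receives each row as a Series rather than a bare value. Give this a try too.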

User · Answer

I've looked at several ways of doing this, and the method shown here (returning a pandas Series) doesn't seem to be the most efficient.

If we start with a largeish dataframe of random data:

    # Set up a dataframe of random numbers and create a string column D
    df = pd.DataFrame(np.random.randn(10000, 3), columns=list('ABC'))
    df['D'] = df.apply(lambda r: ':'.join(map(str, (r.A, r.B, r.C))), axis=1)
    columns = 'new_a', 'new_b', 'new_c'

The example shown here:

    # Create the dataframe by returning a series
    def method_b(v):
        return pd.Series({k: v for k, v in zip(columns, v.split(':'))})
    %timeit -n10 -r3 df.D.apply(method_b)
    # 10 loops, best of 3: 2.77 s per loop

An alternative method:

    # Create a dataframe from a series of tuples
    def method_a(v):
        return v.split(':')
    %timeit -n10 -r3 pd.DataFrame(df.D.apply(method_a).tolist(), columns=columns)
    # 10 loops, best of 3: 8.85 ms per loop

By my reckoning it's far more efficient to take a series of tuples (or lists) and then convert that to a DataFrame. I'd be interested to hear people's thinking, though, if there's an error in my working.
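One detail the timing comparison leaves out is index alignment: building a DataFrame from .tolist() produces a fresh 0..n-1 index, which only lines up with the original frame if that frame also has a default index. A small sketch of the same idea that passes the original index through explicitly follows. The sample data, the ':' separator and the helper name split_fields are assumptions made for illustration.

```python
import pandas as pd

# Small frame with a deliberately non-default index to show the alignment issue.
df = pd.DataFrame({"D": ["1:2:3", "4:5:6", "7:8:9"]}, index=[10, 20, 30])
columns = ["new_a", "new_b", "new_c"]


def split_fields(value):
    # Hypothetical helper: returns a plain list, which is cheap to construct.
    return value.split(":")


# Build the new columns in one shot from a list of lists,
# reusing the original index so that rows stay aligned.
parts = pd.DataFrame(
    df["D"].apply(split_fields).tolist(), columns=columns, index=df.index
)

result = df.join(parts)
print(result)
```

For this particular string-splitting case, df["D"].str.split(":", expand=True) should give a similar result without apply, though I have not timed it against the approach above.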

User · Answer

I usually do this using zip:

    >>> df = pd.DataFrame([[i] for i in range(10)], columns=['num'])
    >>> df
        num
    0    0
    1    1
    2    2
    3    3
    4    4
    5    5
    6    6
    7    7
    8    8
    9    9

    >>> def powers(x):
    ...     return x, x**2, x**3, x**4, x**5, x**6

    >>> df['p1'], df['p2'], df['p3'], df['p4'], df['p5'], df['p6'] = \
    ...     zip(*df['num'].map(powers))

    >>> df
        num  p1  p2   p3    p4     p5      p6
    0     0   0   0    0     0      0       0
    1     1   1   1    1     1      1       1
    2     2   2   4    8    16     32      64
    3     3   3   9   27    81    243     729
    4     4   4  16   64   256   1024    4096
    5     5   5  25  125   625   3125   15625
    6     6   6  36  216  1296   7776   46656
    7     7   7  49  343  2401  16807  117649
    8     8   8  64  512  4096  32768  262144
    9     9   9  81  729  6561  59049  531441

User · Answer

Summary: If you only want to create a few columns, use

    df[['new_col1', 'new_col2']] = df[['data1', 'data2']].apply(function_of_your_choosing, axis=1)

For this solution, the number of new columns you are creating must be equal to the number of columns you use as input to the .apply() function. If you want to do something else, have a look at the other answers.

Details

Let's say you have a two-column dataframe. The first column is a person's height when they are 10; the second is said person's height when they are 20.

Suppose you need to calculate both the mean of each person's heights and the sum of each person's heights. That's two values per row.

You could do this via the following, soon-to-be-applied function:

    def mean_and_sum(x):
        """
        Calculates the mean and sum of two heights.

        Parameters:
        x -- the values in the row this function is applied to.
             Could also work on a list or a tuple.
        """
        total = x[0] + x[1]
        mean = total / 2
        return [mean, total]

You might use this function like so (note that the function itself is passed, not the result of calling it):

    df[['height_at_age_10', 'height_at_age_20']].apply(mean_and_sum, axis=1)

(To be clear: this apply function takes in the values from each row in the subsetted dataframe and returns a list.)

However, if you do this:

    df['Mean_&_Sum'] = df[['height_at_age_10', 'height_at_age_20']].apply(mean_and_sum, axis=1)

you'll create 1 new column that contains the [mean, sum] lists, which you'd presumably want to avoid, because that would require another lambda/apply.

Instead, you want to break out each value into its own column. To do this, you can create two columns at once:

    df[['Mean', 'Sum']] = df[['height_at_age_10', 'height_at_age_20']].apply(mean_and_sum, axis=1)
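A caveat on the approach above: as far as I know, since pandas 0.23 a function that returns a list from apply(..., axis=1) yields a Series of lists by default, even when the list length matches the number of input columns, so the two-column assignment may not behave as described on current versions. A sketch that should work regardless of how many input columns there are passes result_type="expand". The sample heights are made up, and row.iloc is used here instead of x[0] to avoid relying on positional indexing by integer key.

```python
import pandas as pd

# Made-up sample data for illustration.
df = pd.DataFrame(
    {
        "height_at_age_10": [140.0, 135.0],
        "height_at_age_20": [180.0, 165.0],
    }
)


def mean_and_sum(row):
    # row is a Series holding one row of the selected columns.
    total = row.iloc[0] + row.iloc[1]
    return [total / 2, total]


# result_type="expand" spreads each returned list across separate columns.
df[["Mean", "Sum"]] = df[["height_at_age_10", "height_at_age_20"]].apply(
    mean_and_sum, axis=1, result_type="expand"
)
print(df)
```

With result_type="expand" the number of returned values no longer has to match the number of input columns; it only has to match the number of columns on the left-hand side of the assignment.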

User · Answer

Just use result_type="expand":

    df = pd.DataFrame(np.random.randint(0, 10, (10, 2)), columns=["random", "a"])
    df[["sq_a", "cube_a"]] = df.apply(lambda x: [x.a**2, x.a**3], axis=1, result_type="expand")

User · Answer

You can return the entire row instead of values:

    df = df.apply(extract_text_features, axis=1)

where the function returns the row:

    def extract_text_features(row):
        row['new_col1'] = value1
        row['new_col2'] = value2
        return row
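The snippet above uses value1 and value2 as placeholders. A filled-in sketch might look like the following, where the column name textcol and the two features (first character and word count) are hypothetical. The row is copied before being modified, as a precaution against mutating the object that apply passes in.

```python
import pandas as pd


def extract_text_features(row):
    # Work on a copy so the object handed over by apply is not mutated.
    row = row.copy()
    text = row["textcol"]
    row["first_char"] = text[0]           # hypothetical feature 1
    row["n_words"] = len(text.split())    # hypothetical feature 2
    return row


df = pd.DataFrame({"textcol": ["she ran", "they hiked far"]})

# Each returned Series becomes one row; its new labels become new columns.
df = df.apply(extract_text_features, axis=1)
print(df)
```

This reads nicely, but it builds a Series per row, so on large frames it is likely to be slower than the direct column assignments shown in other answers.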

User · Answer

I have a more complicated situation: the dataset has a nested structure.

    import json
    data = '{"TextID":{"0":"0038f0569e","1":"003eb6998d","2":"006da49ea0"},"Summary":{"0":{"Crisis_Level":["c"],"Type":["d"],"Special_Date":["a"]},"1":{"Crisis_Level":["d"],"Type":["a","d"],"Special_Date":["a"]},"2":{"Crisis_Level":["d"],"Type":["a"],"Special_Date":["a"]}}}'
    df = pd.DataFrame.from_dict(json.loads(data))
    print(df)

output:

           TextID                                            Summary
    0  0038f0569e  {'Crisis_Level': ['c'], 'Type': ['d'], 'Specia...
    1  003eb6998d  {'Crisis_Level': ['d'], 'Type': ['a', 'd'], 'S...
    2  006da49ea0  {'Crisis_Level': ['d'], 'Type': ['a'], 'Specia...

The Summary column contains dict objects, so I use apply with from_dict and stack to extract each row of the dict:

    df2 = df.apply(
        lambda x: pd.DataFrame.from_dict(x[1], orient='index').stack(), axis=1)
    print(df2)

output:

      Crisis_Level Special_Date Type
                 0            0    0    1
    0            c            a    d  NaN
    1            d            a    a    d
    2            d            a    a  NaN

Looks good, but the TextID column is missing. To get the TextID column back, I've tried three approaches.

1. Modify apply to return multiple columns:

    df_tmp = df.copy()

    df_tmp[['TextID', 'Summary']] = df.apply(
        lambda x: pd.Series([x[0], pd.DataFrame.from_dict(x[1], orient='index').stack()]), axis=1)
    print(df_tmp)

output:

           TextID                                            Summary
    0  0038f0569e  Crisis_Level  0    c
    Type          0    d
    Spec...
    1  003eb6998d  Crisis_Level  0    d
    Type          0    a
         ...
    2  006da49ea0  Crisis_Level  0    d
    Type          0    a
    Spec...

But this is not what I want: the Summary structure is flattened.

2. Use pd.concat:

    df_tmp2 = pd.concat([df['TextID'], df2], axis=1)
    print(df_tmp2)

output:

           TextID (Crisis_Level, 0) (Special_Date, 0) (Type, 0) (Type, 1)
    0  0038f0569e                 c                 a         d       NaN
    1  003eb6998d                 d                 a         a         d
    2  006da49ea0                 d                 a         a       NaN

Looks fine; the MultiIndex column structure is preserved as tuples. But check the columns type:

    df_tmp2.columns

output:

    Index(['TextID', ('Crisis_Level', 0), ('Special_Date', 0), ('Type', 0),
           ('Type', 1)],
          dtype='object')

It is just a regular Index, not a MultiIndex.

3. Use set_index:

Turn all the columns you want to preserve into the row index, run the complicated apply function, and then reset_index to get the columns back:

    df_tmp3 = df.set_index('TextID')

    df_tmp3 = df_tmp3.apply(
        lambda x: pd.DataFrame.from_dict(x[0], orient='index').stack(), axis=1)

    df_tmp3 = df_tmp3.reset_index(level=0)
    print(df_tmp3)

output:

           TextID Crisis_Level Special_Date Type
                             0            0    0    1
    0  0038f0569e            c            a    d  NaN
    1  003eb6998d            d            a    a    d
    2  006da49ea0            d            a    a  NaN

Check the type of the columns:

    df_tmp3.columns

output:

    MultiIndex(levels=[['Crisis_Level', 'Special_Date', 'Type', 'TextID'], [0, 1, '']],
               codes=[[3, 0, 1, 2, 2], [2, 0, 0, 0, 1]])

So, if your apply function returns MultiIndex columns and you want to preserve them, you may want to try the third method.

User · Answer

For me this worked:

Input df:

    df = pd.DataFrame({'col x': [1, 2, 3]})
       col x
    0      1
    1      2
    2      3

Function:

    def f(x):
        return pd.Series([x*x, x*x*x])

Create 2 new columns:

    df[['square x', 'cube x']] = df['col x'].apply(f)

Output:

       col x  square x  cube x
    0      1         1       1
    1      2         4       8
    2      3         9      27

User · Answer

This is the correct and easiest way to accomplish this for 95% of use cases:

    >>> df = pd.DataFrame(zip(*[range(10)]), columns=['num'])
    >>> df
        num
    0    0
    1    1
    2    2
    3    3
    4    4
    5    5

    >>> def example(x):
    ...     x['p1'] = x['num']**2
    ...     x['p2'] = x['num']**3
    ...     x['p3'] = x['num']**4
    ...     return x

    >>> df = df.apply(example, axis=1)
    >>> df
        num  p1  p2   p3
    0    0   0   0    0
    1    1   1   1    1
    2    2   4   8   16
    3    3   9  27   81
    4    4  16  64  256

User · Answer

I have posted the same answer in two other similar questions. The way I prefer to do this is to wrap up the return values of the function in a Series:

    def f(x):
        return pd.Series([x**2, x**3])

And then use apply as follows to create separate columns:

    df[['x**2', 'x**3']] = df.apply(lambda row: f(row['x']), axis=1)
