Pandas create two new columns in a dataframe with values calculated from a pre-existing column

Question

I am working with the pandas library and I want to add two new columns to a dataframe df with n columns  n   0   These new columns result from the application of a function to one of the columns in the dataframe   The function to apply is like   def calculate x          operate        return z  y   One method for creating a new column for a function returning only a value is   df  new col      df  column A   map a function    So  what I want  and tried unsuccesfully      is something like    df  new col zetas    df  new col ys      df  column A   map calculate    What the best way to accomplish this could be   I scanned the documentation with no clue      df  column A   map calculate  returns a pandas Series each item consisting of a tuple z  y  And trying to assign this to two dataframe columns produces a ValueError

User · Answer

The top answer is flawed in my opinion  Hopefully  no one is mass importing all of pandas into their namespace with from pandas import    Also  the map method should be reserved for those times when passing it a dictionary or Series  It can take a function but this is what apply is used for   So  if you must use the above approach  I would write it like this  df  A1    df  A2     zip  df  a   apply calculate     There s actually no reason to use zip here  You can simply do this   df  A1    df  A2     calculate df  a      This second method is also much faster on larger DataFrames  df   pd DataFrame   a    1 2 3    100000   b    2 3 4    100000     DataFrame created with 300 000 rows   timeit df  A1    df  A2     calculate df  a    2 65 ms    92 4   s per loop  mean    std  dev  of 7 runs  100 loops each    timeit df  A1    df  A2     zip  df  a   apply calculate   159 ms    5 24 ms per loop  mean    std  dev  of 7 runs  10 loops each    60x faster than zip    In general  avoid using apply  Apply is generally not much faster than iterating over a Python list  Let s test the performance of a for-loop to do the same thing as above    timeit A1  A2          for val in df  a        A1 append val  2      A2 append val  3   df  A1     A1 df  A2     A2  298 ms    7 14 ms per loop  mean    std  dev  of 7 runs  1 loop each    So this is twice as slow which isn t a terrible performance regression  but if we cythonize the above  we get much better performance  Assuming  you are using ipython    load ext cython    cython cpdef power vals       A1  A2              cdef double val     for val in vals          A1 append val  2          A2 append val  3       return A1  A2   timeit df  A1    df  A2     power df  a    72 7 ms    2 16 ms per loop  mean    std  dev  of 7 runs  10 loops each      Directly assigning without apply  You can get even greater speed improvements if you use the direct vectorized operations    timeit df  A1    df  A2     df  a      2  df  a      3 5 13 ms    320   s per loop  mean    std  dev  of 7 runs  100 loops each    This takes advantage of NumPy s extremely fast vectorized operations instead of our loops  We now have a 30x speedup over the original     The simplest speed test with apply  The above example should clearly show how slow apply can be  but just so its extra clear let s look at the most basic example  Let s square a Series of 10 million numbers with and without apply  s   pd Series np random rand 10000000     timeit s apply calc  3 3 s    57 4 ms per loop  mean    std  dev  of 7 runs  1 loop each    Without apply is 50x faster   timeit s    2 66 ms    2 ms per loop  mean    std  dev  of 7 runs  10 loops each

User · Answer

I d just use zip   In  1   from pandas import    In  2   def calculate x               return x 2  x 3           In  3   df   DataFrame   a    1 2 3    b    2 3 4     In  4   df Out 4       a  b 0  1  2 1  2  3 2  3  4  In  5   df  A1    df  A2     zip  df  a   map calculate    In  6   df Out 6       a  b  A1  A2 0  1  2   2   3 1  2  3   4   6 2  3  4   6   9

[python] Pandas: create two new columns in a dataframe with values calculated from a pre-existing column

In general, avoid using apply

Directly assigning without apply

The simplest speed test with `apply`

Examples related to python

Examples related to pandas