pandas: How do I split text in a column into multiple rows?

Question

I'm working with a large CSV file, and the next-to-last column has a string of text that I want to split by a specific delimiter. I was wondering if there is a simple way to do this using pandas or Python?

    CustNum  CustomerName     ItemQty  Item   Seatblocks                 ItemExt
    32363    McCartney, Paul      3     F04    2:218:10:4,6                   60
    31316    Lennon, John        25     F01    1:13:36:1,12 1:13:37:1,13     300

I want to split by the space (' ') and then the colon (':') in the Seatblocks column, but each cell would result in a different number of columns. I have a function to rearrange the columns so the Seatblocks column is at the end of the sheet, but I'm not sure what to do from there. I can do it in Excel with the built-in text-to-columns function and a quick macro, but my dataset has too many records for Excel to handle.

Ultimately, I want to take records such as John Lennon's and create multiple lines, with the info from each set of seats on a separate line.

User · Answer

Unlike Dan, I consider his answer quite elegant... but unfortunately it is also very, very inefficient. So, since the question mentioned "a large csv file", let me suggest trying Dan's solution in a shell:

    time python -c "import pandas as pd;
    df = pd.DataFrame(['a b c']*100000, columns=['col']);
    print(df['col'].apply(lambda x : pd.Series(x.split(' '))).head())"

... compared to this alternative:

    time python -c "import pandas as pd;
    from numpy import concatenate;
    df = pd.DataFrame(['a b c']*100000, columns=['col']);
    print(pd.DataFrame(concatenate(df['col'].apply(lambda x : [x.split(' ')]))).head())"

... and this:

    time python -c "import pandas as pd;
    df = pd.DataFrame(['a b c']*100000, columns=['col']);
    print(pd.DataFrame(dict(zip(range(3), [df['col'].apply(lambda x : x.split(' ')[i]) for i in range(3)]))).head())"

The second simply refrains from allocating 100 000 Series, and this is enough to make it around 10 times faster. But the third solution, which somewhat ironically wastes a lot of calls to str.split() (it is called once per column per row, so three times more often than in the other two solutions), is around 40 times faster than the first, because it also avoids instantiating the 100 000 lists. And yes, it is certainly a little ugly...

EDIT: this answer suggests how to use tolist() and avoid the need for a lambda. The result is something like

    time python -c "import pandas as pd;
    df = pd.DataFrame(['a b c']*100000, columns=['col']);
    print(pd.DataFrame(df.col.str.split().tolist()).head())"

which is even more efficient than the third solution, and certainly much more elegant.

EDIT: the even simpler

    time python -c "import pandas as pd;
    df = pd.DataFrame(['a b c']*100000, columns=['col']);
    print(pd.DataFrame(list(df.col.str.split())).head())"

works too, and is almost as efficient.

EDIT: even simpler! And it handles NaNs (but is less efficient):

    time python -c "import pandas as pd;
    df = pd.DataFrame(['a b c']*100000, columns=['col']);
    print(df.col.str.split(expand=True).head())"
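
For anyone who would rather time these from inside Python than from the shell, here is a rough sketch using the standard timeit module. The labels in the dictionary are made up for this example, and the row count is reduced to 10,000 (an assumption, not the figure used above) so that the slow variant finishes quickly. Absolute numbers will vary with the machine and the pandas version, so treat them as indicative only.

```python
import timeit

import pandas as pd

# Smaller stand-in for the 100,000-row column used in the shell commands.
df = pd.DataFrame(["a b c"] * 10_000, columns=["col"])

# Three of the variants discussed above, keyed by made-up labels.
candidates = {
    "apply_series": lambda: df["col"].apply(lambda x: pd.Series(x.split(" "))),
    "split_tolist": lambda: pd.DataFrame(df["col"].str.split().tolist()),
    "split_expand": lambda: df["col"].str.split(expand=True),
}

# One run each; increase `number` for more stable measurements.
timings = {name: timeit.timeit(fn, number=1) for name, fn in candidates.items()}

# Print the variants from fastest to slowest.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds:.3f} s")
```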

User · Answer

Another approach would be like this:

    import numpy as np

    temp = df['Seatblocks'].str.split(' ')
    df = df.reindex(df.index.repeat(temp.apply(len)))
    df['new_Seatblocks'] = np.hstack(temp)
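
A minimal end-to-end sketch of the same repeat-the-rows idea, assuming the two sample records from the question (trimmed to two columns for brevity). The flattening step here uses a plain list comprehension rather than np.hstack, which is a substitution on my part, and the names `out` and `new_Seatblocks` are only illustrative.

```python
import pandas as pd

# Two-column excerpt of the sample data from the question.
df = pd.DataFrame({
    "CustNum": [32363, 31316],
    "Seatblocks": ["2:218:10:4,6", "1:13:36:1,12 1:13:37:1,13"],
})

# One list of seat blocks per original row.
temp = df["Seatblocks"].str.split(" ")

# Repeat each row once per seat block it contains.
out = df.loc[df.index.repeat(temp.str.len())].copy()

# Flatten the lists in row order so they line up with the repeated rows.
out["new_Seatblocks"] = [block for blocks in temp for block in blocks]

print(out)
```

Note that the original Seatblocks column is kept unchanged next to the new one, so it may need to be dropped afterwards.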

User · Answer

This splits the Seatblocks by space and gives each its own row.

    In [43]: df
    Out[43]:
       CustNum     CustomerName  ItemQty Item                 Seatblocks  ItemExt
    0    32363  McCartney, Paul        3  F04               2:218:10:4,6       60
    1    31316     Lennon, John       25  F01  1:13:36:1,12 1:13:37:1,13      300

    In [44]: s = df['Seatblocks'].str.split(' ').apply(pd.Series).stack()

    In [45]: s.index = s.index.droplevel(-1)  # to line up with df's index

    In [46]: s.name = 'Seatblocks'  # needs a name to join

    In [47]: s
    Out[47]:
    0    2:218:10:4,6
    1    1:13:36:1,12
    1    1:13:37:1,13
    Name: Seatblocks, dtype: object

    In [48]: del df['Seatblocks']

    In [49]: df.join(s)
    Out[49]:
       CustNum     CustomerName  ItemQty Item  ItemExt    Seatblocks
    0    32363  McCartney, Paul        3  F04       60  2:218:10:4,6
    1    31316     Lennon, John       25  F01      300  1:13:36:1,12
    1    31316     Lennon, John       25  F01      300  1:13:37:1,13

Or, to give each colon-separated string its own column:

    In [50]: df.join(s.apply(lambda x: pd.Series(x.split(':'))))
    Out[50]:
       CustNum     CustomerName  ItemQty Item  ItemExt  0    1   2     3
    0    32363  McCartney, Paul        3  F04       60  2  218  10   4,6
    1    31316     Lennon, John       25  F01      300  1   13  36  1,12
    1    31316     Lennon, John       25  F01      300  1   13  37  1,13

This is a little ugly, but maybe someone will chime in with a prettier solution.
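
The split, stack and join steps can be wrapped into a small reusable function. This is only a sketch: `split_to_rows` is a hypothetical name, not a pandas API, and it uses str.split(expand=True) in place of apply(pd.Series), with an explicit dropna() so that the padding from shorter rows is discarded regardless of how stack() treats missing values in a given pandas version.

```python
import pandas as pd


def split_to_rows(df, column, sep=" "):
    """Hypothetical helper: one output row per `sep`-separated piece of `column`."""
    # Wide frame of pieces, stacked into a long Series; drop the padding.
    pieces = df[column].str.split(sep, expand=True).stack().dropna()
    # Drop the inner level so the index lines up with df's index.
    pieces.index = pieces.index.droplevel(-1)
    # The Series needs a name to be joined as a column.
    pieces.name = column
    return df.drop(columns=[column]).join(pieces)


df = pd.DataFrame({
    "CustNum": [32363, 31316],
    "CustomerName": ["McCartney, Paul", "Lennon, John"],
    "Seatblocks": ["2:218:10:4,6", "1:13:36:1,12 1:13:37:1,13"],
})

result = split_to_rows(df, "Seatblocks")
print(result)
```

The original index labels are kept (and therefore repeated), which makes it easy to trace each output row back to its source record.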

User · Answer

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({'ItemQty': {0: 3, 1: 25},
                       'Seatblocks': {0: '2:218:10:4,6', 1: '1:13:36:1,12 1:13:37:1,13'},
                       'ItemExt': {0: 60, 1: 300},
                       'CustomerName': {0: 'McCartney, Paul', 1: 'Lennon, John'},
                       'CustNum': {0: 32363, 1: 31316},
                       'Item': {0: 'F04', 1: 'F01'}},
                      columns=['CustNum', 'CustomerName', 'ItemQty', 'Item', 'Seatblocks', 'ItemExt'])

    print(df)
       CustNum     CustomerName  ItemQty Item                 Seatblocks  ItemExt
    0    32363  McCartney, Paul        3  F04               2:218:10:4,6       60
    1    31316     Lennon, John       25  F01  1:13:36:1,12 1:13:37:1,13      300

Another similar solution, with chaining, is to use reset_index and rename:

    print(df.drop('Seatblocks', axis=1)
            .join
            (
            df.Seatblocks
            .str
            .split(expand=True)
            .stack()
            .reset_index(drop=True, level=1)
            .rename('Seatblocks')
            ))

       CustNum     CustomerName  ItemQty Item  ItemExt    Seatblocks
    0    32363  McCartney, Paul        3  F04       60  2:218:10:4,6
    1    31316     Lennon, John       25  F01      300  1:13:36:1,12
    1    31316     Lennon, John       25  F01      300  1:13:37:1,13

If the column contains no NaN values, the fastest solution is a list comprehension with the DataFrame constructor:

    df = pd.DataFrame(['a b c']*100000, columns=['col'])

    In [141]: %timeit (pd.DataFrame(dict(zip(range(3), [df['col'].apply(lambda x : x.split(' ')[i]) for i in range(3)]))))
    1 loop, best of 3: 211 ms per loop

    In [142]: %timeit (pd.DataFrame(df.col.str.split().tolist()))
    10 loops, best of 3: 87.8 ms per loop

    In [143]: %timeit (pd.DataFrame(list(df.col.str.split())))
    10 loops, best of 3: 86.1 ms per loop

    In [144]: %timeit (df.col.str.split(expand=True))
    10 loops, best of 3: 156 ms per loop

    In [145]: %timeit (pd.DataFrame([x.split() for x in df['col'].tolist()]))
    10 loops, best of 3: 54.1 ms per loop

But if the column contains NaN, only str.split with the parameter expand=True works. It returns a DataFrame (documentation), which explains why it is slower:

    df = pd.DataFrame(['a b c']*10, columns=['col'])
    df.loc[0] = np.nan
    print(df.head())
         col
    0    NaN
    1  a b c
    2  a b c
    3  a b c
    4  a b c

    print(df.col.str.split(expand=True))
         0     1     2
    0  NaN  None  None
    1    a     b     c
    2    a     b     c
    3    a     b     c
    4    a     b     c
    5    a     b     c
    6    a     b     c
    7    a     b     c
    8    a     b     c
    9    a     b     c
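
To make the NaN point concrete, here is a small sketch showing the list comprehension failing on a missing value while expand=True copes with it. The last part is one possible workaround, not something taken from the timings above: run the comprehension only on the non-missing rows and reindex back, which I would assume keeps most of the speed advantage, though I have not benchmarked it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(["a b c"] * 3, columns=["col"])
df.loc[0, "col"] = np.nan

# str.split(expand=True) tolerates the missing value.
expanded = df["col"].str.split(expand=True)

# The list comprehension calls .split() on the missing value and fails.
try:
    pd.DataFrame([x.split() for x in df["col"].tolist()])
    comprehension_ok = True
except AttributeError:
    comprehension_ok = False

# Possible workaround: split only the valid rows, then restore the others.
valid = df["col"].dropna()
fast = pd.DataFrame([x.split() for x in valid], index=valid.index).reindex(df.index)

print(comprehension_ok)
print(fast)
```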

User · Answer

It may be late to answer this question, but I hope to document two good features from pandas: pandas.Series.str.split() with a regular expression, and pandas.Series.explode().

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(
        {'CustNum': [32363, 31316],
         'CustomerName': ['McCartney, Paul', 'Lennon, John'],
         'ItemQty': [3, 25],
         'Item': ['F04', 'F01'],
         'Seatblocks': ['2:218:10:4,6', '1:13:36:1,12 1:13:37:1,13'],
         'ItemExt': [60, 360]
        }
    )

    print(df)
    print('-'*80+'\n')

    # Split on either a space or a colon, then give each piece its own row
    df['Seatblocks'] = df['Seatblocks'].str.split('[ :]')
    df = df.explode('Seatblocks').reset_index(drop=True)

    # Move CustomerName to the end
    cols = list(df.columns)
    cols.append(cols.pop(cols.index('CustomerName')))
    df = df[cols]

    print(df)
    print('='*80+'\n')
    print(df[df['CustomerName'] == 'Lennon, John'])

The output is:

       CustNum     CustomerName  ItemQty Item                 Seatblocks  ItemExt
    0    32363  McCartney, Paul        3  F04               2:218:10:4,6       60
    1    31316     Lennon, John       25  F01  1:13:36:1,12 1:13:37:1,13      360
    --------------------------------------------------------------------------------

        CustNum  ItemQty Item Seatblocks  ItemExt     CustomerName
    0     32363        3  F04          2       60  McCartney, Paul
    1     32363        3  F04        218       60  McCartney, Paul
    2     32363        3  F04         10       60  McCartney, Paul
    3     32363        3  F04        4,6       60  McCartney, Paul
    4     31316       25  F01          1      360     Lennon, John
    5     31316       25  F01         13      360     Lennon, John
    6     31316       25  F01         36      360     Lennon, John
    7     31316       25  F01       1,12      360     Lennon, John
    8     31316       25  F01          1      360     Lennon, John
    9     31316       25  F01         13      360     Lennon, John
    10    31316       25  F01         37      360     Lennon, John
    11    31316       25  F01       1,13      360     Lennon, John
    ================================================================================

        CustNum  ItemQty Item Seatblocks  ItemExt  CustomerName
    4     31316       25  F01          1      360  Lennon, John
    5     31316       25  F01         13      360  Lennon, John
    6     31316       25  F01         36      360  Lennon, John
    7     31316       25  F01       1,12      360  Lennon, John
    8     31316       25  F01          1      360  Lennon, John
    9     31316       25  F01         13      360  Lennon, John
    10    31316       25  F01         37      360  Lennon, John
    11    31316       25  F01       1,13      360  Lennon, John
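
If the goal is closer to the question's wording (one row per seat block, then one column per colon-separated field), explode() and str.split(expand=True) can be combined. This is a sketch under a few assumptions: it needs a pandas version that has explode(), the sample is trimmed to three columns, and the `part_` column prefix is invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "CustNum": [32363, 31316],
    "CustomerName": ["McCartney, Paul", "Lennon, John"],
    "Seatblocks": ["2:218:10:4,6", "1:13:36:1,12 1:13:37:1,13"],
})

# Step 1: one row per space-separated seat block.
rows = (
    df.assign(Seatblocks=df["Seatblocks"].str.split(" "))
      .explode("Seatblocks")
      .reset_index(drop=True)
)

# Step 2: one column per colon-separated field ("part_" is a made-up prefix).
fields = rows["Seatblocks"].str.split(":", expand=True).add_prefix("part_")

# The index is unique after reset_index, so the join is one-to-one.
result = rows.drop(columns="Seatblocks").join(fields)

print(result)
```

Resetting the index before the second split matters here: joining two frames that both carry repeated index labels would multiply rows instead of aligning them.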

User · Answer

This seems a far easier method than those suggested elsewhere in this thread: "split rows in pandas dataframe".

User · Answer

You can also use groupby(), with no need to join and stack().

Using the example data above:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({'ItemQty': {0: 3, 1: 25},
                       'Seatblocks': {0: '2:218:10:4,6', 1: '1:13:36:1,12 1:13:37:1,13'},
                       'ItemExt': {0: 60, 1: 300},
                       'CustomerName': {0: 'McCartney, Paul', 1: 'Lennon, John'},
                       'CustNum': {0: 32363, 1: 31316},
                       'Item': {0: 'F04', 1: 'F01'}},
                      columns=['CustNum', 'CustomerName', 'ItemQty', 'Item', 'Seatblocks', 'ItemExt'])
    print(df)

       CustNum     CustomerName  ItemQty Item                 Seatblocks  ItemExt
    0    32363  McCartney, Paul        3  F04               2:218:10:4,6       60
    1    31316     Lennon, John       25  F01  1:13:36:1,12 1:13:37:1,13      300

    # First define a function: given a Series of strings, split each element into a new Series
    def split_series(ser, sep):
        return pd.Series(ser.str.cat(sep=sep).split(sep=sep))

    # Test the function
    split_series(pd.Series(['a b', 'c']), sep=' ')
    0    a
    1    b
    2    c
    dtype: object

    df2 = (df.groupby(df.columns.drop('Seatblocks').tolist())  # group by all but one column
             ['Seatblocks']                                     # select the column to be split
             .apply(split_series, sep=' ')                      # split 'Seatblocks' in each group
             .reset_index(drop=True, level=-1).reset_index())   # remove the extra index created

    print(df2)
       CustNum     CustomerName  ItemQty Item  ItemExt    Seatblocks
    0    31316     Lennon, John       25  F01      300  1:13:36:1,12
    1    31316     Lennon, John       25  F01      300  1:13:37:1,13
    2    32363  McCartney, Paul        3  F04       60  2:218:10:4,6
