Split (explode) pandas dataframe string entry to separate rows

Question

I have a pandas dataframe in which one column of text strings contains comma-separated values. I want to split each CSV field and create a new row per entry (assume that the CSV values are clean and need only be split on ','). For example, a should become b:

    In [7]: a
    Out[7]:
        var1  var2
    0  a,b,c     1
    1  d,e,f     2

    In [8]: b
    Out[8]:
      var1  var2
    0    a     1
    1    b     1
    2    c     1
    3    d     2
    4    e     2
    5    f     2

So far, I have tried various simple functions, but the .apply method seems to only accept one row as return value when it is used on an axis, and I can't get .transform to work. Any suggestions would be much appreciated!

Example data:

    from pandas import DataFrame
    import numpy as np
    a = DataFrame([{'var1': 'a,b,c', 'var2': 1},
                   {'var1': 'd,e,f', 'var2': 2}])
    b = DataFrame([{'var1': 'a', 'var2': 1},
                   {'var1': 'b', 'var2': 1},
                   {'var1': 'c', 'var2': 1},
                   {'var1': 'd', 'var2': 2},
                   {'var1': 'e', 'var2': 2},
                   {'var1': 'f', 'var2': 2}])

I know this won't work because we lose DataFrame meta-data by going through numpy, but it should give you a sense of what I tried to do:

    def fun(row):
        letters = row['var1']
        letters = letters.split(',')
        out = np.array([row] * len(letters))
        out['var1'] = letters
    a['idx'] = range(a.shape[0])
    z = a.groupby('idx')
    z.transform(fun)

User · Answer

I had a similar problem. My solution was to convert the dataframe to a list of dictionaries first, then do the transformation. Here is the function:

    import re
    import pandas as pd

    def separate_row(df, column_name):
        ls = []
        for row_dict in df.to_dict('records'):
            # one output row per comma-separated piece
            for word in re.split(',', row_dict[column_name]):
                row = row_dict.copy()
                row[column_name] = word
                ls.append(row)
        return pd.DataFrame(ls)

Example:

    >>> from pandas import DataFrame
    >>> import numpy as np
    >>> a = DataFrame([{'var1': 'a,b,c', 'var2': 1},
    ...                {'var1': 'd,e,f', 'var2': 2}])
    >>> a
        var1  var2
    0  a,b,c     1
    1  d,e,f     2
    >>> separate_row(a, "var1")
      var1  var2
    0    a     1
    1    b     1
    2    c     1
    3    d     2
    4    e     2
    5    f     2

You can also change the function a bit to support separating list type rows.
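The list-type variant mentioned above is not spelled out in the answer. One possible shape is sketched below; the name separate_list_row is hypothetical, and the sketch assumes every cell in the column already holds a list or tuple.

```python
import pandas as pd

def separate_list_row(df, column_name):
    """Hypothetical variant of separate_row for columns that hold lists."""
    rows = []
    for row_dict in df.to_dict("records"):
        # emit one output row per list element, copying the other columns
        for item in row_dict[column_name]:
            row = row_dict.copy()
            row[column_name] = item
            rows.append(row)
    return pd.DataFrame(rows)

a = pd.DataFrame({"var1": [["a", "b", "c"], ["d", "e", "f"]], "var2": [1, 2]})
result = separate_list_row(a, "var1")
```

The only change from the string version is that the inner loop iterates over the cell directly instead of over the result of re.split.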

User · Answer

It is possible to split and explode the dataframe without changing the structure of the dataframe.

Split and expand data of specific columns

Input:

        var1    var2
    0   a,b,c   1
    1   d,e,f   2

    # Get the indexes which are repetitive with the split
    df['var1'] = df['var1'].str.split(',')
    df = df.explode('var1')

Out:

        var1    var2
    0   a       1
    0   b       1
    0   c       1
    1   d       2
    1   e       2
    1   f       2

Edit-1

Split and expand rows for multiple columns

      Filename  RGB                                                RGB_type
    0 A         [[0, 1650, 6, 39], [0, 1691, 1, 59], [50, 1402...  [r, g, b]
    1 B         [[0, 1423, 16, 38], [0, 1445, 16, 46], [0, 141...  [r, g, b]

Reindex based on the reference column, then align the column values with stack:

    df = df.reindex(df.index.repeat(df['RGB_type'].apply(len)))
    df = df.groupby('Filename').apply(lambda x: x.apply(lambda y: pd.Series(y.iloc[0])))
    df.reset_index(drop=True).ffill()

Out:

                Filename  RGB_type  Top 1 colour  Top 1 frequency  Top 2 colour  Top 2 frequency
    Filename
    A        0  A         r         0             1650             6             39
             1  A         g         0             1691             1             59
             2  A         b         50            1402             49            187
    B        0  B         r         0             1423             16            38
             1  B         g         0             1445             16            46
             2  B         b         0             1419             16            39

User · Answer

I came up with a solution for dataframes with arbitrary numbers of columns (while still only separating one column's entries at a time).

    import pandas

    def splitDataFrameList(df, target_column, separator):
        ''' df = dataframe to split,
        target_column = the column containing the values to split
        separator = the symbol used to perform the split

        returns: a dataframe with each entry for the target column separated,
        with each element moved into a new row.
        The values in the other columns are duplicated across the newly divided rows.
        '''
        def splitListToRows(row, row_accumulator, target_column, separator):
            split_row = row[target_column].split(separator)
            for s in split_row:
                new_row = row.to_dict()
                new_row[target_column] = s
                row_accumulator.append(new_row)
        new_rows = []
        # apply is used only for its side effect of filling new_rows
        df.apply(splitListToRows, axis=1, args=(new_rows, target_column, separator))
        new_df = pandas.DataFrame(new_rows)
        return new_df

User · Answer

The string function split can take an optional boolean argument 'expand'.

Here is a solution using this argument:

    (a.var1
      .str.split(",", expand=True)
      .set_index(a.var2)
      .stack()
      .reset_index(level=1, drop=True)
      .reset_index()
      .rename(columns={0: "var1"}))
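To see what each step of that chain contributes, here is a self-contained sketch on data where the rows have different numbers of pieces. The explicit dropna() is my addition, not part of the answer: it removes the padding cells that expand=True creates for shorter rows, regardless of how a given pandas version's stack() treats missing values.

```python
import pandas as pd

a = pd.DataFrame({"var1": ["a,b,c", "d,e"], "var2": [1, 2]})

# One column per piece; shorter rows are padded with missing values.
wide = a["var1"].str.split(",", expand=True)

long = (
    wide.set_index(a["var2"])              # carry var2 along as the index
        .stack()                           # wide -> long, one piece per row
        .dropna()                          # discard padding cells explicitly
        .reset_index(level=1, drop=True)   # drop the piece-position level
        .reset_index(name="var1")          # var2 back to a column, values named var1
)
result = long[["var1", "var2"]]
```

With equally sized rows, as in the question, the dropna() step has nothing to remove and the result matches the chain above.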

User · Answer

Based on the excellent @DMulligan's solution, here is a generic vectorized (no loops) function which splits a column of a dataframe into multiple rows and merges it back to the original dataframe. It also uses a great generic change_column_order function from this answer.

    import pandas as pd

    def change_column_order(df, col_name, index):
        cols = df.columns.tolist()
        cols.remove(col_name)
        cols.insert(index, col_name)
        return df[cols]

    def split_df(dataframe, col_name, sep):
        orig_col_index = dataframe.columns.tolist().index(col_name)
        orig_index_name = dataframe.index.name
        orig_columns = dataframe.columns
        dataframe = dataframe.reset_index()  # we need a natural 0-based index for proper merge
        index_col_name = (set(dataframe.columns) - set(orig_columns)).pop()
        df_split = pd.DataFrame(
            pd.DataFrame(dataframe[col_name].str.split(sep).tolist())
            .stack().reset_index(level=1, drop=True), columns=[col_name])
        df = dataframe.drop(col_name, axis=1)
        df = pd.merge(df, df_split, left_index=True, right_index=True, how='inner')
        df = df.set_index(index_col_name)
        df.index.name = orig_index_name
        # merge adds the column to the last place, so we need to move it back
        return change_column_order(df, col_name, orig_col_index)

Example:

    df = pd.DataFrame([['a:b', 1, 4], ['c:d', 2, 5], ['e:f:g:h', 3, 6]],
                      columns=['Name', 'A', 'B'], index=[10, 12, 13])
    df
           Name  A  B
    10      a:b  1  4
    12      c:d  2  5
    13  e:f:g:h  3  6

    split_df(df, 'Name', ':')
       Name  A  B
    10    a  1  4
    10    b  1  4
    12    c  2  5
    12    d  2  5
    13    e  3  6
    13    f  3  6
    13    g  3  6
    13    h  3  6

Note that it preserves the original index and the order of the columns. It also works with dataframes that have a non-sequential index.

User · Answer

How about something like this:

    In [55]: pd.concat([Series(row['var2'], row['var1'].split(','))
                        for _, row in a.iterrows()]).reset_index()
    Out[55]:
      index  0
    0     a  1
    1     b  1
    2     c  1
    3     d  2
    4     e  2
    5     f  2

Then you just have to rename the columns.
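The renaming step left to the reader could be done as in the sketch below. The target names var1 and var2 come from the question; the rest follows the answer's approach, written with explicit pd. prefixes so it runs on its own.

```python
import pandas as pd

a = pd.DataFrame({"var1": ["a,b,c", "d,e,f"], "var2": [1, 2]})

# One small Series per input row: var2 repeated, indexed by the split pieces.
pieces = [
    pd.Series(row["var2"], index=row["var1"].split(","))
    for _, row in a.iterrows()
]

result = (
    pd.concat(pieces)
      .reset_index()                                  # columns are 'index' and 0
      .rename(columns={"index": "var1", 0: "var2"})   # give them the question's names
)
```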

User · Answer

Upgraded MaxU's answer with MultiIndex support:

    import numpy as np
    import pandas as pd

    def explode(df, lst_cols, fill_value='', preserve_index=False):
        """
        usage:
            In [134]: df
            Out[134]:
               aaa  myid        num          text
            0   10     1  [1, 2, 3]  [aa, bb, cc]
            1   11     2         []            []
            2   12     3     [1, 2]      [cc, dd]
            3   13     4         []            []

            In [135]: explode(df, ['num','text'], fill_value='')
            Out[135]:
               aaa  myid num text
            0   10     1   1   aa
            1   10     1   2   bb
            2   10     1   3   cc
            3   11     2
            4   12     3   1   cc
            5   12     3   2   dd
            6   13     4
        """
        # make sure `lst_cols` is list-alike
        if (lst_cols is not None
            and len(lst_cols) > 0
            and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))):
            lst_cols = [lst_cols]
        # all columns except `lst_cols`
        idx_cols = df.columns.difference(lst_cols)
        # calculate lengths of lists
        lens = df[lst_cols[0]].str.len()
        # preserve original index values
        idx = np.repeat(df.index.values, lens)
        # create "exploded" DF
        res = (pd.DataFrame({
                    col: np.repeat(df[col].values, lens)
                    for col in idx_cols},
                    index=idx)
                 .assign(**{col: np.concatenate(df.loc[lens > 0, col].values)
                            for col in lst_cols}))
        # append those rows that have empty lists
        if (lens == 0).any():
            # at least one list in cells is empty
            # (DataFrame.append was removed in pandas 2.0; pd.concat replaces it)
            res = (pd.concat([res, df.loc[lens == 0, idx_cols]], sort=False)
                     .fillna(fill_value))
        # revert the original index order
        res = res.sort_index()
        # reset index if requested
        if not preserve_index:
            res = res.reset_index(drop=True)

        # if original index is MultiIndex build the dataframe from the multiindex
        if isinstance(df.index, pd.MultiIndex):
            res = res.reindex(
                index=pd.MultiIndex.from_tuples(
                    res.index,
                    names=['number', 'color']
                )
            )
        return res

User · Answer

Similar question as: pandas: How do I split text in a column into multiple rows?

You could do:

    >> a = pd.DataFrame({"var1": "a,b,c d,e,f".split(), "var2": [1, 2]})
    >> s = a.var1.str.split(",").apply(pd.Series).stack()
    >> s.index = s.index.droplevel(-1)
    >> s.name = 'var1'  # join needs a named Series
    >> del a['var1']
    >> a.join(s)
       var2 var1
    0     1    a
    0     1    b
    0     1    c
    1     2    d
    1     2    e
    1     2    f

User · Answer

TL;DR

    import pandas as pd
    import numpy as np

    def explode_str(df, col, sep):
        s = df[col]
        i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
        return df.iloc[i].assign(**{col: sep.join(s).split(sep)})

    def explode_list(df, col):
        s = df[col]
        i = np.arange(len(s)).repeat(s.str.len())
        return df.iloc[i].assign(**{col: np.concatenate(s)})

Demonstration

    explode_str(a, 'var1', ',')

      var1  var2
    0    a     1
    0    b     1
    0    c     1
    1    d     2
    1    e     2
    1    f     2

Let's create a new dataframe d that has lists:

    d = a.assign(var1=lambda d: d.var1.str.split(','))

    explode_list(d, 'var1')

      var1  var2
    0    a     1
    0    b     1
    0    c     1
    1    d     2
    1    e     2
    1    f     2

General Comments

I'll use np.arange with repeat to produce dataframe index positions that I can use with iloc.

FAQ

Why don't I use loc?
Because the index may not be unique, and using loc will return every row that matches a queried index.

Why don't you use the values attribute and slice that?
When calling values, if the entirety of the dataframe is in one cohesive "block", Pandas will return a view of the array that is the "block". Otherwise Pandas will have to cobble together a new array. When cobbling, that array must be of a uniform dtype. Often that means returning an array with dtype object. By using iloc instead of slicing the values attribute, I spare myself from having to deal with that.

Why do you use assign?
When I use assign with the same column name that I'm exploding, I overwrite the existing column and maintain its position in the dataframe.

Why are the index values repeated?
By virtue of using iloc on repeated positions, the resulting index shows the same repeated pattern: one repeat for each element of the list or string. This can be reset with reset_index(drop=True).

For Strings

I don't want to have to split the strings prematurely. So instead I count the occurrences of the sep argument, assuming that if I were to split, the length of the resulting list would be one more than the number of separators. I then use that sep to join the strings, then split.

    def explode_str(df, col, sep):
        s = df[col]
        i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
        return df.iloc[i].assign(**{col: sep.join(s).split(sep)})

For Lists

Similar to strings, except I don't need to count occurrences of sep because it's already split. I use Numpy's concatenate to jam the lists together.

    import pandas as pd
    import numpy as np

    def explode_list(df, col):
        s = df[col]
        i = np.arange(len(s)).repeat(s.str.len())
        return df.iloc[i].assign(**{col: np.concatenate(s)})
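The FAQ point about loc and non-unique indexes can be checked with a small sketch. The two-row frame below is made up for illustration; both of its rows deliberately share the index label 0.

```python
import numpy as np
import pandas as pd

# Both rows share the index label 0.
df = pd.DataFrame({"var1": ["a,b", "c"], "var2": [1, 2]}, index=[0, 0])

counts = df["var1"].str.count(",") + 1          # pieces per row: 2 and 1
positions = np.arange(len(df)).repeat(counts)   # row positions: [0, 0, 1]

by_position = df.iloc[positions]         # picks exactly the requested positions
by_label = df.loc[df.index[positions]]   # each label 0 matches both rows
```

Selecting by position yields the three rows that are wanted, while selecting by label returns every row carrying the label each time it is requested, which is why the answer builds positions and uses iloc.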

User · Answer

Another solution that uses the Python copy package:

    import copy
    import pandas as pd

    def pandas_explode(df, column_to_explode):
        new_observations = list()
        for row in df.to_dict(orient='records'):
            explode_values = row[column_to_explode]
            del row[column_to_explode]
            if type(explode_values) is list or type(explode_values) is tuple:
                # one new observation per element of the list or tuple
                for explode_value in explode_values:
                    new_observation = copy.deepcopy(row)
                    new_observation[column_to_explode] = explode_value
                    new_observations.append(new_observation)
            else:
                # scalar cell: keep the row as it is
                new_observation = copy.deepcopy(row)
                new_observation[column_to_explode] = explode_values
                new_observations.append(new_observation)
        return_df = pd.DataFrame(new_observations)
        return return_df

    df = pandas_explode(df, column_name)

User · Answer

Here's a function I wrote for this common task. It's more efficient than the Series/stack methods. Column order and names are retained.

    def tidy_split(df, column, sep='|', keep=False):
        """
        Split the values of a column and expand so the new DataFrame has one split
        value per row. Filters rows where the column is missing.

        Params
        ------
        df : pandas.DataFrame
            dataframe with the column to split and expand
        column : str
            the column to split and expand
        sep : str
            the string used to split the column's values
        keep : bool
            whether to retain the presplit value as its own row

        Returns
        -------
        pandas.DataFrame
            Returns a dataframe with the same columns as `df`.
        """
        indexes = list()
        new_values = list()
        df = df.dropna(subset=[column])
        for i, presplit in enumerate(df[column].astype(str)):
            values = presplit.split(sep)
            if keep and len(values) > 1:
                indexes.append(i)
                new_values.append(presplit)
            for value in values:
                indexes.append(i)
                new_values.append(value)
        new_df = df.iloc[indexes, :].copy()
        new_df[column] = new_values
        return new_df

With this function, the original question is as simple as:

    tidy_split(a, 'var1', sep=',')

User · Answer

There are a lot of answers here, but I'm surprised no one has mentioned the built-in pandas explode function. Check out the link below:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html#pandas.DataFrame.explode

For some reason I was unable to access that function, so I used the code below:

    import pandas_explode
    pandas_explode.patch()
    df_zlp_people_cnt3 = df_zlp_people_cnt2.explode('people')

In my data, the people column held a series of people in each row, and I was trying to explode it. The code I have given works for list type data, so try to get your comma-separated text data into list format. Also, since my code uses built-in functions, it is much faster than custom/apply functions.

Note: You may need to install pandas_explode with pip.
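Applied to the question's data, the advice above (turn the comma-separated text into lists, then explode) might look like the following sketch. It assumes a pandas version that ships DataFrame.explode (0.25 or later) and does not use the pandas_explode package.

```python
import pandas as pd

a = pd.DataFrame({"var1": ["a,b,c", "d,e,f"], "var2": [1, 2]})

result = (
    a.assign(var1=a["var1"].str.split(","))   # comma-separated text -> lists
     .explode("var1")                         # one row per list element
     .reset_index(drop=True)                  # replace the repeated index with 0..n-1
)
```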

User · Answer

I kept running out of memory with various ways of exploding my lists, so I prepared some benchmarks to help me decide which answers to upvote. I tested five scenarios with varying proportions of the list length to the number of lists. Sharing the results below:

[Figure: Time (less is better)]

[Figure: Peak memory usage (less is better)]

Conclusions:

- @MaxU's answer (update 2), codename concatenate, offers the best speed in almost every case, while keeping the peak memory usage low;
- see @DMulligan's answer (codename stack) if you need to process lots of rows with relatively small lists and can afford increased peak memory;
- the accepted @Chang's answer works well for data frames that have a few rows but very large lists.

Full details (functions and benchmarking code) are in this GitHub gist. Please note that the benchmark problem was simplified and did not include splitting of strings into the list, which most solutions performed in a similar fashion.
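The gist itself is not reproduced here. As a rough illustration of how such a timing comparison can be set up, the sketch below times two approaches on made-up data; the function names, data shape, and repeat count are my own choices, it measures time only (not peak memory), and its numbers say nothing about the results reported above.

```python
import timeit

import numpy as np
import pandas as pd

def with_explode(df):
    # built-in explode on a column that already holds lists
    return df.explode("var1").reset_index(drop=True)

def with_concatenate(df):
    # repeat the scalar column by list length, flatten the lists with NumPy
    lens = df["var1"].str.len()
    return pd.DataFrame({
        "var1": np.concatenate(df["var1"].values),
        "var2": np.repeat(df["var2"].values, lens),
    })

# 1000 rows, each holding a three-element list (an arbitrary scenario)
df = pd.DataFrame({"var1": [["a", "b", "c"]] * 1000, "var2": range(1000)})

timings = {
    fn.__name__: timeit.timeit(lambda: fn(df), number=5)
    for fn in (with_explode, with_concatenate)
}
```

Varying the number of rows against the list length, as the answer describes, is what separates the approaches; a single scenario like this one is only a starting point.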

User · Answer

Just used jiln's excellent answer from above, but needed to expand it to split multiple columns. Thought I would share.

    import pandas as pd

    def splitDataFrameList(df, target_columns, separator):
        ''' df = dataframe to split,
        target_columns = list of the columns containing the values to split
        separator = the symbol used to perform the split

        returns: a dataframe with each entry for the target columns separated,
        with each element moved into a new row.
        The values in the other columns are duplicated across the newly divided rows.
        '''
        def splitListToRows(row, row_accumulator, target_columns, separator):
            split_rows = []
            for target_column in target_columns:
                split_rows.append(row[target_column].split(separator))
            # Separate for multiple columns
            for i in range(len(split_rows[0])):
                new_row = row.to_dict()
                for j in range(len(split_rows)):
                    new_row[target_columns[j]] = split_rows[j][i]
                row_accumulator.append(new_row)
        new_rows = []
        df.apply(splitListToRows, axis=1, args=(new_rows, target_columns, separator))
        new_df = pd.DataFrame(new_rows)
        return new_df

User · Answer

Pandas >= 0.25

Series and DataFrame methods define a .explode() method that explodes lists into separate rows. See the docs section on Exploding a list-like column.

Since you have a list of comma-separated strings, split the string on comma to get a list of elements, then call explode on that column.

    df = pd.DataFrame({'var1': ['a,b,c', 'd,e,f'], 'var2': [1, 2]})
    df
        var1  var2
    0  a,b,c     1
    1  d,e,f     2

    df.assign(var1=df['var1'].str.split(',')).explode('var1')

      var1  var2
    0    a     1
    0    b     1
    0    c     1
    1    d     2
    1    e     2
    1    f     2

Note that explode only works on a single column (for now). To explode multiple columns at once, see below.

NaNs and empty lists get the treatment they deserve without you having to jump through hoops to get it right.

    df = pd.DataFrame({'var1': ['d,e,f', '', np.nan], 'var2': [1, 2, 3]})
    df
        var1  var2
    0  d,e,f     1
    1            2
    2    NaN     3

    df['var1'].str.split(',')

    0    [d, e, f]
    1           []
    2          NaN

    df.assign(var1=df['var1'].str.split(',')).explode('var1')

      var1  var2
    0    d     1
    0    e     1
    0    f     1
    1          2  # empty list entry becomes empty string after exploding
    2  NaN     3  # NaN left un-touched

This is a serious advantage over ravel/repeat-based solutions (which ignore empty lists completely, and choke on NaNs).

Exploding Multiple Columns

Note that explode only works on a single column at a time, but you can use apply to explode multiple columns at once:

    df = pd.DataFrame({'var1': ['a,b,c', 'd,e,f'],
                       'var2': ['i,j,k', 'l,m,n'],
                       'var3': [1, 2]})
    df
        var1   var2  var3
    0  a,b,c  i,j,k     1
    1  d,e,f  l,m,n     2

    (df.set_index(['var3'])
       .apply(lambda col: col.str.split(',').explode())
       .reset_index()
       .reindex(df.columns, axis=1))

      var1 var2  var3
    0    a    i     1
    1    b    j     1
    2    c    k     1
    3    d    l     2
    4    e    m     2
    5    f    n     2

The idea is to set as the index all the columns that should NOT be exploded, then explode the remaining columns via apply. This works well when the lists are equally sized.
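Later pandas releases relaxed the single-column restriction: from version 1.3 onward, explode accepts a list of columns. The sketch below assumes such a version is installed; like the apply-based approach, it requires the lists in each row to have matching lengths.

```python
import pandas as pd

df = pd.DataFrame({
    "var1": ["a,b,c", "d,e,f"],
    "var2": ["i,j,k", "l,m,n"],
    "var3": [1, 2],
})

# Turn both text columns into list columns first.
split = df.assign(
    var1=df["var1"].str.split(","),
    var2=df["var2"].str.split(","),
)

# Explode both list columns in lockstep; ignore_index gives a fresh 0..n-1 index.
result = split.explode(["var1", "var2"], ignore_index=True)
```

Column order is kept as in the input, so no reindex step is needed afterwards.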

User · Answer

I do appreciate the answer of "Chang She", really, but the iterrows() function takes a long time on a large dataset. I faced that issue and came to this:

    # First, reset_index to make the index a column
    a = a.reset_index().rename(columns={'index': 'duplicated_idx'})

    # Get a longer series with exploded cells to rows
    series = pd.DataFrame(a['var1'].str.split(',')
                          .tolist(), index=a.duplicated_idx).stack()

    # New df from series and merge with the old one
    b = series.reset_index(level=1, drop=True).reset_index()
    b = b.rename(columns={0: 'var1'})

    # Optional & Advanced: In case there are other columns apart from var1 & var2
    b = b.merge(
        a[a.columns.difference(['var1'])],
        on='duplicated_idx')

    # Optional: Delete the "duplicated_idx" column, and reorder columns
    b = b[a.columns.difference(['duplicated_idx'])]

User · Answer

One-liner using split(___, expand=True) and the level and name arguments to reset_index():

    >>> b = a.var1.str.split(',', expand=True).set_index(a.var2).stack().reset_index(level=0, name='var1')
    >>> b
       var2 var1
    0     1    a
    1     1    b
    2     1    c
    0     2    d
    1     2    e
    2     2    f

If you need b to look exactly like in the question, you can additionally do:

    >>> b = b.reset_index(drop=True)[['var1', 'var2']]
    >>> b
      var1  var2
    0    a     1
    1    b     1
    2    c     1
    3    d     2
    4    e     2
    5    f     2

User · Answer

My version of the solution to add to this collection! :-)

    # Original problem
    from pandas import DataFrame
    import numpy as np
    a = DataFrame([{'var1': 'a,b,c', 'var2': 1},
                   {'var1': 'd,e,f', 'var2': 2}])
    b = DataFrame([{'var1': 'a', 'var2': 1},
                   {'var1': 'b', 'var2': 1},
                   {'var1': 'c', 'var2': 1},
                   {'var1': 'd', 'var2': 2},
                   {'var1': 'e', 'var2': 2},
                   {'var1': 'f', 'var2': 2}])

    ### My solution
    import pandas as pd
    import functools

    def expand_on_cols(df, fuse_cols, delim=","):
        def expand_on_col(df, fuse_col):
            col_order = df.columns
            df_expanded = pd.DataFrame(
                df.set_index([x for x in df.columns if x != fuse_col])[fuse_col]
                .apply(lambda x: x.split(delim))
                .explode()
            ).reset_index()
            return df_expanded[col_order]
        all_expanded = functools.reduce(expand_on_col, fuse_cols, df)
        return all_expanded

    assert(b.equals(expand_on_cols(a, ["var1"], delim=",")))

User · Answer

Here is a fairly straightforward method that uses the split method from the pandas str accessor and then uses NumPy to flatten each row into a single array.

The corresponding values are retrieved by repeating the non-split column the correct number of times with np.repeat.

    var1 = df.var1.str.split(',', expand=True).values.ravel()
    # integer division: np.repeat needs an integer repeat count
    var2 = np.repeat(df.var2.values, len(var1) // len(df))

    pd.DataFrame({'var1': var1,
                  'var2': var2})

      var1  var2
    0    a     1
    1    b     1
    2    c     1
    3    d     2
    4    e     2
    5    f     2
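The ravel approach above relies on every row splitting into the same number of pieces. When rows differ in length, one option is to repeat by per-row lengths instead, as in this sketch; it is a variation I am suggesting rather than part of the answer, and it assumes the column contains no missing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"var1": ["a,b,c", "d,e"], "var2": [1, 2]})

pieces = df["var1"].str.split(",")     # Series of lists, possibly ragged
lens = pieces.str.len().to_numpy()     # pieces per row: [3, 2]

result = pd.DataFrame({
    "var1": np.concatenate(pieces.to_numpy()),        # flatten all lists in order
    "var2": np.repeat(df["var2"].to_numpy(), lens),   # repeat var2 once per piece
})
```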

User · Answer

After combining a few bits and pieces from all the solutions on this page, I was able to get something like this (for someone who needs to use it right away). The parameters to the function are df (the input dataframe) and key (the column that holds the delimiter-separated string). Just replace the delimiter with yours if it is different from the semicolon ";".

    def split_df_rows_for_semicolon_separated_key(key, df):
        df = (df.set_index(df.columns.drop(key).tolist())[key]
                .str.split(';', expand=True)
                .stack()
                .reset_index()
                .rename(columns={0: key})
                .loc[:, df.columns])
        # drop rows produced by empty pieces
        df = df[df[key] != '']
        return df

User · Answer

UPDATE2: more generic vectorized function, which will work for multiple normal and multiple list columns

    import numpy as np
    import pandas as pd

    def explode(df, lst_cols, fill_value='', preserve_index=False):
        # make sure `lst_cols` is list-alike
        if (lst_cols is not None
            and len(lst_cols) > 0
            and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))):
            lst_cols = [lst_cols]
        # all columns except `lst_cols`
        idx_cols = df.columns.difference(lst_cols)
        # calculate lengths of lists
        lens = df[lst_cols[0]].str.len()
        # preserve original index values
        idx = np.repeat(df.index.values, lens)
        # create "exploded" DF
        res = (pd.DataFrame({
                    col: np.repeat(df[col].values, lens)
                    for col in idx_cols},
                    index=idx)
                 .assign(**{col: np.concatenate(df.loc[lens > 0, col].values)
                            for col in lst_cols}))
        # append those rows that have empty lists
        if (lens == 0).any():
            # at least one list in cells is empty
            # (DataFrame.append was removed in pandas 2.0; pd.concat replaces it)
            res = (pd.concat([res, df.loc[lens == 0, idx_cols]], sort=False)
                     .fillna(fill_value))
        # revert the original index order
        res = res.sort_index()
        # reset index if requested
        if not preserve_index:
            res = res.reset_index(drop=True)
        return res

Demo:

Multiple list columns - all list columns must have the same # of elements in each row:

    In [134]: df
    Out[134]:
       aaa  myid        num          text
    0   10     1  [1, 2, 3]  [aa, bb, cc]
    1   11     2         []            []
    2   12     3     [1, 2]      [cc, dd]
    3   13     4         []            []

    In [135]: explode(df, ['num','text'], fill_value='')
    Out[135]:
       aaa  myid num text
    0   10     1   1   aa
    1   10     1   2   bb
    2   10     1   3   cc
    3   11     2
    4   12     3   1   cc
    5   12     3   2   dd
    6   13     4

preserving original index values:

    In [136]: explode(df, ['num','text'], fill_value='', preserve_index=True)
    Out[136]:
       aaa  myid num text
    0   10     1   1   aa
    0   10     1   2   bb
    0   10     1   3   cc
    1   11     2
    2   12     3   1   cc
    2   12     3   2   dd
    3   13     4

Setup:

    df = pd.DataFrame({
     'aaa': {0: 10, 1: 11, 2: 12, 3: 13},
     'myid': {0: 1, 1: 2, 2: 3, 3: 4},
     'num': {0: [1, 2, 3], 1: [], 2: [1, 2], 3: []},
     'text': {0: ['aa', 'bb', 'cc'], 1: [], 2: ['cc', 'dd'], 3: []}
    })

CSV column:

    In [46]: df
    Out[46]:
            var1  var2 var3
    0      a,b,c     1   XX
    1  d,e,f,x,y     2   ZZ

    In [47]: explode(df.assign(var1=df.var1.str.split(',')), 'var1')
    Out[47]:
      var1  var2 var3
    0    a     1   XX
    1    b     1   XX
    2    c     1   XX
    3    d     2   ZZ
    4    e     2   ZZ
    5    f     2   ZZ
    6    x     2   ZZ
    7    y     2   ZZ

using this little trick we can convert a CSV-like column to a list column:

    In [48]: df.assign(var1=df.var1.str.split(','))
    Out[48]:
                  var1  var2 var3
    0        [a, b, c]     1   XX
    1  [d, e, f, x, y]     2   ZZ

UPDATE: generic vectorized approach (will work also for multiple columns):

Original DF:

    In [177]: df
    Out[177]:
            var1  var2 var3
    0      a,b,c     1   XX
    1  d,e,f,x,y     2   ZZ

Solution:

first let's convert CSV strings to lists:

    In [178]: lst_col = 'var1'

    In [179]: x = df.assign(**{lst_col: df[lst_col].str.split(',')})

    In [180]: x
    Out[180]:
                  var1  var2 var3
    0        [a, b, c]     1   XX
    1  [d, e, f, x, y]     2   ZZ

Now we can do this:

    In [181]: pd.DataFrame({
         ...:     col: np.repeat(x[col].values, x[lst_col].str.len())
         ...:     for col in x.columns.difference([lst_col])
         ...: }).assign(**{lst_col: np.concatenate(x[lst_col].values)})[x.columns.tolist()]
         ...:
    Out[181]:
      var1  var2 var3
    0    a     1   XX
    1    b     1   XX
    2    c     1   XX
    3    d     2   ZZ
    4    e     2   ZZ
    5    f     2   ZZ
    6    x     2   ZZ
    7    y     2   ZZ

OLD answer:

Inspired by @AFinkelstein's solution, I wanted to make it a bit more generalized so that it could be applied to a DF with more than two columns and be as fast (well, almost as fast) as AFinkelstein's solution:

    In [2]: df = pd.DataFrame(
       ...:    [{'var1': 'a,b,c', 'var2': 1, 'var3': 'XX'},
       ...:     {'var1': 'd,e,f,x,y', 'var2': 2, 'var3': 'ZZ'}]
       ...: )

    In [3]: df
    Out[3]:
            var1  var2 var3
    0      a,b,c     1   XX
    1  d,e,f,x,y     2   ZZ

    In [4]: (df.set_index(df.columns.drop('var1').tolist())
       ...:    .var1.str.split(',', expand=True)
       ...:    .stack()
       ...:    .reset_index()
       ...:    .rename(columns={0: 'var1'})
       ...:    .loc[:, df.columns]
       ...: )
    Out[4]:
      var1  var2 var3
    0    a     1   XX
    1    b     1   XX
    2    c     1   XX
    3    d     2   ZZ
    4    e     2   ZZ
    5    f     2   ZZ
    6    x     2   ZZ
    7    y     2   ZZ

User · Answer

After painful experimentation to find something faster than the accepted answer, I got this to work. It ran around 100x faster on the dataset I tried it on.

If someone knows a way to make this more elegant, by all means please modify my code. I couldn't find a way that works without setting the other columns you want to keep as the index and then resetting the index and renaming the columns, but I'd imagine there's something else that works.

    b = DataFrame(a.var1.str.split(',').tolist(), index=a.var2).stack()
    b = b.reset_index()[[0, 'var2']]  # var1 variable is currently labeled 0
    b.columns = ['var1', 'var2']  # renaming var1

User · Answer

I have come up with the following solution to this problem:

    def iter_var1(d):
        for _, row in d.iterrows():
            for v in row["var1"].split(","):
                yield (v, row["var2"])

    new_a = DataFrame.from_records([i for i in iter_var1(a)],
                                   columns=["var1", "var2"])
