Pandas column of lists, create a row for each list element

Question

I have a dataframe where some cells contain lists of multiple values. Rather than storing multiple values in a cell, I'd like to expand the dataframe so that each item in the list gets its own row (with the same values in all other columns). So if I have:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(
        {'trial_num': [1, 2, 3, 1, 2, 3],
         'subject': [1, 1, 1, 2, 2, 2],
         'samples': [list(np.random.randn(3).round(2)) for i in range(6)]
        }
    )

    df
    Out[10]:
                     samples  subject  trial_num
    0    [0.57, -0.83, 1.44]        1          1
    1    [-0.01, 1.13, 0.36]        1          2
    2   [1.18, -1.46, -0.94]        1          3
    3  [-0.08, -4.22, -2.05]        2          1
    4     [0.72, 0.79, 0.53]        2          2
    5    [0.4, -0.32, -0.13]        2          3

How do I convert to long form, e.g.:

       subject  trial_num  sample  sample_num
    0        1          1    0.57           0
    1        1          1   -0.83           1
    2        1          1    1.44           2
    3        1          2   -0.01           0
    4        1          2    1.13           1
    5        1          2    0.36           2
    6        1          3    1.18           0
    # etc.

The index is not important: it's OK to set existing columns as the index, and the final ordering isn't important.

User · Answer

Trying to work through Roman Pekar's solution step-by-step to understand it better, I came up with my own solution, which uses melt to avoid some of the confusing stacking and index resetting. I can't say that it's obviously a clearer solution though:

    items_as_cols = df.apply(lambda x: pd.Series(x['samples']), axis=1)
    # Keep original df index as a column so it's retained after melt
    items_as_cols['orig_index'] = items_as_cols.index

    melted_items = pd.melt(items_as_cols, id_vars='orig_index',
                           var_name='sample_num', value_name='sample')
    melted_items.set_index('orig_index', inplace=True)

    df.merge(melted_items, left_index=True, right_index=True)

Output (obviously we can drop the original samples column now):

                     samples  subject  trial_num sample_num  sample
    0    [1.84, 1.05, -0.66]        1          1          0    1.84
    0    [1.84, 1.05, -0.66]        1          1          1    1.05
    0    [1.84, 1.05, -0.66]        1          1          2   -0.66
    1    [-0.24, -0.9, 0.65]        1          2          0   -0.24
    1    [-0.24, -0.9, 0.65]        1          2          1   -0.90
    1    [-0.24, -0.9, 0.65]        1          2          2    0.65
    2    [1.15, -0.87, -1.1]        1          3          0    1.15
    2    [1.15, -0.87, -1.1]        1          3          1   -0.87
    2    [1.15, -0.87, -1.1]        1          3          2   -1.10
    3   [-0.8, -0.62, -0.68]        2          1          0   -0.80
    3   [-0.8, -0.62, -0.68]        2          1          1   -0.62
    3   [-0.8, -0.62, -0.68]        2          1          2   -0.68
    4    [0.91, -0.47, 1.43]        2          2          0    0.91
    4    [0.91, -0.47, 1.43]        2          2          1   -0.47
    4    [0.91, -0.47, 1.43]        2          2          2    1.43
    5  [-1.14, -0.24, -0.91]        2          3          0   -1.14
    5  [-1.14, -0.24, -0.91]        2          3          1   -0.24
    5  [-1.14, -0.24, -0.91]        2          3          2   -0.91

User · Answer

Also very late, but here is an answer from Karvy1 that worked well for me if you don't have pandas >= 0.25: https://stackoverflow.com/a/52511166/10740287

For the example above you may write:

    data = [(row.subject, row.trial_num, sample) for row in df.itertuples() for sample in row.samples]
    data = pd.DataFrame(data, columns=['subject', 'trial_num', 'samples'])

Speed test:

    %timeit data = pd.DataFrame([(row.subject, row.trial_num, sample) for row in df.itertuples() for sample in row.samples], columns=['subject', 'trial_num', 'samples'])
    1.33 ms ± 74.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    %timeit data = df.set_index(['subject', 'trial_num'])['samples'].apply(pd.Series).stack().reset_index()
    4.9 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %timeit data = pd.DataFrame({col:np.repeat(df[col].values, df['samples'].str.len()) for col in df.columns.drop('samples')}).assign(**{'samples':np.concatenate(df['samples'].values)})
    1.38 ms ± 25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
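The list-comprehension approach above does not produce the sample_num column the question asks for. A minimal sketch of one way to add it, assuming Python's built-in enumerate is acceptable for numbering the elements (the example data and the name long_df are made up for illustration):

```python
import pandas as pd

# Illustrative data (not from the original post); lists may differ in length.
df = pd.DataFrame({
    "trial_num": [1, 2, 3, 1, 2, 3],
    "subject": [1, 1, 1, 2, 2, 2],
    "samples": [[0.57, -0.83, 1.44], [-0.01, 1.13], [1.18],
                [], [0.72, 0.79, 0.53], [0.4, -0.32]],
})

# One output tuple per list element; enumerate supplies the element position.
rows = [
    (row.subject, row.trial_num, sample_num, sample)
    for row in df.itertuples(index=False)
    for sample_num, sample in enumerate(row.samples)
]
long_df = pd.DataFrame(rows, columns=["subject", "trial_num", "sample_num", "sample"])
print(long_df)
```

Note that a row whose list is empty contributes no output rows at all with this approach, which may or may not be what you want.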

User · Answer

You can also use pd.concat and pd.melt for this:

    >>> objs = [df, pd.DataFrame(df['samples'].tolist())]
    >>> pd.concat(objs, axis=1).drop('samples', axis=1)
       subject  trial_num     0     1     2
    0        1          1 -0.49 -1.00  0.44
    1        1          2 -0.28  1.48  2.01
    2        1          3 -0.52 -1.84  0.02
    3        2          1  1.23 -1.36 -1.06
    4        2          2  0.54  0.18  0.51
    5        2          3 -2.18 -0.13 -1.35
    >>> pd.melt(_, var_name='sample_num', value_name='sample',
    ...         value_vars=[0, 1, 2], id_vars=['subject', 'trial_num'])
        subject  trial_num sample_num  sample
    0         1          1          0   -0.49
    1         1          2          0   -0.28
    2         1          3          0   -0.52
    3         2          1          0    1.23
    4         2          2          0    0.54
    5         2          3          0   -2.18
    6         1          1          1   -1.00
    7         1          2          1    1.48
    8         1          3          1   -1.84
    9         2          1          1   -1.36
    10        2          2          1    0.18
    11        2          3          1   -0.13
    12        1          1          2    0.44
    13        1          2          2    2.01
    14        1          3          2    0.02
    15        2          1          2   -1.06
    16        2          2          2    0.51
    17        2          3          2   -1.35

Finally, if you need to, you can sort the result on the first three columns.

User · Answer

Very late answer, but I want to add this:

A fast solution using vanilla Python that also takes care of the sample_num column in OP's example. On my own large dataset with over 10 million rows and a result with 28 million rows this only takes about 38 seconds. The accepted solution completely breaks down with that amount of data and leads to a memory error on my system that has 128GB of RAM.

    df = df.reset_index(drop=True)
    lstcol = df.lstcol.values
    lstcollist = []
    indexlist = []
    countlist = []
    for ii in range(len(lstcol)):
        lstcollist.extend(lstcol[ii])
        indexlist.extend([ii]*len(lstcol[ii]))
        countlist.extend([jj for jj in range(len(lstcol[ii]))])
    df = pd.merge(df.drop("lstcol", axis=1),
                  pd.DataFrame({"lstcol": lstcollist, "lstcol_num": countlist},
                               index=indexlist),
                  left_index=True, right_index=True).reset_index(drop=True)

User · Answer

For those looking for a version of Roman Pekar's answer that avoids manual column naming:

    column_to_explode = 'samples'
    res = (df
           .set_index([x for x in df.columns if x != column_to_explode])[column_to_explode]
           .apply(pd.Series)
           .stack()
           .reset_index())
    res = res.rename(columns={
              res.columns[-2]: 'exploded_{}_index'.format(column_to_explode),
              res.columns[-1]: '{}_exploded'.format(column_to_explode)})

User · Answer

    lst_col = 'samples'

    r = pd.DataFrame({
          col:np.repeat(df[col].values, df[lst_col].str.len())
          for col in df.columns.drop(lst_col)}
        ).assign(**{lst_col:np.concatenate(df[lst_col].values)})[df.columns]

Result:

    In [103]: r
    Out[103]:
        samples  subject  trial_num
    0      0.10        1          1
    1     -0.20        1          1
    2      0.05        1          1
    3      0.25        1          2
    4      1.32        1          2
    5     -0.17        1          2
    6      0.64        1          3
    7     -0.22        1          3
    8     -0.71        1          3
    9     -0.03        2          1
    10    -0.65        2          1
    11     0.76        2          1
    12     1.77        2          2
    13     0.89        2          2
    14     0.65        2          2
    15    -0.98        2          3
    16     0.65        2          3
    17    -0.30        2          3

PS: here you may find a bit more generic solution.

UPDATE: some explanations. IMO the easiest way to understand this code is to try to execute it step-by-step.

In the following line we are repeating the values in one column N times, where N is the length of the corresponding list:

    In [10]: np.repeat(df['trial_num'].values, df[lst_col].str.len())
    Out[10]: array([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 3], dtype=int64)

This can be generalized for all columns containing scalar values:

    In [11]: pd.DataFrame({
        ...:           col:np.repeat(df[col].values, df[lst_col].str.len())
        ...:           for col in df.columns.drop(lst_col)}
        ...:         )
    Out[11]:
        trial_num  subject
    0           1        1
    1           1        1
    2           1        1
    3           2        1
    4           2        1
    5           2        1
    6           3        1
    ..        ...      ...
    11          1        2
    12          2        2
    13          2        2
    14          2        2
    15          3        2
    16          3        2
    17          3        2

    [18 rows x 2 columns]

Using np.concatenate() we can flatten all values in the list column (samples) and get a 1D vector:

    In [12]: np.concatenate(df[lst_col].values)
    Out[12]: array([-1.04, -0.58, -1.32,  0.82, -0.59, -0.34,  0.25,  2.09,  0.12,  0.83, -0.88,  0.68,  0.55, -0.56,  0.65, -0.04,  0.36, -0.31])

Putting all this together:

    In [13]: pd.DataFrame({
        ...:           col:np.repeat(df[col].values, df[lst_col].str.len())
        ...:           for col in df.columns.drop(lst_col)}
        ...:         ).assign(**{lst_col:np.concatenate(df[lst_col].values)})
    Out[13]:
        trial_num  subject  samples
    0           1        1    -1.04
    1           1        1    -0.58
    2           1        1    -1.32
    3           2        1     0.82
    4           2        1    -0.59
    5           2        1    -0.34
    6           3        1     0.25
    ..        ...      ...      ...
    11          1        2     0.68
    12          2        2     0.55
    13          2        2    -0.56
    14          2        2     0.65
    15          3        2    -0.04
    16          3        2     0.36
    17          3        2    -0.31

    [18 rows x 3 columns]

Using pd.DataFrame()[df.columns] will guarantee that we are selecting columns in the original order.
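The repeat-based steps above leave out the sample_num column from the question. A minimal sketch of one way to add it with the same building blocks, assuming every cell in the list column is a non-empty list (the example data and the names lengths and sample_num are illustrative, not from the original answer):

```python
import numpy as np
import pandas as pd

# Illustrative data with lists of different lengths.
df = pd.DataFrame({
    "trial_num": [1, 2, 3],
    "subject": [1, 1, 2],
    "samples": [[0.1, -0.2, 0.05], [0.25, 1.32], [0.64]],
})
lst_col = "samples"

# Length of each list, used as the repeat count for the scalar columns.
lengths = df[lst_col].str.len().to_numpy()

# Repeat every scalar column once per list element.
r = pd.DataFrame({
    col: np.repeat(df[col].to_numpy(), lengths)
    for col in df.columns.drop(lst_col)
})

# Flatten the lists, and number the elements 0..n-1 within each source row.
r["sample"] = np.concatenate(df[lst_col].to_numpy())
r["sample_num"] = np.concatenate([np.arange(n) for n in lengths])
print(r)
```

The element counter is built the same way as the flattened values: one small array per source row, concatenated in row order, so the two new columns stay aligned with the repeated scalar columns.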

User · Answer

Pandas >= 0.25

Series and DataFrame methods define a .explode() method that explodes lists into separate rows. See the docs section on Exploding a list-like column.

    df = pd.DataFrame({
        'var1': [['a', 'b', 'c'], ['d', 'e',], [], np.nan],
        'var2': [1, 2, 3, 4]
    })
    df
            var1  var2
    0  [a, b, c]     1
    1     [d, e]     2
    2         []     3
    3        NaN     4

    df.explode('var1')

      var1  var2
    0    a     1
    0    b     1
    0    c     1
    1    d     2
    1    e     2
    2  NaN     3  # empty list converted to NaN
    3  NaN     4  # NaN entry preserved as-is

    # to reset the index to be monotonically increasing...
    df.explode('var1').reset_index(drop=True)

      var1  var2
    0    a     1
    1    b     1
    2    c     1
    3    d     2
    4    e     2
    5  NaN     3
    6  NaN     4

Note that this also handles mixed columns of lists and scalars, as well as empty lists and NaNs, appropriately (this is a drawback of repeat-based solutions).

However, you should note that in pandas 0.25 explode only works on a single column; pandas 1.3 and later also accept a list of columns.

P.S.: if you are looking to explode a column of strings, you need to split on a separator first, then use explode. See this (very much) related answer by me.
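Applied to the question's layout, explode gives the sample values but not the sample_num column. A minimal sketch of one way to recover it, assuming pandas >= 0.25 and relying on the fact that explode repeats the original index label for every element (the example data and the name long_df are illustrative):

```python
import pandas as pd

# Illustrative data with lists of different lengths.
df = pd.DataFrame({
    "trial_num": [1, 2, 3],
    "subject": [1, 1, 2],
    "samples": [[0.57, -0.83, 1.44], [-0.01, 1.13], [1.18]],
})

long_df = df.explode("samples").rename(columns={"samples": "sample"})

# Rows from the same source list share an index label, so a running count
# within each label is the element's position in that list.
long_df["sample_num"] = long_df.groupby(level=0).cumcount().to_numpy()

# The exploded column comes back as object dtype; cast it to float.
long_df["sample"] = long_df["sample"].astype(float)
long_df = long_df.reset_index(drop=True)
print(long_df)
```

The cumcount step has to run before reset_index, because the duplicated index labels are the only record of which source row each element came from.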

User · Answer

A bit longer than I expected:

    >>> df
                    samples  subject  trial_num
    0  [-0.07, -2.9, -2.44]        1          1
    1   [-1.52, -0.35, 0.1]        1          2
    2  [-0.17, 0.57, -0.65]        1          3
    3  [-0.82, -1.06, 0.47]        2          1
    4   [0.79, 1.35, -0.09]        2          2
    5   [1.17, 1.14, -1.79]        2          3
    >>>
    >>> s = df.apply(lambda x: pd.Series(x['samples']), axis=1).stack().reset_index(level=1, drop=True)
    >>> s.name = 'sample'
    >>>
    >>> df.drop('samples', axis=1).join(s)
       subject  trial_num  sample
    0        1          1   -0.07
    0        1          1   -2.90
    0        1          1   -2.44
    1        1          2   -1.52
    1        1          2   -0.35
    1        1          2    0.10
    2        1          3   -0.17
    2        1          3    0.57
    2        1          3   -0.65
    3        2          1   -0.82
    3        2          1   -1.06
    3        2          1    0.47
    4        2          2    0.79
    4        2          2    1.35
    4        2          2   -0.09
    5        2          3    1.17
    5        2          3    1.14
    5        2          3   -1.79

If you want a sequential index, you can apply reset_index(drop=True) to the result.

Update:

    >>> res = df.set_index(['subject', 'trial_num'])['samples'].apply(pd.Series).stack()
    >>> res = res.reset_index()
    >>> res.columns = ['subject', 'trial_num', 'sample_num', 'sample']
    >>> res
        subject  trial_num  sample_num  sample
    0         1          1           0    1.89
    1         1          1           1   -2.92
    2         1          1           2    0.34
    3         1          2           0    0.85
    4         1          2           1    0.24
    5         1          2           2    0.72
    6         1          3           0   -0.96
    7         1          3           1   -2.72
    8         1          3           2   -0.11
    9         2          1           0   -1.33
    10        2          1           1    3.13
    11        2          1           2   -0.65
    12        2          2           0    0.10
    13        2          2           1    0.65
    14        2          2           2    0.15
    15        2          3           0    0.64
    16        2          3           1   -0.10
    17        2          3           2   -0.76

User · Answer

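    import pandas as pd
    df = pd.DataFrame([{'Product': 'Coke', 'Prices': '100,123,101,105,99,94,98'},
                       {'Product': 'Pepsi', 'Prices': '101,104,104,101,99,99,99'}])
    print(df)
    # Split each comma-separated string into a list, then give each price its own row
    df = df.assign(Prices=df.Prices.str.split(',')).explode('Prices')
    print(df)

Try this with pandas >= 0.25.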

User · Answer

I found the easiest way was to:

1. Convert the samples column into a DataFrame
2. Join it with the original df
3. Melt

Shown here:

    df.samples.apply(lambda x: pd.Series(x)).join(df).\
    melt(['subject', 'trial_num'], [0, 1, 2], var_name='sample')

        subject  trial_num sample  value
    0         1          1      0  -0.24
    1         1          2      0   0.14
    2         1          3      0  -0.67
    3         2          1      0  -1.52
    4         2          2      0  -0.00
    5         2          3      0  -1.73
    6         1          1      1  -0.70
    7         1          2      1  -0.70
    8         1          3      1  -0.29
    9         2          1      1  -0.70
    10        2          2      1  -0.72
    11        2          3      1   1.30
    12        1          1      2  -0.55
    13        1          2      2   0.10
    14        1          3      2  -0.44
    15        2          1      2   0.13
    16        2          2      2  -1.44
    17        2          3      2   0.73

It's worth noting that this may have only worked because each trial has the same number of samples (3). Something more clever may be necessary for trials of different sample sizes.
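For trials with different numbers of samples, one possible adjustment is sketched below. It assumes that none of the real samples are NaN, because the padding that pandas inserts for the shorter lists is removed with dropna; the example data and the names wide and long_df are illustrative.

```python
import pandas as pd

# Illustrative data with lists of different lengths.
df = pd.DataFrame({
    "trial_num": [1, 2, 3],
    "subject": [1, 1, 2],
    "samples": [[0.57, -0.83, 1.44], [-0.01, 1.13], [1.18]],
})

# One column per list position; shorter lists are padded with NaN.
wide = pd.DataFrame(df["samples"].tolist(), index=df.index)

long_df = (
    df.drop(columns="samples")
      .join(wide)
      # No value_vars given, so every non-id column (0, 1, 2, ...) is melted.
      .melt(id_vars=["subject", "trial_num"],
            var_name="sample_num", value_name="sample")
      # Assumption: NaN only ever means padding, so it is safe to drop.
      .dropna(subset=["sample"])
      .sort_values(["subject", "trial_num", "sample_num"])
      .reset_index(drop=True)
)
print(long_df)
```

Leaving out value_vars means the code does not need to know the longest list length in advance. If genuine NaN samples can occur, this sketch would drop them too, so a different approach would be needed in that case.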
