How to split a dataframe string column into two columns

Question

I have a data frame with one (string) column and I'd like to split it into two (string) columns, with one column header as 'fips' and the other 'row'.

My dataframe df looks like this:

                            row
    0       00000 UNITED STATES
    1             01000 ALABAMA
    2  01001 Autauga County, AL
    3  01003 Baldwin County, AL
    4  01005 Barbour County, AL

I do not know how to use df.row.str[:] to achieve my goal of splitting the row cell. I can use df['fips'] = hello to add a new column and populate it with hello.

Any ideas?

        fips                 row
    0  00000       UNITED STATES
    1  01000             ALABAMA
    2  01001  Autauga County, AL
    3  01003  Baldwin County, AL
    4  01005  Barbour County, AL

User · Answer

TL;DR version:

For the simple case of "I have a text column with a delimiter and I want two columns", the simplest solution is:

    df[['A', 'B']] = df['AB'].str.split(' ', n=1, expand=True)

You must use expand=True if your strings have a non-uniform number of splits and you want None to replace the missing values.

Notice how, in either case, the .tolist() method is not necessary. Neither is zip().

In detail:

Andy Hayden's solution is most excellent in demonstrating the power of the str.extract() method.

But for a simple split over a known separator (like splitting by dashes, or splitting by whitespace), the .str.split() method is enough [1]. It operates on a column (Series) of strings, and returns a column (Series) of lists:

    >>> import pandas as pd
    >>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2']})
    >>> df
          AB
    0  A1-B1
    1  A2-B2
    >>> df['AB_split'] = df['AB'].str.split('-')
    >>> df
          AB  AB_split
    0  A1-B1  [A1, B1]
    1  A2-B2  [A2, B2]

[1] If you're unsure what the first two parameters of .str.split() do, I recommend the docs for the plain Python version of the method. Note that recent pandas versions accept n only as a keyword argument, so the examples here write n=1 rather than passing 1 positionally.

But how do you go from a column containing two-element lists to two columns, each containing the respective element of the lists?

Well, we need to take a closer look at the .str attribute of a column.

It's a magical object that is used to collect methods that treat each element in a column as a string, and then apply the respective method to each element as efficiently as possible:

    >>> upper_lower_df = pd.DataFrame({"U": ["A", "B", "C"]})
    >>> upper_lower_df
       U
    0  A
    1  B
    2  C
    >>> upper_lower_df["L"] = upper_lower_df["U"].str.lower()
    >>> upper_lower_df
       U  L
    0  A  a
    1  B  b
    2  C  c

But it also has an "indexing" interface for getting each element of a string by its index:

    >>> df['AB'].str[0]
    0    A
    1    A
    Name: AB, dtype: object

    >>> df['AB'].str[1]
    0    1
    1    2
    Name: AB, dtype: object

Of course, this indexing interface of .str doesn't really care if each element it's indexing is actually a string, as long as it can be indexed, so:

    >>> df['AB'].str.split('-', n=1).str[0]
    0    A1
    1    A2
    Name: AB, dtype: object

    >>> df['AB'].str.split('-', n=1).str[1]
    0    B1
    1    B2
    Name: AB, dtype: object

Then, it's a simple matter of taking advantage of the Python tuple unpacking of iterables to do:

    >>> df['A'], df['B'] = df['AB'].str.split('-', n=1).str
    >>> df
          AB  AB_split   A   B
    0  A1-B1  [A1, B1]  A1  B1
    1  A2-B2  [A2, B2]  A2  B2

(This relies on iterating over the .str accessor, which was deprecated and, as far as I know, is no longer supported in recent pandas releases. If it raises a TypeError for you, use the expand=True form below.)

Of course, getting a DataFrame out of splitting a column of strings is so useful that the .str.split() method can do it for you with the expand=True parameter:

    >>> df['AB'].str.split('-', n=1, expand=True)
        0   1
    0  A1  B1
    1  A2  B2

So, another way of accomplishing what we wanted is to do:

    >>> df = df[['AB']]
    >>> df
          AB
    0  A1-B1
    1  A2-B2

    >>> df.join(df['AB'].str.split('-', n=1, expand=True).rename(columns={0: 'A', 1: 'B'}))
          AB   A   B
    0  A1-B1  A1  B1
    1  A2-B2  A2  B2

The expand=True version, although longer, has a distinct advantage over the tuple unpacking method. Tuple unpacking doesn't deal well with splits of different lengths:

    >>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2', 'A3-B3-C3']})
    >>> df
             AB
    0     A1-B1
    1     A2-B2
    2  A3-B3-C3
    >>> df['A'], df['B'], df['C'] = df['AB'].str.split('-')
    Traceback (most recent call last):
      [...]
    ValueError: Length of values does not match length of index
    >>>

But expand=True handles it nicely by placing None in the columns for which there aren't enough "splits":

    >>> df.join(
    ...     df['AB'].str.split('-', expand=True).rename(
    ...         columns={0: 'A', 1: 'B', 2: 'C'}
    ...     )
    ... )
             AB   A   B     C
    0     A1-B1  A1  B1  None
    1     A2-B2  A2  B2  None
    2  A3-B3-C3  A3  B3    C3
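As a small sketch of the .str indexing interface described in this answer, the split pieces can also be picked out one at a time and assigned to columns, which avoids tuple unpacking altogether. The third value ('A3', with no separator) is an assumption added here to show what happens when a piece is missing.

```python
import pandas as pd

df = pd.DataFrame({"AB": ["A1-B1", "A2-B2", "A3"]})

# Split once on the dash; each element becomes a list of one or two pieces.
parts = df["AB"].str.split("-", n=1)

# .str[i] indexes into each list; a missing position yields NaN.
df["A"] = parts.str[0]
df["B"] = parts.str[1]

print(df)
```

Rows without a second piece end up with a missing value in B instead of raising an error.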

User · Answer

I prefer exporting the corresponding pandas Series (i.e. the columns I need), using the apply function to split the column content into multiple Series, and then joining the generated columns to the existing DataFrame. Of course, the source column should be removed.

e.g.

    col1 = df["<col_name>"].apply(<function>)
    col2 = ...
    df = df.join(col1.to_frame(name="<name1>"))
    df = df.join(col2.to_frame(name="<name2>"))
    df = df.drop(["<col_name>"], axis=1)

To split two-word strings, the function should be something like this:

    lambda x: x.split(" ")[0]   # for the first element
    lambda x: x.split(" ")[-1]  # for the last element
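A runnable sketch of this apply-and-join approach on the question's data might look as follows. It assumes the text after the first space should stay together, so it uses split(" ", 1)[1] for the remainder rather than [-1]; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"row": ["00000 UNITED STATES", "01000 ALABAMA",
                           "01001 Autauga County, AL"]})

# Split once on the first space: the code, and everything after it.
fips = df["row"].apply(lambda x: x.split(" ", 1)[0])
name = df["row"].apply(lambda x: x.split(" ", 1)[1])

# Drop the source column, then join the two Series back as named columns.
result = (
    df.drop(columns=["row"])
      .join(fips.to_frame(name="fips"))
      .join(name.to_frame(name="row"))
)
print(result)
```

Dropping the source column before joining avoids a name clash, since the second new column reuses the name 'row'.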

User · Answer

You can use str.split by whitespace (the default separator) with the parameter expand=True to get a DataFrame, and assign it to new columns:

    df = pd.DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA',
                               '01001 Autauga County, AL', '01003 Baldwin County, AL',
                               '01005 Barbour County, AL']})
    print(df)
                            row
    0       00000 UNITED STATES
    1             01000 ALABAMA
    2  01001 Autauga County, AL
    3  01003 Baldwin County, AL
    4  01005 Barbour County, AL

    df[['a', 'b']] = df['row'].str.split(n=1, expand=True)
    print(df)
                            row      a                   b
    0       00000 UNITED STATES  00000       UNITED STATES
    1             01000 ALABAMA  01000             ALABAMA
    2  01001 Autauga County, AL  01001  Autauga County, AL
    3  01003 Baldwin County, AL  01003  Baldwin County, AL
    4  01005 Barbour County, AL  01005  Barbour County, AL

If you need to remove the original column, use DataFrame.pop:

    df[['a', 'b']] = df.pop('row').str.split(n=1, expand=True)
    print(df)
           a                   b
    0  00000       UNITED STATES
    1  01000             ALABAMA
    2  01001  Autauga County, AL
    3  01003  Baldwin County, AL
    4  01005  Barbour County, AL

This is the same as:

    df[['a', 'b']] = df['row'].str.split(n=1, expand=True)
    df = df.drop('row', axis=1)
    print(df)
           a                   b
    0  00000       UNITED STATES
    1  01000             ALABAMA
    2  01001  Autauga County, AL
    3  01003  Baldwin County, AL
    4  01005  Barbour County, AL

If you remove n=1, the split happens on every whitespace and you get an error:

    # n=1 removed: split on all whitespace
    df[['a', 'b']] = df['row'].str.split(expand=True)

    ValueError: Columns must be same length as key

You can check that it returns a 4-column DataFrame, not only 2:

    print(df['row'].str.split(expand=True))
           0        1        2     3
    0  00000   UNITED   STATES  None
    1  01000  ALABAMA     None  None
    2  01001  Autauga  County,    AL
    3  01003  Baldwin  County,    AL
    4  01005  Barbour  County,    AL

The solution then is to append the new DataFrame with join:

    df = pd.DataFrame({'a': range(5),
                       'row': ['00000 UNITED STATES', '01000 ALABAMA',
                               '01001 Autauga County, AL', '01003 Baldwin County, AL',
                               '01005 Barbour County, AL']})
    print(df)
       a                       row
    0  0       00000 UNITED STATES
    1  1             01000 ALABAMA
    2  2  01001 Autauga County, AL
    3  3  01003 Baldwin County, AL
    4  4  01005 Barbour County, AL

    df = df.join(df['row'].str.split(expand=True))
    print(df)
       a                       row      0        1        2     3
    0  0       00000 UNITED STATES  00000   UNITED   STATES  None
    1  1             01000 ALABAMA  01000  ALABAMA     None  None
    2  2  01001 Autauga County, AL  01001  Autauga  County,    AL
    3  3  01003 Baldwin County, AL  01003  Baldwin  County,    AL
    4  4  01005 Barbour County, AL  01005  Barbour  County,    AL

To also remove the original column (when there are other columns as well):

    df = df.join(df.pop('row').str.split(expand=True))
    print(df)
       a      0        1        2     3
    0  0  00000   UNITED   STATES  None
    1  1  01000  ALABAMA     None  None
    2  2  01001  Autauga  County,    AL
    3  3  01003  Baldwin  County,    AL
    4  4  01005  Barbour  County,    AL

User · Answer

You can extract the different parts out quite neatly using a regex pattern:

    In [11]: df.row.str.extract(r'(?P<fips>\d{5})((?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))')
    Out[11]:
        fips                    1           state           county state_code
    0  00000        UNITED STATES   UNITED STATES              NaN        NaN
    1  01000              ALABAMA         ALABAMA              NaN        NaN
    2  01001   Autauga County, AL             NaN   Autauga County         AL
    3  01003   Baldwin County, AL             NaN   Baldwin County         AL
    4  01005   Barbour County, AL             NaN   Barbour County         AL

    [5 rows x 5 columns]

To explain the somewhat long regex:

    (?P<fips>\d{5})

Matches the five digits (\d) and names them "fips".

The next part:

    ((?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))

Does either (|) one of two things:

    (?P<state>[A-Z ]*$)

Matches any number (*) of capital letters or spaces ([A-Z ]) and names this "state" before the end of the string ($),

or

    (?P<county>.*?), (?P<state_code>[A-Z]{2}$)

matches anything else (.*?), then a comma and a space, then matches the two-letter state_code before the end of the string ($).

In the example, note that the first two rows hit the "state" branch (leaving NaN in the county and state_code columns), whilst the last three hit the county, state_code branch (leaving NaN in the state column).
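If only the two columns from the question are needed, a much shorter pattern is enough. The sketch below is a simplification rather than the pattern from this answer: it assumes every value is five digits, one space, then the rest of the text, and the group name 'name' is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"row": ["00000 UNITED STATES", "01000 ALABAMA",
                           "01001 Autauga County, AL"]})

# Two named groups: the five-digit code, then everything after the space.
pattern = r"(?P<fips>\d{5}) (?P<name>.*)"

# With named groups, str.extract returns a DataFrame with those column names.
parts = df["row"].str.extract(pattern)
print(parts)
```

Rows that do not match the pattern would come back as NaN in both columns, which makes malformed input easy to spot.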

User · Answer

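If you don't want to create a new dataframe, or if your dataframe has more columns than just the ones you want to split, you could:

    df["fips"], df["row_name"] = zip(*df["row"].str.split(n=1).tolist())
    del df["row"]

The n=1 matters here: without it, names with more than one word are split into extra pieces, and zip silently drops everything past the shortest row.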
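One thing to watch with the zip idiom is how it behaves when rows split into different numbers of pieces. The sketch below, using two rows from the question and illustrative variable names, contrasts an unlimited split with a split limited to one cut.

```python
import pandas as pd

rows = pd.Series(["00000 UNITED STATES", "01000 ALABAMA"])

# Unlimited split: the rows yield 3 and 2 pieces, and zip(*...) stops at
# the shortest row, so the word "STATES" is silently lost.
truncated = list(zip(*rows.str.split()))

# Limiting to one split gives exactly two pieces per row.
fips, name = zip(*rows.str.split(n=1))

print(truncated)
print(fips, name)
```

Because zip truncates without raising, the unlimited form can lose data quietly; limiting the number of splits keeps the remainder intact.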

User · Answer

There might be a better way, but here's one approach:

                            row
    0       00000 UNITED STATES
    1             01000 ALABAMA
    2  01001 Autauga County, AL
    3  01003 Baldwin County, AL
    4  01005 Barbour County, AL

    df = pd.DataFrame(df.row.str.split(' ', n=1).tolist(),
                      columns=['fips', 'row'])

        fips                 row
    0  00000       UNITED STATES
    1  01000             ALABAMA
    2  01001  Autauga County, AL
    3  01003  Baldwin County, AL
    4  01005  Barbour County, AL

User · Answer

    df[['fips', 'row']] = df['row'].str.split(' ', n=1, expand=True)
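For completeness, here is that one-liner as a self-contained sketch on the question's data. The final reordering step is an addition here, since the new 'fips' column is appended after the existing 'row' column.

```python
import pandas as pd

df = pd.DataFrame({"row": ["00000 UNITED STATES", "01000 ALABAMA",
                           "01001 Autauga County, AL",
                           "01003 Baldwin County, AL",
                           "01005 Barbour County, AL"]})

# Split once on the first space; 'row' is overwritten with the remainder
# and 'fips' is created as a new column.
df[["fips", "row"]] = df["row"].str.split(" ", n=1, expand=True)

# Put the columns in the order the question asks for.
df = df[["fips", "row"]]
print(df)
```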

User · Answer

I saw that no one had used the slice method, so here are my 2 cents:

    df["<col_name>"].str.slice(stop=5)
    df["<col_name>"].str.slice(start=6)

Each call returns a new Series (the first five characters, and everything from position 6 onward), which you can assign to two new columns.
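Applied to the question's data, the slice approach could be sketched as below. It assumes every code is exactly five characters followed by a single separator character; the column name 'name' is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"row": ["00000 UNITED STATES", "01000 ALABAMA",
                           "01001 Autauga County, AL"]})

# Characters 0-4 are the code; position 5 is the space; the rest is the name.
df["fips"] = df["row"].str.slice(stop=5)
df["name"] = df["row"].str.slice(start=6)
print(df[["fips", "name"]])
```

Fixed-position slicing avoids any delimiter handling, but it gives wrong results if the code length ever varies.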

User · Answer

If you want to split a string into more than two columns based on a delimiter, you can omit the 'maximum splits' parameter. You can use:

    df['column_name'].str.split('<delimiter>', expand=True)

This will automatically create as many columns as the maximum number of fields included in any of your initial strings.

User · Answer

Surprised I haven't seen this one yet. If you only need to split into two parts, I highly recommend Series.str.partition.

partition performs one split on the separator, and is generally quite performant.

    df['row'].str.partition(' ')[[0, 2]]

           0                   2
    0  00000       UNITED STATES
    1  01000             ALABAMA
    2  01001  Autauga County, AL
    3  01003  Baldwin County, AL
    4  01005  Barbour County, AL

If you need to rename the columns:

    df['row'].str.partition(' ')[[0, 2]].rename({0: 'fips', 2: 'row'}, axis=1)

        fips                 row
    0  00000       UNITED STATES
    1  01000             ALABAMA
    2  01001  Autauga County, AL
    3  01003  Baldwin County, AL
    4  01005  Barbour County, AL

If you need to join this back to the original, use join or concat:

    df.join(df['row'].str.partition(' ')[[0, 2]])

    pd.concat([df, df['row'].str.partition(' ')[[0, 2]]], axis=1)

                            row      0                   2
    0       00000 UNITED STATES  00000       UNITED STATES
    1             01000 ALABAMA  01000             ALABAMA
    2  01001 Autauga County, AL  01001  Autauga County, AL
    3  01003 Baldwin County, AL  01003  Baldwin County, AL
    4  01005 Barbour County, AL  01005  Barbour County, AL

User · Answer

Use df.assign to create a new df. See http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

    split = df_selected['name'].str.split('<delimiter>', n=1, expand=True)
    df_split = df_selected.assign(first_name=split[0], last_name=split[1])
    df_split.drop('name', axis=1, inplace=True)
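A self-contained sketch of the assign approach follows. The sample names and the space delimiter are assumptions made for illustration; the point is that assign returns a new DataFrame and leaves df_selected untouched.

```python
import pandas as pd

# Hypothetical input: a 'name' column holding "first last" strings.
df_selected = pd.DataFrame({"name": ["Ada Lovelace", "Alan Turing"]})

# Split once on the first space into two positional columns, 0 and 1.
split = df_selected["name"].str.split(" ", n=1, expand=True)

# assign builds a new DataFrame; drop(columns=...) removes the source column.
df_split = (
    df_selected
    .assign(first_name=split[0], last_name=split[1])
    .drop(columns="name")
)
print(df_split)
```

Chaining assign and drop without inplace=True keeps every step on a fresh object, which sidesteps the view-versus-copy warning the linked page describes.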
