[r] Splitting a dataframe string column into multiple different columns

What I am trying to accomplish is splitting a column into multiple columns. I would prefer the first column to contain "F", second column "US", third "CA6" or "DL", and the fourth to be "Z13" or "U13" etc etc. My entire df follows the same pattern of X.XX.XXXX.XXX or X.XX.XXX.XXX or X.XX.XX.XXX and I know the third column is where my problem lies because of the different lengths. I have only used substr in the past and I could use that here with some if statements but would like to learn how to use stringr package and POSIX to do this (unless there is a better option). Thank you in advance.

Here is my df:

c("F.US.CLE.V13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", 
"F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", 
"F.US.DL.U13", "F.US.DL.U13", "F.US.DL.U13", "F.US.DL.Z13", "F.US.DL.Z13"
)

This question is related to r split dataframe stringr

The answer is


We could use tidyr::extract()

x <- c("F.US.CLE.V13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", 
  "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", 
  "F.US.DL.U13", "F.US.DL.U13", "F.US.DL.U13", "F.US.DL.Z13", "F.US.DL.Z13"
)


library(tidyr)
extract(tibble(data=x),"data", regex = "^(.*?)\\.(.*?)\\.(.*?)\\.(.*?)$",into = LETTERS[1:4])
#> # A tibble: 13 x 4
#>    A     B     C     D    
#>    <chr> <chr> <chr> <chr>
#>  1 F     US    CLE   V13  
#>  2 F     US    CA6   U13  
#>  3 F     US    CA6   U13  
#>  4 F     US    CA6   U13  
#>  5 F     US    CA6   U13  
#>  6 F     US    CA6   U13  
#>  7 F     US    CA6   U13  
#>  8 F     US    CA6   U13  
#>  9 F     US    DL    U13  
#> 10 F     US    DL    U13  
#> 11 F     US    DL    U13  
#> 12 F     US    DL    Z13  
#> 13 F     US    DL    Z13

Another option is to use unglue::unglue_data()

# remotes::install_github("moodymudskipper/unglue")
library(unglue)
unglue_data(x,"{A}.{B}.{C}.{D}")
#>    A  B   C   D
#> 1  F US CLE V13
#> 2  F US CA6 U13
#> 3  F US CA6 U13
#> 4  F US CA6 U13
#> 5  F US CA6 U13
#> 6  F US CA6 U13
#> 7  F US CA6 U13
#> 8  F US CA6 U13
#> 9  F US  DL U13
#> 10 F US  DL U13
#> 11 F US  DL U13
#> 12 F US  DL Z13
#> 13 F US  DL Z13

Created on 2019-09-14 by the reprex package (v0.3.0)


Is this what you are trying to do?

# Our data
text <- c("F.US.CLE.V13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", 
"F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", 
"F.US.DL.U13", "F.US.DL.U13", "F.US.DL.U13", "F.US.DL.Z13", "F.US.DL.Z13"
)

#  Split into individual elements by the '.' character
#  Remember to escape it, because '.' by itself matches any single character
elems <- unlist( strsplit( text , "\\." ) )

#  We know the dataframe should have 4 columns, so make a matrix
m <- matrix( elems , ncol = 4 , byrow = TRUE )

#  Coerce to data.frame - head() is just to illustrate the top portion
head( as.data.frame( m ) )
#  V1 V2  V3  V4
#1  F US CLE V13
#2  F US CA6 U13
#3  F US CA6 U13
#4  F US CA6 U13
#5  F US CA6 U13
#6  F US CA6 U13

The way via unlist and matrix seems a bit convoluted, and requires you to hard-code the number of elements (this is actually a pretty big no-go. Of course you could circumvent hard-coding that number and determine it at run-time)

I would go a different route, and construct a data frame directly from the list that strsplit returns. For me, this is conceptually simpler. There are essentially two ways of doing this:

  1. as.data.frame – but since the list is exactly the wrong way round (we have a list of rows rather than a list of columns) we have to transpose the result. We also clear the rownames since they are ugly by default (but that’s strictly unnecessary!):

    `rownames<-`(t(as.data.frame(strsplit(text, '\\.'))), NULL)
    
  2. Alternatively, use rbind to construct a data frame from the list of rows. We use do.call to call rbind with all the rows as separate arguments:

    do.call(rbind, strsplit(text, '\\.'))
    

Both ways yield the same result:

     [,1] [,2] [,3]  [,4]
[1,] "F"  "US" "CLE" "V13"
[2,] "F"  "US" "CA6" "U13"
[3,] "F"  "US" "CA6" "U13"
[4,] "F"  "US" "CA6" "U13"
[5,] "F"  "US" "CA6" "U13"
[6,] "F"  "US" "CA6" "U13"
…

Clearly, the second way is much simpler than the first.


Examples related to r

How to get AIC from Conway–Maxwell-Poisson regression via COM-poisson package in R? R : how to simply repeat a command? session not created: This version of ChromeDriver only supports Chrome version 74 error with ChromeDriver Chrome using Selenium How to show code but hide output in RMarkdown? remove kernel on jupyter notebook Function to calculate R2 (R-squared) in R Center Plot title in ggplot2 R ggplot2: stat_count() must not be used with a y aesthetic error in Bar graph R multiple conditions in if statement What does "The following object is masked from 'package:xxx'" mean?

Examples related to split

Parameter "stratify" from method "train_test_split" (scikit Learn) Pandas split DataFrame by column value How to split large text file in windows? Attribute Error: 'list' object has no attribute 'split' Split function in oracle to comma separated values with automatic sequence How would I get everything before a : in a string Python Split String by delimiter position using oracle SQL JavaScript split String with white space Split a String into an array in Swift? Split pandas dataframe in two if it has more than 10 rows

Examples related to dataframe

Trying to merge 2 dataframes but get ValueError How to show all of columns name on pandas dataframe? Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Display all dataframe columns in a Jupyter Python Notebook How to convert column with string type to int form in pyspark data frame? Display/Print one column from a DataFrame of Series in Pandas Binning column with python pandas Selection with .loc in python Set value to an entire column of a pandas dataframe

Examples related to stringr

Splitting a dataframe string column into multiple different columns