I am trying to split a column into multiple columns. I would like the first column to contain "F", the second "US", the third "CLE", "CA6" or "DL", and the fourth "V13", "U13" or "Z13", and so on. Every value in my df follows the pattern X.XX.XXXX.XXX, X.XX.XXX.XXX or X.XX.XX.XXX, and I know the third column is where my problem lies because its length varies. I have only used substr in the past, and I could use it here with some if statements, but I would like to learn how to use the stringr package and POSIX to do this (unless there is a better option). Thank you in advance.
Here is my df:
c("F.US.CLE.V13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13",
"F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13",
"F.US.DL.U13", "F.US.DL.U13", "F.US.DL.U13", "F.US.DL.Z13", "F.US.DL.Z13"
)
We could use tidyr::extract():
x <- c("F.US.CLE.V13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13",
"F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13",
"F.US.DL.U13", "F.US.DL.U13", "F.US.DL.U13", "F.US.DL.Z13", "F.US.DL.Z13"
)
library(tidyr)
# Four lazy capture groups separated by literal dots, one group per new column
extract(tibble(data = x), "data", into = LETTERS[1:4],
        regex = "^(.*?)\\.(.*?)\\.(.*?)\\.(.*?)$")
#> # A tibble: 13 x 4
#> A B C D
#> <chr> <chr> <chr> <chr>
#> 1 F US CLE V13
#> 2 F US CA6 U13
#> 3 F US CA6 U13
#> 4 F US CA6 U13
#> 5 F US CA6 U13
#> 6 F US CA6 U13
#> 7 F US CA6 U13
#> 8 F US CA6 U13
#> 9 F US DL U13
#> 10 F US DL U13
#> 11 F US DL U13
#> 12 F US DL Z13
#> 13 F US DL Z13
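Because the separator here is always a single dot, tidyr::separate() may be a simpler fit than writing out four capture groups. The following is only a sketch on a shortened version of the data; the column names A to D are reused from above, and note that recent tidyr versions mark separate() as superseded, although it still works.

```r
library(tidyr)

# Shortened sample of the data from the question
x <- c("F.US.CLE.V13", "F.US.CA6.U13", "F.US.DL.Z13")

# sep is interpreted as a regular expression, so the dot must be escaped
res <- separate(tibble(data = x), data, into = LETTERS[1:4], sep = "\\.")
res
```

The varying length of the third field does not matter here, since the split is driven by the separator rather than by character positions.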
Another option is to use unglue::unglue_data():
# remotes::install_github("moodymudskipper/unglue")
library(unglue)
unglue_data(x,"{A}.{B}.{C}.{D}")
#> A B C D
#> 1 F US CLE V13
#> 2 F US CA6 U13
#> 3 F US CA6 U13
#> 4 F US CA6 U13
#> 5 F US CA6 U13
#> 6 F US CA6 U13
#> 7 F US CA6 U13
#> 8 F US CA6 U13
#> 9 F US DL U13
#> 10 F US DL U13
#> 11 F US DL U13
#> 12 F US DL Z13
#> 13 F US DL Z13
Created on 2019-09-14 by the reprex package (v0.3.0)
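Since the question asks about stringr specifically, here is a minimal sketch of two ways it could be done with that package, again on a shortened sample. The column names A to D and the object names parts, df and m are just illustrative choices.

```r
library(stringr)

# Shortened sample of the data from the question
x <- c("F.US.CLE.V13", "F.US.CA6.U13", "F.US.DL.Z13")

# Option 1: split on a literal dot into exactly 4 pieces.
# fixed(".") treats the dot as plain text, so no regex escaping is needed.
parts <- str_split_fixed(x, fixed("."), n = 4)
colnames(parts) <- LETTERS[1:4]
df <- as.data.frame(parts, stringsAsFactors = FALSE)
df

# Option 2: a regular expression with one capture group per field.
# str_match() returns the full match in column 1, so drop it.
m <- str_match(x, "^([^.]+)\\.([^.]+)\\.([^.]+)\\.([^.]+)$")[, -1]
m
```

Option 1 is probably the closer replacement for substr plus if statements; option 2 is mainly useful if you want to practise regular expressions or need to validate the pattern at the same time.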
Is this what you are trying to do?
# Our data
text <- c("F.US.CLE.V13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13",
"F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13",
"F.US.DL.U13", "F.US.DL.U13", "F.US.DL.U13", "F.US.DL.Z13", "F.US.DL.Z13"
)
# Split into individual elements by the '.' character
# Remember to escape it, because '.' by itself matches any single character
elems <- unlist( strsplit( text , "\\." ) )
# We know the dataframe should have 4 columns, so make a matrix
m <- matrix( elems , ncol = 4 , byrow = TRUE )
# Coerce to data.frame - head() is just to illustrate the top portion
head( as.data.frame( m ) )
# V1 V2 V3 V4
#1 F US CLE V13
#2 F US CA6 U13
#3 F US CA6 U13
#4 F US CA6 U13
#5 F US CA6 U13
#6 F US CA6 U13
The approach via unlist and matrix seems a bit convoluted, and it requires you to hard-code the number of elements, which is a pretty big no-go. (Of course, you could avoid hard-coding that number and determine it at run time instead.)
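To illustrate the run-time variant, here is a rough sketch that derives the column count from the data instead of hard-coding 4. It assumes every string splits into the same number of pieces and stops otherwise; the names parts and n_cols are made up for this example.

```r
# Shortened sample of the data from the question
text <- c("F.US.CLE.V13", "F.US.CA6.U13", "F.US.DL.Z13")

# fixed = TRUE splits on a literal dot, so no escaping is needed
parts <- strsplit(text, ".", fixed = TRUE)

# Number of pieces per string; this sketch assumes it is the same for all
n_cols <- unique(lengths(parts))
stopifnot(length(n_cols) == 1)

m <- matrix(unlist(parts), ncol = n_cols, byrow = TRUE)
m
```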
I would go a different route and build the result directly from the list that strsplit returns. For me, this is conceptually simpler. There are essentially two ways of doing this:
Use as.data.frame. Since the list is exactly the wrong way round (we have a list of rows rather than a list of columns), we have to transpose the result, which turns it into a character matrix. We also clear the row names, since they are ugly by default (but that is strictly unnecessary):
`rownames<-`(t(as.data.frame(strsplit(text, '\\.'))), NULL)
Alternatively, use rbind to bind the list of rows into a matrix. We use do.call to call rbind with all the rows as separate arguments:
do.call(rbind, strsplit(text, '\\.'))
Both ways yield the same character matrix:
[,1] [,2] [,3] [,4]
[1,] "F" "US" "CLE" "V13"
[2,] "F" "US" "CA6" "U13"
[3,] "F" "US" "CA6" "U13"
[4,] "F" "US" "CA6" "U13"
[5,] "F" "US" "CA6" "U13"
[6,] "F" "US" "CA6" "U13"
…
Clearly, the second way is much simpler than the first.
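If an actual data frame is needed rather than a matrix, the rbind result can be wrapped in as.data.frame. This is a small sketch on a shortened sample; the column names are hypothetical, since the question does not say what the four fields mean.

```r
# Shortened sample of the data from the question
text <- c("F.US.CLE.V13", "F.US.CA6.U13", "F.US.DL.Z13")

# One row per input string, one column per dot-separated field
m <- do.call(rbind, strsplit(text, "\\."))

# Convert the character matrix to a data frame with character columns
df <- as.data.frame(m, stringsAsFactors = FALSE)

# Hypothetical column names, chosen only for illustration
names(df) <- c("type", "country", "code", "series")
df
```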
Source: Stackoverflow.com