How to read data when some numbers contain commas as thousand separator?

Question

I have a csv file where some of the numerical values are expressed as strings with commas as thousand separator, e.g. "1,513" instead of 1513. What is the simplest way to read the data into R?

I can use `read.csv(..., colClasses="character")`, but then I have to strip out the commas from the relevant elements before converting those columns to numeric, and I can't find a neat way to do that.

User · Answer

A `dplyr` solution using `mutate_all` and pipes.

Say you have the following:

    > dft
    Source: local data frame [11 x 5]

       Bureau.Name Account.Code   X2014   X2015   X2016
    1       Senate          110 158,000 211,000 186,000
    2       Senate          115       0       0       0
    3       Senate          123  15,000  71,000  21,000
    4       Senate          126   6,000  14,000   8,000
    5       Senate          127 110,000 234,000 134,000
    6       Senate          128 120,000 159,000 134,000
    7       Senate          129       0       0       0
    8       Senate          130 368,000 465,000 441,000
    9       Senate          132       0       0       0
    10      Senate          140       0       0       0
    11      Senate          140       0       0       0

and want to remove commas from the year variables X2014-X2016 and convert them to numeric. Also, let's say X2014-X2016 are read in as factors (the default).

    dft %>%
        mutate_all(funs(as.character(.)), X2014:X2016) %>%
        mutate_all(funs(gsub(",", "", .)), X2014:X2016) %>%
        mutate_all(funs(as.numeric(.)), X2014:X2016)

`mutate_all` applies the function(s) inside `funs` to the specified columns.

I did it sequentially, one function at a time (if you use multiple functions inside `funs`, you create additional, unnecessary columns).
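In current dplyr releases `funs()` is deprecated, and a column range is usually selected with `across()` rather than passed to `mutate_all()`. The following is a minimal sketch of the same three steps in one pass, assuming dplyr 1.0 or later; the two-row `dft` here is a made-up stand-in for the table above.

```r
library(dplyr)

# Hypothetical miniature of the table above; the year columns are factors.
dft <- data.frame(
  Account.Code = c(110, 123),
  X2014 = c("158,000", "15,000"),
  X2015 = c("211,000", "71,000"),
  stringsAsFactors = TRUE
)

# factor -> character -> strip commas -> numeric, for the selected range only
dft2 <- dft %>%
  mutate(across(X2014:X2015,
                ~ as.numeric(gsub(",", "", as.character(.x), fixed = TRUE))))

str(dft2)
```

Converting through `as.character()` first matters for factors: calling `as.numeric()` directly on a factor returns the internal level codes, not the printed values.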

User · Answer

You can have `read.table` or `read.csv` do this conversion for you semi-automatically. First create a new class definition, then create a conversion function and set it as an "as" method using the `setAs` function, like so:

    setClass("num.with.commas")
    setAs("character", "num.with.commas",
          function(from) as.numeric(gsub(",", "", from)))

Then run `read.csv` like:

    DF <- read.csv('your.file.here',
                   colClasses=c('num.with.commas','factor','character','numeric','num.with.commas'))
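To try the `setAs` approach without a file on disk, the sketch below feeds `read.csv` an inline string through its `text` argument. The column names and values are invented for illustration; the class name is the one used above.

```r
library(methods)  # setClass/setAs live here; loaded explicitly for Rscript

setClass("num.with.commas")
setAs("character", "num.with.commas",
      function(from) as.numeric(gsub(",", "", from, fixed = TRUE)))

# Made-up sample data: the second column has quoted thousand separators.
csv_text <- 'id,amount\na,"1,513"\nb,"27"\nc,"2,000,000"'

DF <- read.csv(text = csv_text,
               colClasses = c("character", "num.with.commas"))
str(DF)
```

The `amount` column arrives as a plain numeric vector, so no post-processing step is needed after the read.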

User · Answer

This question is several years old, but I stumbled upon it, which means maybe others will.

The `readr` library/package has some nice features to it. One of them is a nice way to interpret "messy" columns, like these.

    library(readr)
    read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5",
             col_types = list(col_numeric())
            )

This yields

    Source: local data frame [4 x 1]

      numbers
        (dbl)
    1   800.0
    2  1800.0
    3  3500.0
    4     6.5

An important point when reading in files: you either have to pre-process, like the comment above regarding `sed`, or you have to process while reading. Often, if you try to fix things after the fact, there are some dangerous assumptions made that are hard to find. (Which is why flat files are so evil in the first place.)

For instance, if I had not flagged the `col_types`, I would have gotten this:

    > read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5")
    Source: local data frame [4 x 1]

      numbers
        (chr)
    1     800
    2   1,800
    3    3500
    4     6.5

(Notice that it is now a `chr` (character) instead of a numeric.)

Or, more dangerously, if it were long enough and most of the early elements did not contain commas:

    > set.seed(1)
    > tmp <- as.character(sample(c(1:10), 100, replace=TRUE))
    > tmp <- c(tmp, "1,003")
    > tmp <- paste(tmp, collapse="\"\n\"")

(such that the last few elements look like:)

    \"5\"\n\"9\"\n\"7\"\n\"1,003"

Then you'll find trouble reading that comma at all!

    > tail(read_csv(tmp))
    Source: local data frame [6 x 1]

         3"
      (dbl)
    1 8.000
    2 5.000
    3 5.000
    4 9.000
    5 7.000
    6 1.003
    Warning message:
    1 problems parsing literal data. See problems(...) for more details.

User · Answer

If the number is grouped by "." and uses "," for decimals (1.200.000,00), you must set `fixed=TRUE` when calling `gsub`:

    as.numeric(gsub(".", "", y, fixed=TRUE))
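The reason `fixed = TRUE` matters is that a bare `"."` is a regular expression matching any character. The short sketch below contrasts the two calls on invented sample values, and also swaps the decimal comma for a point, since `as.numeric` would otherwise return `NA` for values that still contain a comma.

```r
# Made-up values in "1.200.000,00" style
y <- c("1.200.000,00", "3.500,75", "12")

# As a regex, "." matches every character, so everything is deleted
regex_result <- gsub(".", "", y)

# With fixed = TRUE only literal dots (the grouping marks) are removed
no_groups <- gsub(".", "", y, fixed = TRUE)

# Replace the decimal comma with a point before converting
values <- as.numeric(sub(",", ".", no_groups, fixed = TRUE))
print(values)
```

Escaping the dot (`"\\."`) would work as well; `fixed = TRUE` is simply the less error-prone spelling.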

User · Answer

Not sure about how to have `read.csv` interpret it properly, but you can use `gsub` to replace "," with "", and then convert the string to numeric using `as.numeric`:

    y <- c("1,200","20,000","100","12,111")
    as.numeric(gsub(",", "", y))
    # [1]  1200 20000   100 12111

This was also answered previously on R-Help (and in Q2 here).

Alternatively, you can pre-process the file, for instance with `sed` in unix.

User · Answer

A very convenient way is the `readr::read_delim` family. Taking the example from here: Importing csv with multiple separators into R, you can do it as follows:

    txt <- 'OBJECTID,District_N,ZONE_CODE,COUNT,AREA,SUM
    1,Bagamoyo,1,"136,227","8,514,187,500.000000000000000","352,678.813105723350000"
    2,Bariadi,2,"88,350","5,521,875,000.000000000000000","526,307.288878142830000"
    3,Chunya,3,"483,059","30,191,187,500.000000000000000","352,444.699742995200000"'

    require(readr)
    read_csv(txt) # = read_delim(txt, delim = ",")

Which gives the expected result:

    # A tibble: 3 × 6
      OBJECTID District_N ZONE_CODE  COUNT        AREA      SUM
         <int>      <chr>     <int>  <dbl>       <dbl>    <dbl>
    1        1   Bagamoyo         1 136227  8514187500 352678.8
    2        2    Bariadi         2  88350  5521875000 526307.3
    3        3     Chunya         3 483059 30191187500 352444.7

User · Answer

I think preprocessing is the way to go. You could use Notepad++, which has a regular expression replace option.

For example, if your file were like this:

    "1,234","123","1,234"
    "234","123","1,234"
    123,456,789

Then you could use the regular expression `"([0-9]+),([0-9]+)"` and replace it with `\1\2`:

    1234,"123",1234
    "234","123",1234
    123,456,789

Then you could use `x <- read.csv(file="x.csv", header=FALSE)` to read the file.

User · Answer

"Preprocess" in R:

    lines <- "www, rrr, 1,234, ttt \n rrr,zzz, 1,234,567,987, rrr"

You can use `readLines` on a `textConnection`. Then remove only the commas that are between digits:

    gsub("([0-9]+)\\,([0-9])", "\\1\\2", lines)

    ## [1] "www, rrr, 1234, ttt \n rrr,zzz, 1234567987, rrr"

It's also useful to know, though not directly relevant to this question, that commas as decimal separators can be handled by `read.csv2` (automagically) or by `read.table` (by setting the `dec` parameter).

Edit: Later I discovered how to use `colClasses` by designing a new class. See: How to load df with 1000 separator in R as numeric class?
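Putting the pieces of this answer together, here is a minimal sketch that reads raw lines through a `textConnection`, strips the digit-to-digit commas, and hands the cleaned lines to `read.csv`. The sample text and column names are invented, and it assumes real field separators are followed by a space, because the pattern would also merge two adjacent numeric fields separated by a bare comma.

```r
# Made-up input: fields separated by ", ", thousands separated by ","
raw <- "name, amount, note\nwww, 1,234, ttt\nrrr, 1,234,567,987, zzz"

con <- textConnection(raw)
lines <- readLines(con)
close(con)

# Remove only commas that sit directly between two digits
clean <- gsub("([0-9]+),([0-9])", "\\1\\2", lines)

df <- read.csv(text = clean, strip.white = TRUE)
str(df)
```

With a real file, `readLines("path/to/file.csv")` would replace the `textConnection` step.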

User · Answer

I want to use R rather than pre-processing the data, as it makes it easier when the data are revised. Following Shane's suggestion of using `gsub`, I think this is about as neat as I can do:

    x <- read.csv("file.csv", header=TRUE, colClasses="character")
    col2cvt <- 15:41
    x[,col2cvt] <- lapply(x[,col2cvt], function(x){as.numeric(gsub(",", "", x))})
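For anyone who wants to try this pattern without the original file, the following sketch applies the same read-as-character-then-convert idea to a small inline table. The data and the column names in `col2cvt` are hypothetical; selecting by name instead of by position is just one option that survives column reordering when the data are revised.

```r
# Made-up data standing in for file.csv
csv_text <- 'name,sales,units\nA,"1,513","12"\nB,"27,000","1,204"'

# Read everything as character so nothing is converted prematurely
x <- read.csv(text = csv_text, header = TRUE, colClasses = "character")

# Columns to convert, chosen by name here (hypothetical names)
col2cvt <- c("sales", "units")
x[col2cvt] <- lapply(x[col2cvt],
                     function(v) as.numeric(gsub(",", "", v, fixed = TRUE)))
str(x)
```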

User · Answer

We can also use `readr::parse_number`; the columns must be characters, though. If we want to apply it to multiple columns, we can loop through the columns using `lapply`:

    df[2:3] <- lapply(df[2:3], readr::parse_number)
    df
    #  a        b        c
    #1 a    12234       12
    #2 b      123  1234123
    #3 c     1234     1234
    #4 d 13456234    15342
    #5 e    12312 12334512

Or use `mutate_at` from `dplyr` to apply it to specific variables:

    library(dplyr)
    df %>% mutate_at(2:3, readr::parse_number)
    # Or
    df %>% mutate_at(vars(b:c), readr::parse_number)

data:

    df <- data.frame(a = letters[1:5],
                     b = c("12,234", "123", "1,234", "13,456,234", "123,12"),
                     c = c("12", "1,234,123", "1234", "15,342", "123,345,12"),
                     stringsAsFactors = FALSE)

User · Answer

Using the `read_delim` function, which is part of the `readr` library, you can specify an additional parameter:

    locale = locale(decimal_mark = ",")

    read_delim("filetoread.csv", ';', locale = locale(decimal_mark = ","))

The semicolon in the second line means that `read_delim` will read semicolon-separated values.

This will help to read all numbers with a comma as proper numbers.

Regards

Mateusz Kania
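Note that `decimal_mark` treats the comma as the decimal point, which is the opposite convention from the thousands comma in the question. `locale()` also accepts a `grouping_mark` argument, and the sketch below uses `readr::parse_number` on made-up character vectors to show both conventions side by side.

```r
library(readr)

# Comma as thousands separator (the question's case): the default locale works
us <- parse_number("1,513")

# Comma as decimal mark, dot as grouping mark (made-up European-style values)
eu <- parse_number(c("1.234,5", "12,25"),
                   locale = locale(decimal_mark = ",", grouping_mark = "."))

print(us)
print(eu)
```

The same `locale` object can be passed to `read_delim` or `read_csv`, where it affects columns parsed with `col_number()`.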
