Dealing with commas in a CSV file

Question

I am looking for suggestions on how to handle a CSV file that is created and then uploaded by our customers, and that may have a comma in a value, such as a company name. Some of the ideas we are looking at are quoted identifiers (value "," values "," etc.) or using a | instead of a comma. The biggest problem is that we have to make it easy, or the customer won't do it.

User · Answer

As this is about general practices, let's start with some rules of thumb:

1. Don't use CSV; use XML with a library to read & write the XML file instead.
2. If you must use CSV, do it properly and use a free library to parse and store the CSV files.

To justify 1), most CSV parsers aren't encoding-aware, so if you aren't dealing with US-ASCII you are asking for trouble. For example, Excel 2002 stores the CSV in the local encoding without any note about the encoding. The CSV standard isn't widely adopted. On the other hand, the XML standard is well adopted and it handles encodings pretty well.

To justify 2), there are tons of CSV parsers around for almost every language, so there is no need to reinvent the wheel even if the solution looks pretty simple. To name a few:

- for Python, use the built-in csv module
- for Perl, check CPAN and Text::CSV
- for PHP, use the built-in fgetcsv/fputcsv functions
- for Java, check the SuperCSV library

Really, there is no need to implement this by hand unless you are going to parse it on an embedded device.
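To illustrate the "use a library" advice, here is a minimal sketch with Python's built-in csv module. The helper names `write_rows` and `read_rows` and the sample data are made up for this example; the point is only that the library adds quotes where a value contains a comma and removes them again on reading.

```python
import csv
import io


def write_rows(rows):
    """Serialize rows to CSV text; fields are quoted only when they need it."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


def read_rows(text):
    """Parse CSV text back into a list of rows (each a list of strings)."""
    return list(csv.reader(io.StringIO(text, newline="")))


# Sample data, invented for illustration: one value contains a comma.
rows = [["id", "company"], ["1", "Acme, Inc."]]
text = write_rows(rows)
print(text)
print(read_rows(text) == rows)
```

With a real file you would pass `open(path, newline="", encoding="utf-8")` instead of the in-memory buffer, which also addresses the encoding concern raised above.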

User · Answer

Put double quotes around strings. That is generally what Excel does.

Ala Eli, you escape a double quote as two double quotes. E.g.

    "test1","foo""bar","test2"

User · Answer

I usually do this in my CSV file parsing routines. Assume that the 'line' variable is one line within a CSV file and all of the columns' values are enclosed in double quotes. After the two lines below execute, you will get the CSV columns in the 'values' collection.

    // The two lines below split the columns and trim the DOUBLE QUOTES around values, but NOT within them
    string trimmedLine = line.Trim(new char[] { '\"' });
    List<string> values = trimmedLine.Split(new string[] { "\",\"" }, StringSplitOptions.None).ToList();

User · Answer

If you're interested in a more educational exercise on how to parse files in general (using CSV as an example), you may check out this article by Julian Bucknall. I like the article because it breaks things down into much smaller problems that are much less insurmountable. You first create a grammar, and once you have a good grammar, it's a relatively easy and methodical process to convert the grammar into code.

The article uses C# and has a link at the bottom to download the code.

User · Answer

You can use alternative "delimiters" like ";" or "|", but the simplest might just be quoting, which is supported by most (decent) CSV libraries and most decent spreadsheets.

For more on CSV delimiters and a spec for a standard format for describing delimiters and quoting, see this webpage.

User · Answer

You can put double quotes around the fields. I don't like this approach, as it adds another special character (the double quote). Just define an escape character (usually backslash) and use it wherever you need to escape something:

    data,more data,more data\, even,yet more

You don't have to try to match quotes, and you have fewer exceptions to parse. This simplifies your code, too.
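If you go the escape-character route, a library can still do the work. The sketch below uses Python's csv module with `quoting=csv.QUOTE_NONE` and `escapechar="\\"`; the function names are hypothetical, and the sample row is the one from this answer.

```python
import csv
import io


def write_escaped(rows):
    """Write rows without quoting; delimiters inside values get a backslash."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONE, escapechar="\\")
    writer.writerows(rows)
    return buffer.getvalue()


def read_escaped(text):
    """Read backslash-escaped CSV text back into rows."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        quoting=csv.QUOTE_NONE,
        escapechar="\\",
    )
    return list(reader)


row = ["data", "more data", "more data, even", "yet more"]
text = write_escaped([row])
print(text)                 # the comma inside the third value is preceded by a backslash
print(read_escaped(text))   # the original four values come back
```

One caveat of this design: the file is no longer readable by tools that expect quote-based CSV, such as a spreadsheet opened with default settings.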

User · Answer

In case you're on a *nix system, have access to sed, and there can be one or more unwanted commas only in a specific field of your CSV, you can use the following one-liner to enclose that field in " as RFC 4180 Section 2 proposes:

    sed -r 's/([^,]*,[^,]*,[^,]*,)(.*)(,.*,.*)/\1"\2"\3/' inputfile

Depending on which field the unwanted comma(s) may be in, you have to alter/extend the capturing groups of the regex (and the substitution). The example above will enclose the fourth field (out of six) in quotation marks.

In combination with the --in-place option you can apply these changes directly to the file.

In order to "build" the right regex, there's a simple principle to follow:

1. For every field in your CSV that comes before the field with the unwanted comma(s), you write one [^,]*, and put them all together in a capturing group.
2. For the field that contains the unwanted comma(s), you write (.*).
3. For every field after the field with the unwanted comma(s), you write one ,.* and put them all together in a capturing group.

Here is a short overview of different possible regexes/substitutions depending on the specific field. If not given, the substitution is \1"\2"\3.

    ([^,]*)(,.*)                      #first field, regex
    "\1"\2                            #first field, substitution

    (.*,)([^,]*)                      #last field, regex
    \1"\2"                            #last field, substitution

    ([^,]*,)(.*)(,.*,.*,.*)           #second field (out of five fields)
    ([^,]*,[^,]*,)(.*)(,.*)           #third field (out of four fields)
    ([^,]*,[^,]*,[^,]*,)(.*)(,.*,.*)  #fourth field (out of six fields)

If you want to remove the unwanted comma(s) with sed instead of enclosing them in quotation marks, refer to this answer.
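The regex-building principle above can be sketched as a small function. This is an illustration in Python rather than sed; `quote_field` is a hypothetical helper, and it assumes (as the answer does) that only one known field per line can contain stray commas and that the total number of fields is known.

```python
import re


def quote_field(line, index, total):
    """Wrap field number `index` (0-based) of `total` fields in double quotes.

    Fields before the target may not contain commas; the target field is
    matched greedily, so it absorbs any extra commas.
    """
    if not 0 <= index < total:
        raise ValueError("index must be between 0 and total - 1")
    before = r"[^,]*," * index               # one "[^,]*," per leading field
    after = r",.*" * (total - index - 1)     # one ",.*" per trailing field
    pattern = "^(" + before + ")(.*)(" + after + ")$"
    return re.sub(pattern, r'\1"\2"\3', line)


# Fourth field (index 3) out of six contains an unwanted comma.
print(quote_field("a,b,c,Acme, Inc.,e,f", 3, 6))
```

Note that the function quotes the target field even when it contains no comma, which is harmless for CSV readers that accept quoted fields.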

User · Answer

Here's a neat little workaround: you can use a Greek Lower Numeral Sign instead (U+0375).

It looks like this: ͵

Using this method saves you a lot of resources too.

User · Answer

I used the Csvreader library, but with it the data was split on the commas (,) inside column values.

So if you want to insert CSV file data that contains commas (,) in most of the column values, you can use the function below. Author link: https://gist.github.com/jaywilliams/385876

    function csv_to_array($filename='', $delimiter=',')
    {
        if(!file_exists($filename) || !is_readable($filename))
            return FALSE;

        $header = NULL;
        $data = array();
        if (($handle = fopen($filename, 'r')) !== FALSE)
        {
            while (($row = fgetcsv($handle, 1000, $delimiter)) !== FALSE)
            {
                // The first row is the header; later rows are keyed by it
                if(!$header)
                    $header = $row;
                else
                    $data[] = array_combine($header, $row);
            }
            fclose($handle);
        }
        return $data;
    }
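For comparison, the same "header row becomes the keys" idea is available in Python's standard library through `csv.DictReader`. The sketch below is not a port of the gist; `csv_to_dicts` and the sample data are invented for illustration.

```python
import csv
import io


def csv_to_dicts(stream, delimiter=","):
    """Return one dict per data row, keyed by the header row."""
    return [dict(row) for row in csv.DictReader(stream, delimiter=delimiter)]


# In-memory sample; with a real file use open(path, newline="").
sample = io.StringIO('name,city\r\n"Acme, Inc.",Berlin\r\n', newline="")
records = csv_to_dicts(sample)
print(records)
```

The quoted company name survives as a single value because the reader understands quoting, which a plain split on commas would not.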

User · Answer

The CSV format uses commas to separate values; values which contain carriage returns, linefeeds, commas, or double quotes are surrounded by double quotes. Values that contain double quotes are quoted, and each literal quote is escaped by an immediately preceding quote. For example, the 3 values:

    test
    list, of, items
    "go" he said

would be encoded as:

    test
    "list, of, items"
    """go"" he said"

Any field can be quoted, but only fields that contain commas, CR/NL, or quotes must be quoted.

There is no real standard for the CSV format, but almost all applications follow the conventions documented here. The RFC that was mentioned elsewhere is not a standard for CSV; it is an RFC for using CSV within MIME and contains some unconventional and unnecessary limitations that make it useless outside of MIME.

A gotcha that many CSV modules I have seen don't accommodate is that multiple lines can be encoded in a single field, which means you can't assume that each line is a separate record; you either need to disallow newlines in your data or be prepared to handle this.
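The encoding rule described here is small enough to sketch by hand. The following Python is an illustration only (`encode_field` and `encode_row` are made-up names); it also shows the multi-line gotcha by parsing a record whose quoted field spans two physical lines with the standard csv module.

```python
import csv
import io


def encode_field(value):
    """Quote a value only if it contains a comma, a quote, CR or LF."""
    if any(ch in value for ch in ',"\r\n'):
        return '"' + value.replace('"', '""') + '"'
    return value


def encode_row(values):
    """Join already-encoded fields with commas."""
    return ",".join(encode_field(v) for v in values)


line = encode_row(["test", "list, of, items", '"go" he said'])
print(line)

# One record, two physical lines: the newline sits inside a quoted field.
multiline = 'a,"line one\nline two",b\r\n'
records = list(csv.reader(io.StringIO(multiline, newline="")))
print(len(records))
```

Reading line by line and splitting each line would see two broken records in the second case, which is why a record-aware reader is the safer choice.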

User · Answer

First, let's ask ourselves, "Why do we feel the need to handle commas differently for CSV files?"

For me, the answer is, "Because when I export data into a CSV file, the commas in a field disappear and my field gets separated into multiple fields where the commas appear in the original data." (That is because the comma is the CSV field separator character.)

Depending on your situation, semicolons may also be used as CSV field separators.

Given my requirements, I can use a character, e.g., the single low-9 quotation mark, that looks like a comma.

So, here's how you can do it in Go:

    // Replace special CSV characters with single low-9 quotation mark
    func Scrub(a interface{}) string {
        s := fmt.Sprint(a)
        s = strings.Replace(s, ",", "‚", -1)
        s = strings.Replace(s, ";", "‚", -1)
        return s
    }

The second comma-looking character in the Replace function is decimal 8218.

Be aware that if you have clients that may have ASCII-only text readers, this decimal 8218 character will not look like a comma. If this is your case, then I'd recommend surrounding the field containing the comma (or semicolon) with double quotes per RFC 4180: https://tools.ietf.org/html/rfc4180

User · Answer

There is a library available through NuGet for dealing with pretty much any well-formed CSV (.NET): CsvHelper

Example to map to a class:

    var csv = new CsvReader( textReader );
    var records = csv.GetRecords<MyClass>();

Example to read individual fields:

    var csv = new CsvReader( textReader );
    while( csv.Read() )
    {
        var intField = csv.GetField<int>( 0 );
        var stringField = csv.GetField<string>( 1 );
        var boolField = csv.GetField<bool>( "HeaderName" );
    }

Letting the client drive the file format: , is the standard field delimiter, and " is the standard value used to escape fields that contain a delimiter, quote, or line ending.

To use (for example) # for fields and ' for escaping:

    var csv = new CsvReader( textReader );
    csv.Configuration.Delimiter = "#";
    csv.Configuration.Quote = '\'';
    // read the file however meets your needs

More Documentation

User · Answer

Just use SoftCircuits.CsvParser on NuGet. It will handle all those details for you and efficiently handles very large files. If needed, it can even import/export objects by mapping columns to object properties. In addition, my testing showed it averages nearly 4 times faster than the popular CsvHelper.

User · Answer

As mentioned in my comment on harpo's answer, his solution is good and works in most cases; however, in some scenarios, when commas are directly adjacent to each other, it fails to split on the commas.

This is because the regex string behaves unexpectedly as a verbatim string. To get it to behave correctly, all " characters in the regex string need to be escaped manually, without using the verbatim escape.

I.e., the regex should be this using manual escapes:                                                           which translates into                                           When using a verbatim string                                           it behaves as the following, as you can see if you debug the regex:

So in summary, I recommend harpo's solution, but watch out for this little gotcha.

I've included in the CsvReader a little optional failsafe to notify you if this error occurs (if you have a pre-known number of columns):

    if (_expectedDataLength > 0 && values.Length != _expectedDataLength)
        throw new DataLengthException(string.Format("Expected {0} columns when splitting csv, got {1}", _expectedDataLength, values.Length));

This can be injected via the constructor:

    public CsvReader(string fileName, int expectedDataLength = 0) : this(new FileStream(fileName, FileMode.Open, FileAccess.Read))
    {
        _expectedDataLength = expectedDataLength;
    }
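The column-count failsafe is independent of the parser in use. A minimal sketch of the same idea in Python follows; `DataLengthError` and `read_checked` are hypothetical names chosen to mirror the C# above, and a count of 0 disables the check, as in the original constructor default.

```python
import csv
import io


class DataLengthError(ValueError):
    """Raised when a row does not have the expected number of columns."""


def read_checked(stream, expected_columns=0):
    """Read all rows, failing fast if a row has an unexpected column count."""
    rows = []
    for number, row in enumerate(csv.reader(stream), start=1):
        if expected_columns > 0 and len(row) != expected_columns:
            raise DataLengthError(
                "Expected {0} columns in row {1}, got {2}".format(
                    expected_columns, number, len(row)
                )
            )
        rows.append(row)
    return rows


good = io.StringIO('1,"Acme, Inc.",Berlin\r\n', newline="")
print(read_checked(good, expected_columns=3))
```

A check like this catches unquoted commas in customer uploads early, before a shifted column ends up in the wrong database field.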

User · Answer

An example might help to show how commas can be displayed in a .csv file. Create a simple text file as follows, save it as a text file with the suffix ".csv", and open it with Excel 2000 from Windows 10.

    aa,bb,cc,d;d

In the spreadsheet presentation, the lines below should look like the line above, except that they show a displayed comma instead of a semicolon between the d's.

    aa,bb,cc,"d,d", This works even in Excel
    aa,bb,cc,"d,d", This works even in Excel 2000
    aa,bb,cc,"d, d", This works even in Excel 2000
    aa,bb,cc,"d , d", This works even in Excel 2000

    aa,bb,cc, "d,d", This fails in Excel 2000 due to the space before the 1st quote
    aa,bb,cc, "d, d", This fails in Excel 2000 due to the space before the 1st quote
    aa,bb,cc, "d , d", This fails in Excel 2000 due to the space before the 1st quote

    aa,bb,cc,"d,d " , This works even in Excel 2000 even with spaces before and after the 2nd quote
    aa,bb,cc,"d, d " , This works even in Excel 2000 even with spaces before and after the 2nd quote
    aa,bb,cc,"d , d " , This works even in Excel 2000 even with spaces before and after the 2nd quote

Rule: If you want to display a comma in a cell (field) of a .csv file, start and end the field with double quotes, but avoid white space before the 1st quote.
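The "no white space before the first quote" rule is not specific to Excel. As a hedged illustration, Python's csv module treats a field that starts with a space as unquoted by default, and only accepts the quoted form if `skipinitialspace=True` is set. The `parse` helper is a made-up name.

```python
import csv
import io


def parse(line, skipinitialspace=False):
    """Parse a single CSV line into a list of fields."""
    reader = csv.reader(io.StringIO(line), skipinitialspace=skipinitialspace)
    return next(reader)


line = 'aa,bb,cc, "d,d"'          # note the space before the opening quote
strict = parse(line)              # the quote is not recognized, so d,d is split
lenient = parse(line, skipinitialspace=True)
print(len(strict), len(lenient))
```

If customers produce files by hand, enabling the lenient option on the reading side may be easier than asking them to remove the spaces.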

User · Answer

You can read the csv file like this; it makes use of splits and takes care of spaces.

    ArrayList List = new ArrayList();
    static ServerSocket Server;
    static Socket socket;
    static ArrayList<Object> list = new ArrayList<Object>();

    public static void ReadFromXcel() throws FileNotFoundException
    {
        File f = new File("Book.csv");
        Scanner in = new Scanner(f);
        int count = 0;
        String[] date;
        String[] name;
        String[] Temp = new String[10];
        String[] Temp2 = new String[10];
        String[] numbers;
        ArrayList<String[]> List = new ArrayList<String[]>();
        HashMap m = new HashMap();

        // Skip the first line, then read the three header rows
        in.nextLine();
        date = in.nextLine().split(",");
        name = in.nextLine().split(",");
        numbers = in.nextLine().split(",");
        while (in.hasNext())
        {
            String[] one = in.nextLine().split(",");
            List.add(one);
        }
        int xount = 0;
        // Making sure the lines don't start with a blank
        for (int y = 0; y <= date.length - 1; y++)
        {
            if (!date[y].equals(""))
            {
                Temp[xount] = date[y];
                Temp2[xount] = name[y];
                xount++;
            }
        }

        date = Temp;
        name = Temp2;
        int counter = 0;
        while (counter < List.size())
        {
            String[] list = List.get(counter);
            String sNo = list[0];
            String Surname = list[1];
            String Name = list[2];
            for (int x = 3; x < list.length; x++)
            {
                m.put(numbers[x], list[x]);
            }
            Object newOne = new newOne(sNo, Name, Surname, m, false);
            StudentList.add(s);
            System.out.println(s.sNo);
            counter++;
        }
    }

User · Answer

I think the easiest solution to this problem is to have the customer open the CSV in Excel and press Ctrl+H (Find and Replace) to replace every comma with whatever identifier you want. This is very easy for the customer and requires only one change in your code: reading the delimiter of your choice.

User · Answer

In Europe we had this problem much earlier than this question. In Europe we all use a comma for the decimal point. See the numbers below:

    American        Europe
    -------------   -------------
    0.5             0,5
    3.14159265359   3,14159265359
    17.54           17,54
    175,186.15      175.186,15

So it isn't possible to use the comma as the separator for CSV files. For that reason, CSV files in Europe are separated by a semicolon (;).

Programs like Microsoft Excel can read files with a semicolon, and it's possible to switch the separator. You could even use a tab (\t) as the separator. See this answer from Super User.
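Reading such a semicolon-separated file needs only a different delimiter setting in most libraries. The sketch below uses Python's csv module; `read_semicolon` and `parse_decimal` are hypothetical helpers, and `parse_decimal` assumes the European convention shown in the table (dot as thousands separator, comma as decimal point).

```python
import csv
import io


def read_semicolon(text):
    """Parse semicolon-separated text into rows."""
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=";"))


def parse_decimal(value):
    """Convert a European-formatted number such as '175.186,15' to a float."""
    return float(value.replace(".", "").replace(",", "."))


text = "item;price\r\nwidget;17,54\r\ngadget;175.186,15\r\n"
rows = read_semicolon(text)
prices = [parse_decimal(row[1]) for row in rows[1:]]
print(prices)
```

The commas inside the prices need no quoting here because the comma is no longer the field separator.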

User · Answer

Use a tab character (\t) to separate the fields.
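A minimal sketch of the tab-separated variant, using the `excel-tab` dialect that ships with Python's csv module (the function names are invented for this example):

```python
import csv
import io


def to_tsv(rows):
    """Serialize rows as tab-separated text."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer, dialect="excel-tab").writerows(rows)
    return buffer.getvalue()


def from_tsv(text):
    """Parse tab-separated text back into rows."""
    return list(csv.reader(io.StringIO(text, newline=""), dialect="excel-tab"))


rows = [["1", "Acme, Inc."], ["2", "Foo Ltd"]]
text = to_tsv(rows)
print(repr(text))
```

Commas pass through unquoted, but a value that itself contains a tab would still need quoting, so the underlying problem moves to a rarer character rather than disappearing.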

User · Answer

Add a reference to Microsoft.VisualBasic (yes, it says VisualBasic, but it works in C# just as well; remember that in the end it is all just IL).

Use the Microsoft.VisualBasic.FileIO.TextFieldParser class to parse the CSV file. Here is the sample code:

    Dim parser As TextFieldParser = New TextFieldParser("C:\mar0112.csv")
    parser.TextFieldType = FieldType.Delimited
    parser.SetDelimiters(",")

    While Not parser.EndOfData
        'Processing row
        Dim fields() As String = parser.ReadFields()
        For Each field As String In fields
            'TODO: Process field
        Next
    End While
    'Close the parser once, after all rows have been read
    parser.Close()

User · Answer

The simplest solution I've found is the one LibreOffice uses:

1. Replace every literal " with a typographic (curly) double quote
2. Put double quotes around your string

You can also use the one that Excel uses:

1. Replace every literal " with ""
2. Put double quotes around your string

Notice that other people recommended doing only step 2 above, but that doesn't work with lines where a " is followed by a , such as a CSV where you want a single column with the string hello",world, as the CSV would read:

    "hello",world"

which is interpreted as a row with two columns: hello and world"

User · Answer

As others have said, you need to escape values that include quotes. Here's a little CSV reader in C# that supports quoted values, including embedded quotes and carriage returns.

By the way, this is unit-tested code. I'm posting it now because this question seems to come up a lot and others may not want an entire library when simple CSV support will do.

You can use it as follows:

    using System;
    public class test
    {
        public static void Main()
        {
            using ( CsvReader reader = new CsvReader( "data.csv" ) )
            {
                foreach( string[] values in reader.RowEnumerator )
                {
                    Console.WriteLine( "Row {0} has {1} values.", reader.RowIndex, values.Length );
                }
            }
            Console.ReadLine();
        }
    }

Here are the classes. Note that you can use the Csv.Escape function to write valid CSV as well.

    using System.IO;
    using System.Text.RegularExpressions;

    public sealed class CsvReader : System.IDisposable
    {
        public CsvReader( string fileName ) : this( new FileStream( fileName, FileMode.Open, FileAccess.Read ) )
        {
        }

        public CsvReader( Stream stream )
        {
            __reader = new StreamReader( stream );
        }

        public System.Collections.IEnumerable RowEnumerator
        {
            get {
                if ( null == __reader )
                    throw new System.ApplicationException( "I can't start reading without CSV input." );

                __rowno = 0;
                string sLine;
                string sNextLine;

                while ( null != ( sLine = __reader.ReadLine() ) )
                {
                    // A line with an unbalanced quote continues on the next physical line
                    while ( rexRunOnLine.IsMatch( sLine ) && null != ( sNextLine = __reader.ReadLine() ) )
                        sLine += "\n" + sNextLine;

                    __rowno++;
                    string[] values = rexCsvSplitter.Split( sLine );

                    for ( int i = 0; i < values.Length; i++ )
                        values[i] = Csv.Unescape( values[i] );

                    yield return values;
                }

                __reader.Close();
            }
        }

        public long RowIndex { get { return __rowno; } }

        public void Dispose()
        {
            if ( null != __reader ) __reader.Dispose();
        }

        //============================================

        private long __rowno = 0;
        private TextReader __reader;
        // Splits on commas that are followed by an even number of quotes
        private static Regex rexCsvSplitter = new Regex( @",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))" );
        // Matches a line that contains an odd number of quotes
        private static Regex rexRunOnLine = new Regex( @"^[^""]*(?:""[^""]*""[^""]*)*""[^""]*$" );
    }

    public static class Csv
    {
        public static string Escape( string s )
        {
            if ( s.Contains( QUOTE ) )
                s = s.Replace( QUOTE, ESCAPED_QUOTE );

            if ( s.IndexOfAny( CHARACTERS_THAT_MUST_BE_QUOTED ) > -1 )
                s = QUOTE + s + QUOTE;

            return s;
        }

        public static string Unescape( string s )
        {
            if ( s.StartsWith( QUOTE ) && s.EndsWith( QUOTE ) )
            {
                s = s.Substring( 1, s.Length - 2 );

                if ( s.Contains( ESCAPED_QUOTE ) )
                    s = s.Replace( ESCAPED_QUOTE, QUOTE );
            }

            return s;
        }

        private const string QUOTE = "\"";
        private const string ESCAPED_QUOTE = "\"\"";
        private static char[] CHARACTERS_THAT_MUST_BE_QUOTED = { ',', '"', '\n' };
    }

User · Answer

    // Requires System.Collections.Generic; must be declared inside a static class.
    public static IEnumerable<string> LineSplitter(this string line, char separator, char skip = '"')
    {
        var fieldStart = 0;
        for (var i = 0; i < line.Length; i++)
        {
            if (line[i] == skip)
            {
                // Jump to the closing quote so separators inside quotes are ignored
                for (i++; i < line.Length && line[i] != skip; i++) { }
            }
            else if (line[i] == separator)
            {
                yield return line.Substring(fieldStart, i - fieldStart);
                fieldStart = i + 1;
            }
        }

        // Emit the final field (empty when the line ends with a separator)
        yield return line.Substring(fieldStart);
    }

User · Answer

As of 2017, CSV is fully specified: RFC 4180.

It is a very common specification, and is completely covered by many libraries (example).

Simply use any easily available CSV library, that is to say, one that implements RFC 4180.

There's actually a spec for the CSV format and how to handle commas:

    Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.

http://tools.ietf.org/html/rfc4180

So, to have the values foo and bar,baz, you do this:

    foo,"bar,baz"

Another important requirement to consider (also from the spec):

    If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example:

    "aaa","b""bb","ccc"
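Both quoted passages can be checked against a library. The sketch below uses Python's csv module, whose default dialect follows these two rules; `parse_line` and `format_line` are hypothetical helper names, and `csv.QUOTE_ALL` is used on the writing side only to reproduce the fully quoted form from the spec's example.

```python
import csv
import io


def parse_line(line):
    """Parse one CSV record into a list of fields."""
    return next(csv.reader(io.StringIO(line, newline="")))


def format_line(values):
    """Write one record with every field quoted and inner quotes doubled."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="").writerow(values)
    return buffer.getvalue()


print(parse_line('foo,"bar,baz"'))
print(parse_line('"aaa","b""bb","ccc"'))
print(format_line(["aaa", 'b"bb', "ccc"]))
```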

User · Answer

I generally URL-encode the fields which can contain commas or any special characters, and then decode them when they are used/displayed in any visual medium.

(A comma becomes %2C.)

Every language should have methods to URL-encode and decode strings. E.g., in Java:

    URLEncoder.encode(myString, "UTF-8"); //to encode
    URLDecoder.decode(myEncodedstring, "UTF-8"); //to decode

I know this is a very general solution, and it might not be ideal for situations where the user wants to view the content of the CSV file manually.
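The same approach in Python, as a rough sketch: `urllib.parse.quote` and `unquote` are standard-library functions, while `encode_row` and `decode_row` are names made up here. Passing `safe=""` makes sure no reserved character is left unencoded.

```python
from urllib.parse import quote, unquote


def encode_row(values):
    """Percent-encode each field, then join with plain commas."""
    return ",".join(quote(value, safe="") for value in values)


def decode_row(line):
    """Split on commas (safe now) and decode each field."""
    return [unquote(field) for field in line.split(",")]


line = encode_row(["Acme, Inc.", "Berlin"])
print(line)
print(decode_row(line))
```

Because commas, quotes, and newlines are all encoded, a plain split on commas and on line breaks is sufficient on the reading side.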

User · Answer

If you feel like reinventing the wheel, the following may work for you:

    public static IEnumerable<string> SplitCSV(string line)
    {
        var s = new StringBuilder();
        bool escaped = false, inQuotes = false;
        foreach (char c in line)
        {
            if (c == ',' && !inQuotes)
            {
                // Unquoted comma: end of the current field
                yield return s.ToString();
                s.Clear();
            }
            else if (c == '\\' && !escaped)
            {
                escaped = true;
            }
            else if (c == '"' && !escaped)
            {
                inQuotes = !inQuotes;
            }
            else
            {
                escaped = false;
                s.Append(c);
            }
        }
        yield return s.ToString();
    }

User · Answer

I used the Papa Parse library to parse the CSV file into key-value pairs (key = header from the first row of the CSV file, value = the field).

Here is the example that I use:

https://codesandbox.io/embed/llqmrp96pm

It has a dummy CSV file in there for the CSV parsing demo.

I've used it within ReactJS, though it is easy and simple to replicate in an app written in any language.
