Java: splitting a comma-separated string but ignoring commas in quotes

Question

I have a string vaguely like this:

    foo,bar,c;qual="baz,blurb",d;junk="quux,syzygy"

that I want to split by commas, but I need to ignore commas in quotes. How can I do this? It seems like a regexp approach fails; I suppose I can manually scan and enter a different mode when I see a quote, but it would be nice to use preexisting libraries. (Edit: I guess I meant libraries that are already part of the JDK or part of a commonly used library like Apache Commons.)

The above string should split into:

    foo
    bar
    c;qual="baz,blurb"
    d;junk="quux,syzygy"

Note: this is NOT a CSV file; it's a single string contained in a file with a larger overall structure.

User · Answer

Try a lookaround like (?!\"),(?!\"). This should match a , that is not surrounded by ".

User · Answer

Try:

    public class Main {
        public static void main(String[] args) {
            String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
            String[] tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
            for (String t : tokens) {
                System.out.println("> " + t);
            }
        }
    }

Output:

    > foo
    > bar
    > c;qual="baz,blurb"
    > d;junk="quux,syzygy"

In other words: split on the comma only if that comma has zero, or an even number of, quotes ahead of it.

Or, a bit friendlier for the eyes:

    public class Main {
        public static void main(String[] args) {
            String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";

            String otherThanQuote = " [^\"] ";
            String quotedString = String.format(" \" %s* \" ", otherThanQuote);
            String regex = String.format("(?x) " + // enable comments, ignore white spaces
                    ",                         " + // match a comma
                    "(?=                       " + // start positive look ahead
                    "  (?:                     " + //   start non-capturing group 1
                    "    %s*                   " + //     match 'otherThanQuote' zero or more times
                    "    %s                    " + //     match 'quotedString'
                    "  )*                      " + //   end group 1 and repeat it zero or more times
                    "  %s*                     " + //   match 'otherThanQuote'
                    "  $                       " + //   match the end of the string
                    ")                         ",  // stop positive look ahead
                    otherThanQuote, quotedString, otherThanQuote);

            String[] tokens = line.split(regex, -1);
            for (String t : tokens) {
                System.out.println("> " + t);
            }
        }
    }

which produces the same output as the first example.

EDIT: As mentioned by @MikeFHay in the comments:

    I prefer using Guava's Splitter, as it has saner defaults (see discussion above about empty matches being trimmed by String#split()), so I did:

    Splitter.on(Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))
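The -1 limit passed to split in this answer is what keeps trailing empty tokens; without a limit, String.split drops them. A minimal sketch of the difference, using the same lookahead regex (the class and method names here are only illustrative, not from the answer):

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Same lookahead regex as in the answer: a comma followed by an even number of quotes.
    static final String COMMA_OUTSIDE_QUOTES = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    // Default limit (0): trailing empty strings are removed from the result.
    static String[] splitDroppingTrailingEmpties(String line) {
        return line.split(COMMA_OUTSIDE_QUOTES);
    }

    // Negative limit: every token is kept, including trailing empty ones.
    static String[] splitKeepingTrailingEmpties(String line) {
        return line.split(COMMA_OUTSIDE_QUOTES, -1);
    }

    public static void main(String[] args) {
        String line = "foo,\"a,b\",,"; // two empty fields at the end
        System.out.println(Arrays.toString(splitDroppingTrailingEmpties(line)));
        System.out.println(Arrays.toString(splitKeepingTrailingEmpties(line)));
    }
}
```

If empty fields at the end of the string carry meaning in your format, the negative limit is probably the one you want.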

User · Answer

Rather than use lookahead and other crazy regex, just pull out the quotes first. That is, for every quoted grouping, replace that grouping with __IDENTIFIER_1 or some other indicator, and record the pair in a map of string to string (identifier to original text). After you split on commas, replace all mapped identifiers with the original string values.
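A rough sketch of how that could look. The placeholder format, the class name, and the assumption that the placeholder text never occurs in real input are all mine, not part of the answer; quotes are assumed to be balanced and unescaped.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderSplit {
    // A quoted grouping: a quote, any non-quote characters, a closing quote.
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]*\"");

    static List<String> split(String input) {
        // Step 1: swap each quoted grouping for a placeholder and remember the original.
        Map<String, String> originals = new LinkedHashMap<>();
        Matcher m = QUOTED.matcher(input);
        StringBuffer masked = new StringBuffer();
        while (m.find()) {
            // Hypothetical placeholder format; the trailing "__" keeps _1 and _10 distinct.
            String key = "__IDENTIFIER_" + originals.size() + "__";
            originals.put(key, m.group());
            m.appendReplacement(masked, key);
        }
        m.appendTail(masked);

        // Step 2: a plain split is now safe, since no quoted commas remain.
        List<String> result = new ArrayList<>();
        for (String token : masked.toString().split(",", -1)) {
            // Step 3: put the original quoted text back.
            for (Map.Entry<String, String> e : originals.entrySet()) {
                token = token.replace(e.getKey(), e.getValue());
            }
            result.add(token);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(split("foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\""));
    }
}
```

The approach trades regex complexity for two extra passes over the string, which may or may not matter for your input sizes.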

User · Answer

I would do something like this (pseudocode):

    boolean foundQuote = false;

    if (charAtIndex(currentStringIndex) == '"') {
        foundQuote = true;
    }

    if (foundQuote) {
        // do nothing
    } else {
        String[] split = currentString.split(",");
    }

User · Answer

You're in that annoying boundary area where regexps almost won't do (as has been pointed out by Bart, escaping the quotes would make life hard), and yet a full-blown parser seems like overkill.

If you are likely to need greater complexity any time soon, I would go looking for a parser library. For example this one

User · Answer

The simplest approach is not to match the delimiters (i.e. commas) with complex additional logic for what is actually intended (the data, which might be quoted strings) just to exclude false delimiters, but rather to match the intended data in the first place.

The pattern consists of two alternatives: a quoted string ("[^"]*" or ".*?") or everything up to the next comma ([^,]+). To support empty cells, we have to allow the unquoted item to be empty and to consume the next comma, if any, and use the \G anchor:

    Pattern p = Pattern.compile("\\G\"(.*?)\",?|([^,]*),?");

The pattern also contains two capturing groups to get either the quoted string's content or the plain content.

Then, with Java 9, we can get an array as

    String[] a = p.matcher(input).results()
        .map(m -> m.group(m.start(1) < 0 ? 2 : 1))
        .toArray(String[]::new);

whereas older Java versions need a loop like

    for (Matcher m = p.matcher(input); m.find(); ) {
        String token = m.group(m.start(1) < 0 ? 2 : 1);
        System.out.println("found: " + token);
    }

Adding the items to a List or an array is left as an exercise to the reader.

For Java 8, you can use the results() implementation of this answer to do it like the Java 9 solution.

For mixed content with embedded strings, like in the question, you can simply use

    Pattern p = Pattern.compile("\\G((\"(.*?)\"|[^,])*),?");

But then, the strings are kept in their quoted form.
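One way the loop variant might be turned into a method that collects the cells into a List. This is only a sketch: it uses the first pattern from the answer, so it assumes cells are either quoted as a whole or not quoted at all. The end-of-input check is an addition of this sketch (not part of the answer), meant to stop the matcher from producing one more empty match after the last cell has been consumed; the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchCells {
    // Group 1: content of a quoted cell. Group 2: content of a plain cell.
    private static final Pattern CELL = Pattern.compile("\\G\"(.*?)\",?|([^,]*),?");

    static List<String> cells(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = CELL.matcher(input);
        while (m.find()) {
            // Pick whichever group actually took part in the match.
            out.add(m.group(m.start(1) < 0 ? 2 : 1));

            // Stop once the input is used up, unless the match ended on a
            // delimiter comma (in that case one more, empty, cell follows).
            boolean atEnd = m.end() == input.length();
            boolean consumedComma = m.end() > m.start()
                    && input.charAt(m.end() - 1) == ',';
            if (atEnd && !consumedComma) {
                break;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(cells("a,\"b,c\",,d"));
    }
}
```

Note that the quotes are stripped from quoted cells here, since group 1 captures only their content.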

User · Answer

I would not advise the regex answer from Bart; I find the parsing solution better in this particular case (as Fabian proposed). I tried the regex solution and my own parsing implementation, and found that:

1. Parsing is much faster than splitting with the lookahead regex: about 20 times faster for short strings, about 40 times faster for long strings.
2. The regex fails to find the empty string after the last comma. That was not in the original question, though; it was my requirement.

My solution and test are below.

    // needs java.util.ArrayList, java.util.Arrays, java.util.List
    String tested = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\",";
    long start = System.nanoTime();
    String[] tokens = tested.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
    long timeWithSplitting = System.nanoTime() - start;

    start = System.nanoTime();
    List<String> tokensList = new ArrayList<String>();
    boolean inQuotes = false;
    StringBuilder b = new StringBuilder();
    for (char c : tested.toCharArray()) {
        switch (c) {
        case ',':
            if (inQuotes) {
                b.append(c);
            } else {
                tokensList.add(b.toString());
                b = new StringBuilder();
            }
            break;
        case '\"':
            inQuotes = !inQuotes;
            // intentional fall-through: the quote itself is kept in the token
        default:
            b.append(c);
            break;
        }
    }
    tokensList.add(b.toString());
    long timeWithParsing = System.nanoTime() - start;

    System.out.println(Arrays.toString(tokens));
    System.out.println(tokensList.toString());
    System.out.printf("Time with splitting:\t%10d\n", timeWithSplitting);
    System.out.printf("Time with parsing:\t%10d\n", timeWithParsing);

Of course you are free to change the switch to else-ifs in this snippet if you feel uncomfortable with its ugliness. Note the missing break after the case for the quote character: the fall-through into default is intentional. StringBuilder was chosen over StringBuffer by design, to increase speed where thread safety is irrelevant.

User · Answer

What about a one-liner using String.split()?

    String s = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
    String[] split = s.split("(?<!\".{0,255}[^\"]),|,(?![^\"].*\")");

User · Answer

http://sourceforge.net/projects/javacsv/
https://github.com/pupi1985/JavaCSV-Reloaded (fork of the previous library that will allow the generated output to have Windows line terminators \r\n when not running Windows)
http://opencsv.sourceforge.net/
CSV API for Java
Can you recommend a Java library for reading (and possibly writing) CSV files?
Java lib or app to convert CSV to XML file?

User · Answer

While I do like regular expressions in general, for this kind of state-dependent tokenization I believe a simple parser (which in this case is much simpler than that word might make it sound) is probably a cleaner solution, in particular with regard to maintainability, e.g.:

    // needs java.util.ArrayList, java.util.List
    String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
    List<String> result = new ArrayList<String>();
    int start = 0;
    boolean inQuotes = false;
    for (int current = 0; current < input.length(); current++) {
        if (input.charAt(current) == '\"') inQuotes = !inQuotes; // toggle state
        else if (input.charAt(current) == ',' && !inQuotes) {
            result.add(input.substring(start, current));
            start = current + 1;
        }
    }
    result.add(input.substring(start));

If you don't care about preserving the commas inside the quotes, you could simplify this approach (no handling of the start index, no last-character special case) by replacing the commas in quotes with something else and then splitting at commas:

    // needs java.util.Arrays, java.util.List
    String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
    StringBuilder builder = new StringBuilder(input);
    boolean inQuotes = false;
    for (int currentIndex = 0; currentIndex < builder.length(); currentIndex++) {
        char currentChar = builder.charAt(currentIndex);
        if (currentChar == '\"') inQuotes = !inQuotes; // toggle state
        if (currentChar == ',' && inQuotes) {
            builder.setCharAt(currentIndex, ';'); // or some other placeholder character, and replace later
        }
    }
    List<String> result = Arrays.asList(builder.toString().split(","));

User · Answer

I was impatient and chose not to wait for answers. For reference, it doesn't look that hard to do something like this (which works for my application; I don't need to worry about escaped quotes, as the stuff in quotes is limited to a few constrained forms):

    // needs java.util.ArrayList, java.util.Collections, java.util.List,
    // java.util.regex.Matcher, java.util.regex.Pattern
    final static private Pattern splitSearchPattern = Pattern.compile("[\",]");

    private List<String> splitByCommasNotInQuotes(String s) {
        if (s == null)
            return Collections.emptyList();

        List<String> list = new ArrayList<String>();
        Matcher m = splitSearchPattern.matcher(s);
        int pos = 0;
        boolean quoteMode = false;
        while (m.find()) {
            String sep = m.group();
            if ("\"".equals(sep)) {
                quoteMode = !quoteMode;
            } else if (!quoteMode && ",".equals(sep)) {
                int toPos = m.start();
                list.add(s.substring(pos, toPos));
                pos = m.end();
            }
        }
        if (pos < s.length())
            list.add(s.substring(pos));
        return list;
    }

(Exercise for the reader: extend this to handle escaped quotes by looking for backslashes also.)
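For the escaped-quote case mentioned at the end, one possible direction is a character scanner that treats a backslash as "take the next character literally". This is a sketch under assumptions that are mine rather than the answer's: the escape character is a backslash, escapes are kept verbatim in the output, and a trailing empty token is kept (unlike the method above, which skips it). The class name is illustrative.

```java
import java.util.ArrayList;
import java.util.List;

class EscapeAwareSplit {
    static List<String> split(String s) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                // Escaped character: copy the backslash and the next char as-is,
                // so an escaped quote does not toggle the quote state.
                current.append(c).append(s.charAt(++i));
            } else if (c == '"') {
                inQuotes = !inQuotes; // toggle state, keep the quote in the token
                current.append(c);
            } else if (c == ',' && !inQuotes) {
                out.add(current.toString()); // delimiter outside quotes ends the token
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        out.add(current.toString()); // last token (may be empty)
        return out;
    }

    public static void main(String[] args) {
        // The middle field contains an escaped quote followed by a comma.
        System.out.println(split("a,b=\"x\\\",y\",c"));
    }
}
```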
