I am reading in a text file using a FileInputStream that puts the file contents into a byte array. I then convert the byte array into a String using new String(bytes). Once I have the string I use String.split("\n") to split the file into a String array, and then parse each of those lines with String.split(",") and hold the contents in an ArrayList.
I have a 200MB+ file and the program runs out of memory even when I start the JVM with 1GB of memory. I know I must be doing something incorrectly somewhere; I'm just not sure whether the way I'm parsing is wrong or the data structure I'm using is. It is also taking about 12 seconds to parse the file, which seems like a lot of time. Can anyone point out what I may be doing that is causing me to run out of memory and what may be causing my program to run slow?
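In rough outline, the code does something like this (a simplified sketch, not my exact code; the class, variable, and file names are illustrative):

import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.List;

public class NaiveParser {
    public static void main(String[] args) throws Exception {
        FileInputStream in = new FileInputStream("data.txt");
        byte[] bytes = in.readAllBytes();            // whole 200MB+ file in one byte array (Java 9+)
        in.close();

        String contents = new String(bytes);         // second full copy, now as a String
        List<String[]> rows = new ArrayList<String[]>();
        for (String line : contents.split("\n")) {   // third copy: one String per line
            rows.add(line.split(","));               // plus one String per field
        }
        System.out.println(rows.size() + " rows parsed");
    }
}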
The contents of the file look as shown below:
"12334", "100", "1.233", "TEST", "TEXT", "1234"
"12334", "100", "1.233", "TEST", "TEXT", "1234"
.
.
.
"12334", "100", "1.233", "TEST", "TEXT", "1234"
Thanks
I'm not sure how efficient it is memory-wise, but my first approach would be using a Scanner as it is incredibly easy to use:
File file = new File("/path/to/my/file.txt");
Scanner input = new Scanner(file); // throws FileNotFoundException if the file is missing

while (input.hasNextLine()) {
    // process the file line by line
    String nextLine = input.nextLine();
    // ...or use input.hasNext()/input.next() to read token by token instead
}
input.close();
Check the API for how to alter the delimiter it uses to split tokens.
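For example, something like this should split directly on the commas (a sketch; the delimiter pattern is my assumption, and the double quotes around each value would still need stripping):

Scanner input = new Scanner(new File("/path/to/my/file.txt"));
input.useDelimiter("\\s*,\\s*|\\r?\\n");  // treat commas (and line breaks) as token boundaries
while (input.hasNext()) {
    String token = input.next();          // e.g. "12334", still wrapped in double quotes
    // strip the quotes / convert to numbers as needed
}
input.close();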
When invoking your program you can use this command: java [-options] className [args...]
In place of [-options] give the JVM more memory, e.g. -Xmx1024m or more. But this is just a workaround; you also need to change your parsing mechanism.
Have a look at the many open source CSV parsers available for Java; JSaPar is one of them.
If you have a 200,000,000 character file and split it every five characters, you get 40,000,000 String objects. Assume they share the actual character data with the original 400 MB String (a char is 2 bytes). A String is, say, 32 bytes, so that is 1,280,000,000 bytes of String objects.
(It's probably worth noting that this is very implementation dependent. split could create strings with entirely new backing char[] arrays or, OTOH, share some common String data. Some Java implementations do not use slicing of a shared char[] at all. Some may use a UTF-8-like compact form and give very poor random access times.)
Even assuming longer strings, that's a lot of objects. With that much data, you probably want to work with most of it in a compact form like the original (only with indexes into it). Only convert the parts you actually need into objects. The implementation should be database-like (although databases traditionally don't handle variable length strings efficiently).
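As a very rough sketch of that idea (the class and method names are made up for illustration): keep the raw bytes exactly once, record only where each field ends, and decode an individual field to a String only when it is actually needed.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

class CompactFields {
    private final byte[] data;           // raw file contents, kept exactly once
    private int[] ends = new int[1024];  // end offset of each field (primitive array, stays small)
    private int count = 0;

    CompactFields(String path) throws IOException {
        data = Files.readAllBytes(Paths.get(path));
        for (int i = 0; i <= data.length; i++) {
            if (i == data.length || data[i] == ',' || data[i] == '\n') {
                if (count == ends.length) ends = Arrays.copyOf(ends, count * 2);
                ends[count++] = i;       // remember the position; no String is created here
            }
        }
    }

    // Decode one field on demand; only the fields you touch become String objects.
    String field(int index) {
        int start = (index == 0) ? 0 : ends[index - 1] + 1;
        return new String(data, start, ends[index] - start, StandardCharsets.UTF_8);
    }
}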
It sounds like you currently have 3 copies of the entire file in memory: the byte array, the string, and the array of the lines.
Instead of reading the bytes into a byte array and then converting to characters using new String(), it would be better to use an InputStreamReader, which will convert to characters incrementally rather than all up-front.
Also, instead of using String.split("\n") to get the individual lines, you should read one line at a time. You can use the readLine() method in BufferedReader.
Try something like this:
BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream("/path/to/my/file.txt"), "UTF-8"));
try {
    while (true) {
        String line = reader.readLine();   // read one line at a time
        if (line == null) break;           // end of file reached
        String[] fields = line.split(",");
        // process fields here rather than keeping every line in memory
    }
} finally {
    reader.close();
}