[java] Text File Parsing in Java

I am reading in a text file using FileInputStream that puts the file contents into a byte array. I then convert the byte array into a String using new String(byte).

Once I have the string I'm using String.split("\n") to split the file into a String array and then taking that string array and parsing it by doing a String.split(",") and hold the contents in an Arraylist.

I have a 200MB+ file and it is running out of memory when I start the JVM up with a 1GB of memory. I know I must be doing something in correctly somewhere, I'm just not sure if the way I'm parsing is incorrect or the data structure I'm using.

It is also taking me about 12 seconds to parse the file which seems like a lot of time. Can anyone point out what I may be doing that is causing me to run out of memory and what may be causing my program to run slow?

The contents of the file look as shown below:

"12334", "100", "1.233", "TEST", "TEXT", "1234"
"12334", "100", "1.233", "TEST", "TEXT", "1234"
.
.
.
"12334", "100", "1.233", "TEST", "TEXT", "1234"

Thanks

This question is related to java file parsing

The answer is


I'm not sure how efficient it is memory-wise, but my first approach would be using a Scanner as it is incredibly easy to use:

File file = new File("/path/to/my/file.txt");
Scanner input = new Scanner(file);

while(input.hasNext()) {
    String nextToken = input.next();
    //or to process line by line
    String nextLine = input.nextLine();
}

input.close();

Check the API for how to alter the delimiter it uses to split tokens.


While calling/invoking your programme you can use this command : java [-options] className [args...]
in place of [-options] provide more memory e.g -Xmx1024m or more. but this is just a workaround, u have to change ur parsing mechanism.


Have a look at these pages. They contain many open source CSV parsers. JSaPar is one of them.


If you have a 200,000,000 character files and split that every five characters, you have 40,000,000 String objects. Assume they are sharing actual character data with the original 400 MB String (char is 2 bytes). A String is say 32 bytes, so that is 1,280,000,000 bytes of String objects.

(It's probably worth noting that this is very implementation dependent. split could create entirely strings with entirely new backing char[] or, OTOH, share some common String values. Some Java implementations to not use the slicing of char[]. Some may use a UTF-8-like compact form and give very poor random access times.)

Even assuming longer strings, that's a lot of objects. With that much data, you probably want to work with most of it in compact form like the original (only with indexes). Only convert to objects that which you need. The implementation should be database like (although they traditionally don't handle variable length strings efficiently).


It sounds like you currently have 3 copies of the entire file in memory: the byte array, the string, and the array of the lines.

Instead of reading the bytes into a byte array and then converting to characters using new String() it would be better to use an InputStreamReader, which will convert to characters incrementally, rather than all up-front.

Also, instead of using String.split("\n") to get the individual lines, you should read one line at a time. You can use the readLine() method in BufferedReader.

Try something like this:

BufferedReader reader = new BufferedReader(new InputStreamReader(fileInputStream, "UTF-8"));
try {
  while (true) {
    String line = reader.readLine();
    if (line == null) break;
    String[] fields = line.split(",");
    // process fields here
  }
} finally {
  reader.close();
}

Examples related to java

Under what circumstances can I call findViewById with an Options Menu / Action Bar item? How much should a function trust another function How to implement a simple scenario the OO way Two constructors How do I get some variable from another class in Java? this in equals method How to split a string in two and store it in a field How to do perspective fixing? String index out of range: 4 My eclipse won't open, i download the bundle pack it keeps saying error log

Examples related to file

Gradle - Move a folder from ABC to XYZ Difference between opening a file in binary vs text Angular: How to download a file from HttpClient? Python error message io.UnsupportedOperation: not readable java.io.FileNotFoundException: class path resource cannot be opened because it does not exist Writing JSON object to a JSON file with fs.writeFileSync How to read/write files in .Net Core? How to write to a CSV line by line? Writing a dictionary to a text file? What are the pros and cons of parquet format compared to other formats?

Examples related to parsing

Got a NumberFormatException while trying to parse a text file for objects Uncaught SyntaxError: Unexpected end of JSON input at JSON.parse (<anonymous>) Python/Json:Expecting property name enclosed in double quotes Correctly Parsing JSON in Swift 3 How to get response as String using retrofit without using GSON or any other library in android UIButton action in table view cell "Expected BEGIN_OBJECT but was STRING at line 1 column 1" How to convert an XML file to nice pandas dataframe? How to extract multiple JSON objects from one file? How to sum digits of an integer in java?