[java] Read large files in Java

I need advice from someone who knows Java and its memory behavior very well. I have a large file (about 1.5 GB) and I need to split it into many smaller files (100, for example).

I know generally how to do it (using a BufferedReader), but I would like to know if you have any advice regarding memory, or tips on how to do it faster.

My file contains text, not binary data, and has about 20 characters per line.

This question is related to java memory-management file

The answer is


You can consider using memory-mapped files, via FileChannel.

This is generally a lot faster for large files. There are performance trade-offs that could make it slower, so YMMV.

Related answer: Java NIO FileChannel versus FileOutputstream performance / usefulness
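To make the memory-mapping suggestion concrete, here is a minimal sketch, not a tuned implementation. It maps the file one window at a time and only counts newline bytes, which is the scan you would need before cutting the file at line boundaries. The class name, method name and 64 MiB window size are assumptions made for illustration; the one hard constraint is that a single mapping cannot exceed Integer.MAX_VALUE bytes.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedLineCounter {
    // Assumed window size; each mapping must stay below Integer.MAX_VALUE bytes.
    private static final long WINDOW = 64L * 1024 * 1024;

    /** Counts '\n' bytes by mapping the file read-only, one window at a time. */
    static long countNewlines(Path file) throws IOException {
        long newlines = 0;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            for (long pos = 0; pos < size; pos += WINDOW) {
                long length = Math.min(WINDOW, size - pos);
                MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, pos, length);
                while (buffer.hasRemaining()) {
                    if (buffer.get() == '\n') {
                        newlines++;
                    }
                }
            }
        }
        return newlines;
    }
}
```

Whether this beats a plain BufferedReader depends on the OS, the disk and the access pattern, so measure before committing to it.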


Does it have to be done in Java? I.e. does it need to be platform independent? If not, I'd suggest using the 'split' command in *nix. If you really wanted, you could execute this command from your Java program. While I haven't tested it, I imagine it would perform faster than whatever Java IO implementation you could come up with.
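If you do go the 'split' route from inside Java, a rough sketch with ProcessBuilder could look like this. It assumes a POSIX-style split binary on the PATH that accepts -l for lines per output file; the class and method names are made up for the example, and nothing here has been benchmarked.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

final class SplitCommand {
    /** Builds the argument list for: split -l <linesPerFile> <input> <outputPrefix> */
    static List<String> build(Path input, int linesPerFile, String outputPrefix) {
        return Arrays.asList("split", "-l", Integer.toString(linesPerFile),
                input.toString(), outputPrefix);
    }

    /** Runs split and returns its exit code (0 conventionally means success). */
    static int run(Path input, int linesPerFile, String outputPrefix)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(build(input, linesPerFile, outputPrefix))
                .inheritIO() // forward split's own error messages to this process
                .start();
        return process.waitFor();
    }
}
```

This gives up platform independence, so it only makes sense if the program will always run where split is installed.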


Unless you accidentally read in the whole input file instead of reading it line by line, your primary limitation will be disk speed. You may want to start with a file containing 100 lines and write it to 100 different files, one line in each, with the switch to the next file triggered by the number of lines written to the current file. That program will scale easily to your situation.
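A small sketch of that line-count trigger, which you can try on a 100-line file first and then point at the big one. The class name and the part0.txt, part1.txt naming scheme are assumptions for the example, not anything from the question.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class LineSplitter {
    /** Splits input into files of at most linesPerFile lines; returns the number of files written. */
    static int split(Path input, Path outputDir, int linesPerFile) throws IOException {
        if (linesPerFile <= 0) {
            throw new IllegalArgumentException("linesPerFile must be positive");
        }
        int fileCount = 0;
        int linesInCurrent = 0;
        BufferedWriter writer = null;
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Trigger: open the next output file once the current one is full.
                if (writer == null || linesInCurrent == linesPerFile) {
                    if (writer != null) {
                        writer.close();
                    }
                    // Assumed naming scheme: part0.txt, part1.txt, ...
                    writer = Files.newBufferedWriter(
                            outputDir.resolve("part" + fileCount + ".txt"), StandardCharsets.UTF_8);
                    fileCount++;
                    linesInCurrent = 0;
                }
                writer.write(line);
                writer.newLine();
                linesInCurrent++;
            }
        } finally {
            if (writer != null) {
                writer.close();
            }
        }
        return fileCount;
    }
}
```

Only one line and one output buffer are held in memory at a time, so heap use stays flat regardless of the input size.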


Don't use read() without arguments. It returns a single byte per call and is very slow. It is better to read into a buffer and write that buffer to the output file right away.

Use BufferedInputStream, because it supports binary reading.

That's all.
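A minimal sketch of reading into a buffer rather than calling read() byte by byte. The class name and the 8 KiB buffer size are assumptions picked for illustration.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class ChunkedCopy {
    /** Copies source to target in chunks and returns the number of bytes copied. */
    static long copy(Path source, Path target) throws IOException {
        byte[] buffer = new byte[8192]; // assumed chunk size
        long total = 0;
        try (InputStream in = new BufferedInputStream(Files.newInputStream(source));
             OutputStream out = Files.newOutputStream(target)) {
            int read;
            // read(byte[]) fills as much of the buffer as it can and returns -1 at end of stream.
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
                total += read;
            }
        }
        return total;
    }
}
```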


To save memory, do not unnecessarily store/duplicate the data in memory (i.e. do not assign them to variables outside the loop). Just process the output immediately as soon as the input comes in.

It really doesn't matter whether you're using BufferedReader or not. It will not cost significantly more memory, as some implicitly seem to suggest; at most it costs a few percent of performance. The same applies to NIO: it only improves scalability, not memory use, and it only becomes interesting when you have hundreds of threads running on the same file.

Just loop through the file, write every line immediately to other file as you read in, count the lines and if it reaches 100, then switch to next file, etcetera.

Kickoff example:

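// Requires: import java.io.*;
String encoding = "UTF-8";
int maxlines = 100; // lines per output file
BufferedReader reader = null;
BufferedWriter writer = null;

try {
    reader = new BufferedReader(new InputStreamReader(new FileInputStream("/bigfile.txt"), encoding));
    int count = 0;
    for (String line; (line = reader.readLine()) != null;) {
        // Every maxlines lines, close the current output file and open the next one.
        if (count++ % maxlines == 0) {
            close(writer);
            writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("/smallfile" + (count / maxlines) + ".txt"), encoding));
        }
        writer.write(line);
        writer.newLine();
    }
} finally {
    close(writer);
    close(reader);
}

// Null-safe helper used above. It is not part of the JDK, so define it next to this code.
static void close(Closeable resource) {
    if (resource != null) {
        try {
            resource.close();
        } catch (IOException ignore) {
            // Nothing sensible can be done here if closing fails.
        }
    }
}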

This is a very good article: http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/

In summary, for great performance, you should:

  1. Avoid accessing the disk.
  2. Avoid accessing the underlying operating system.
  3. Avoid method calls.
  4. Avoid processing bytes and characters individually.

For example, to reduce the access to disk, you can use a large buffer. The article describes various approaches.
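As one way to apply the large-buffer advice, BufferedReader accepts an explicit buffer size as its second constructor argument. The sketch below just counts lines; the class name and the 1 MiB size are assumptions, and the best size for a given disk is something to measure rather than take from here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class LargeBufferLineCount {
    // Assumed size: 1 Mi chars instead of BufferedReader's much smaller default.
    private static final int BUFFER_SIZE = 1 << 20;

    /** Counts lines while letting the reader fetch large chunks from disk. */
    static long countLines(Path file) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), StandardCharsets.UTF_8),
                BUFFER_SIZE)) {
            long lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            return lines;
        }
    }
}
```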


Yes. I also think that using read() with arguments, such as read(char[] cbuf, int off, int len), is a better way to read such a large file (e.g. read(buffer, 0, buffer.length)).

I have also run into missing values when using BufferedReader instead of BufferedInputStream on a binary input stream. So BufferedInputStream is the much better choice in a case like this.


package all.is.well;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import junit.framework.TestCase;

/**
 * @author Naresh Bhabat
 *
 * Following implementation helps to deal with extra large files in java.
 * This program is tested for dealing with 2GB input file.
 * There are some points where extra logic can be added in future.
 *
 * Please note: if we want to deal with binary input file, then instead of reading line, we need to read bytes from read file object.
 *
 * It uses random access file, which is almost like streaming API.
 *
 * ****************************************
 * Notes regarding executor framework and its readings.
 * Please note: ExecutorService executor = Executors.newFixedThreadPool(10);
 *
 *      for 10 threads: Total time required for reading and writing the text in
 *         seconds: 349.317
 *
 *         For 100: Total time required for reading the text and writing: seconds 464.042
 *
 *         For 1000: Total time required for reading and writing text: 466.538
 *         For 10000: Total time required for reading and writing in seconds: 479.701
 *
 */
public class DealWithHugeRecordsinFile extends TestCase {

 static final String FILEPATH = "C:\\springbatch\\bigfile1.txt.txt";
 static final String FILEPATH_WRITE = "C:\\springbatch\\writinghere.txt";
 static volatile RandomAccessFile fileToWrite;
 static volatile RandomAccessFile file;
 static volatile String fileContentsIter;
 static volatile int position = 0;

 public static void main(String[] args) throws IOException, InterruptedException {
  long currentTimeMillis = System.currentTimeMillis();

  try {
   fileToWrite = new RandomAccessFile(FILEPATH_WRITE, "rw");//for random write,independent of thread obstacles
   file = new RandomAccessFile(FILEPATH, "r");//for random read,independent of thread obstacles
   seriouslyReadProcessAndWriteAsynch();

  } catch (IOException e) {
   // TODO Auto-generated catch block
   e.printStackTrace();
  }
  Thread currentThread = Thread.currentThread();
  System.out.println(currentThread.getName());
  long currentTimeMillis2 = System.currentTimeMillis();
  double time_seconds = (currentTimeMillis2 - currentTimeMillis) / 1000.0;
  System.out.println("Total time required for reading the text in seconds " + time_seconds);

 }

 /**
  * @throws IOException
  * Something  asynchronously serious
  */
 public static void seriouslyReadProcessAndWriteAsynch() throws IOException {
  ExecutorService executor = Executors.newFixedThreadPool(10);//pls see for explanation in comments section of the class
  while (true) {
   String readLine = file.readLine();
   if (readLine == null) {
    break;
   }
   Runnable genuineWorker = new Runnable() {
    @Override
    public void run() {
     // do hard processing here in this thread,i have consumed
     // some time and ignore some exception in write method.
     writeToFile(FILEPATH_WRITE, readLine);
     // System.out.println(" :" +
     // Thread.currentThread().getName());

    }
   };
   executor.execute(genuineWorker);
  }
  executor.shutdown();
  while (!executor.isTerminated()) {
  }
  System.out.println("Finished all threads");
  file.close();
  fileToWrite.close();
 }

 /**
  * @param filePath
  * @param data
  * @param position
  */
 private static void writeToFile(String filePath, String data) {
  try {
   // fileToWrite.seek(position);
   data = "\n" + data;
   if (!data.contains("Randomization")) {
    return;
   }
   System.out.println("Let us do something time consuming to make this thread busy"+(position++) + "   :" + data);
   System.out.println("Lets consume through this loop");
   int i=1000;
   while(i>0){

    i--;
   }
   fileToWrite.write(data.getBytes());
   throw new Exception();
  } catch (Exception exception) {
   System.out.println("exception was thrown but still we are able to proceeed further"
     + " \n This can be used for marking failure of the records");
   //exception.printStackTrace();

  }

 }
}
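For comparison, here is a smaller sketch of the same read, filter and write idea that waits with awaitTermination instead of spinning on isTerminated(), and serializes writes so lines from different threads cannot interleave. It is a variation on the program above, not a drop-in replacement: the class name, the keyword parameter and the one-hour timeout are assumptions, and the output order is not guaranteed to match the input order.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class FilteredParallelWriter {
    /** Writes every line containing keyword to output; returns how many lines were written. */
    static int copyMatching(Path input, Path output, String keyword, int threads)
            throws IOException, InterruptedException {
        AtomicInteger written = new AtomicInteger();
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                final String current = line;
                executor.execute(() -> {
                    if (!current.contains(keyword)) {
                        return;
                    }
                    // Only one thread may touch the shared writer at a time.
                    synchronized (writer) {
                        try {
                            writer.write(current);
                            writer.newLine();
                            written.incrementAndGet();
                        } catch (IOException e) {
                            // A failed record is reported, not fatal, as in the program above.
                            System.err.println("Failed to write record: " + e.getMessage());
                        }
                    }
                });
            }
            executor.shutdown();
            // Block until the queued tasks finish rather than busy-waiting.
            if (!executor.awaitTermination(1, TimeUnit.HOURS)) {
                throw new IOException("Timed out waiting for writer tasks");
            }
        } finally {
            executor.shutdownNow(); // no-op if the pool has already terminated
        }
        return written.get();
    }
}
```

Note that the executor's queue is unbounded, so with very short lines and a fast reader the queued tasks themselves can use a lot of heap. For a plain split, the single-threaded loop is usually the safer starting point.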


You can use java.nio, which can be faster than classic input/output streams:

http://java.sun.com/javase/6/docs/technotes/guides/io/index.html
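As a rough illustration of the NIO route, FileChannel.transferTo can copy byte ranges of the big file into separate output files without pulling the data through a Java-side buffer. Be aware that this sketch splits by byte count, so a chunk can end in the middle of a line; cutting exactly on line boundaries would need an extra scan for newlines. The class name and the chunk0.bin, chunk1.bin naming are assumptions for the example.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ChannelSplitter {
    /** Splits input into files of at most bytesPerFile bytes; returns the number of files written. */
    static int split(Path input, Path outputDir, long bytesPerFile) throws IOException {
        if (bytesPerFile <= 0) {
            throw new IllegalArgumentException("bytesPerFile must be positive");
        }
        int parts = 0;
        try (FileChannel in = FileChannel.open(input, StandardOpenOption.READ)) {
            long size = in.size();
            for (long start = 0; start < size; start += bytesPerFile) {
                long remaining = Math.min(bytesPerFile, size - start);
                // Assumed naming scheme: chunk0.bin, chunk1.bin, ...
                Path part = outputDir.resolve("chunk" + parts + ".bin");
                try (FileChannel out = FileChannel.open(part, StandardOpenOption.CREATE,
                        StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
                    long position = start;
                    // transferTo may move fewer bytes than requested, so loop until done.
                    while (remaining > 0) {
                        long moved = in.transferTo(position, remaining, out);
                        if (moved <= 0) {
                            throw new IOException("transferTo made no progress at " + position);
                        }
                        position += moved;
                        remaining -= moved;
                    }
                }
                parts++;
            }
        }
        return parts;
    }
}
```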

