Parse large JSON file in Nodejs

Question

I have a file which stores many JavaScript objects in JSON form, and I need to read the file, create each of the objects, and do something with them (insert them into a db, in my case). The JavaScript objects can be represented in one of two formats:

Format A:

[{name: 'thing1'},
...
{name: 'thing999999999'}]

or Format B:

{name: 'thing1'}         // <== My choice.
...
{name: 'thing999999999'}

Note that the ... indicates a lot of JSON objects. I am aware I could read the entire file into memory and then use JSON.parse() like this:

fs.readFile(filePath, 'utf-8', function (err, fileContents) {
  if (err) throw err;
  console.log(JSON.parse(fileContents));
});

However, the file could be really large, so I would prefer to use a stream to accomplish this. The problem I see with a stream is that the file contents could be broken into data chunks at any point, so how can I use JSON.parse() on such objects?

Ideally, each object would be read as a separate data chunk, but I am not sure how to do that.

var importStream = fs.createReadStream(filePath, {flags: 'r', encoding: 'utf-8'});
importStream.on('data', function (chunk) {
  var pleaseBeAJSObject = JSON.parse(chunk);
  // insert pleaseBeAJSObject in a database
});
importStream.on('end', function (item) {
  console.log("Woot, imported objects into the database!");
});

Note: I wish to prevent reading the entire file into memory. Time efficiency does not matter to me. Yes, I could try to read a number of objects at once and insert them all at once, but that's a performance tweak. I need a way that is guaranteed not to cause a memory overload, no matter how many objects are contained in the file.

I can choose to use Format A or Format B or maybe something else, just please specify in your answer. Thanks!

User · Answer

Just as I was thinking that it would be fun to write a streaming JSON parser, I also thought that maybe I should do a quick search to see if there's one already available.

Turns out there is.

JSONStream "streaming JSON.parse and stringify"

Since I just found it, I've obviously not used it, so I can't comment on its quality, but I'll be interested to hear if it works.

It does work. Consider the following JavaScript, which uses _.isString (from a utility library such as Lodash or Underscore) to check each item:

stream.pipe(JSONStream.parse('*'))
  .on('data', (d) => {
    console.log(typeof d);
    console.log("isString: " + _.isString(d))
  });

This will log objects as they come in if the stream is an array of objects. Therefore the only thing being buffered is one object at a time.
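To make "one object at a time" concrete, here is a minimal hand-rolled sketch for Format A that uses only Node's built-in modules. It is not how JSONStream is implemented, and the names createArrayScanner and importArrayFile are made up for illustration. It assumes the file is a single JSON array whose elements are objects or arrays (primitive elements are skipped), and it keeps only the text of the current element in memory.

```javascript
const fs = require('fs');

// Hypothetical helper (not part of JSONStream): scans text chunks of a JSON
// array and calls onElement once per top-level object or array element.
function createArrayScanner(onElement) {
  let depth = 0;        // nesting depth of {} and [] outside string literals
  let inString = false; // true while inside a JSON string literal
  let escaped = false;  // true right after a backslash inside a string
  let current = '';     // text of the element currently being captured

  return function push(chunk) {
    for (const ch of chunk) {
      // Depth 1 is the outer array; anything deeper belongs to an element.
      if (inString) {
        if (depth >= 2) current += ch;
        if (escaped) escaped = false;
        else if (ch === '\\') escaped = true;
        else if (ch === '"') inString = false;
      } else if (ch === '"') {
        inString = true;
        if (depth >= 2) current += ch;
      } else if (ch === '{' || ch === '[') {
        depth++;
        if (depth >= 2) current += ch;
      } else if (ch === '}' || ch === ']') {
        if (depth >= 2) current += ch;
        depth--;
        if (depth === 1 && current !== '') {
          onElement(JSON.parse(current)); // one complete element
          current = '';
        }
      } else if (depth >= 2) {
        current += ch;
      }
    }
  };
}

// Assumed usage: filePath points at a Format A file.
function importArrayFile(filePath, onElement) {
  const push = createArrayScanner(onElement);
  // With an encoding set, chunks arrive as strings rather than Buffers.
  return fs.createReadStream(filePath, { encoding: 'utf8' }).on('data', push);
}
```

Because the scanner keeps its state between calls, it does not matter where the stream happens to cut a chunk, even in the middle of a string or an escape sequence. For real workloads a maintained library is probably the safer choice.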

User · Answer

I solved this problem using the split npm module. Pipe your stream into split, and it will "Break up a stream and reassemble it so that each line is a chunk".

Sample code:

var fs = require('fs')
  , split = require('split')
  ;

var stream = fs.createReadStream(filePath, {flags: 'r', encoding: 'utf-8'});
var lineStream = stream.pipe(split());
lineStream.on('data', function (chunk) {
  if (chunk.length === 0) return; // skip blank lines (for example, after a trailing newline)
  var json = JSON.parse(chunk);
  // ...
});

User · Answer

If you have control over the input file, and it's an array of objects, you can solve this more easily. Arrange to output the file with each record on one line, like this:

[
   {"key": value},
   {"key": value},
   ...
]

This is still valid JSON.

Then, use the node.js readline module to process them one line at a time.

var fs = require("fs");

var lineReader = require('readline').createInterface({
    input: fs.createReadStream("input.txt")
});

lineReader.on('line', function (line) {
    line = line.trim();

    // Drop the comma that separates this record from the next one.
    if (line.charAt(line.length - 1) === ',') {
        line = line.slice(0, -1);
    }

    // Lines holding only "[" or "]" are skipped.
    if (line.charAt(0) === '{') {
        processRecord(JSON.parse(line));
    }
});

function processRecord(record) {
    // Process the records one at a time here!
}
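If you also control the program that writes the file, producing that layout is straightforward. The following is only a sketch, and oneRecordPerLine is a made-up name. It relies on the fact that JSON.stringify without an indentation argument never emits a raw newline (newlines inside strings are escaped), so every record stays on its own line.

```javascript
// Hypothetical helper: yields the text of a JSON array piece by piece,
// with exactly one record per line, so it can be written incrementally.
function* oneRecordPerLine(records) {
  yield '[\n';
  let first = true;
  for (const record of records) {
    // The separator comma ends the previous line, matching the layout above.
    yield (first ? '' : ',\n') + JSON.stringify(record);
    first = false;
  }
  yield '\n]\n';
}
```

Since it is a generator, the pieces can be written out as they are produced, for example by wrapping it with stream.Readable.from(...) and piping that into fs.createWriteStream(...), so the writer does not need the whole output in memory either.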

User · Answer

I realize that you want to avoid reading the whole JSON file into memory if possible; however, if you have the memory available it may not be a bad idea performance-wise. Using node.js's require() on a json file loads the data into memory really fast.

I ran two tests to see what the performance looked like on printing out an attribute from each feature from an 81MB geojson file.

In the 1st test, I read the entire geojson file into memory using var data = require('./geo.json'). That took 3330 milliseconds, and then printing out an attribute from each feature took 804 milliseconds, for a grand total of 4134 milliseconds. However, it appeared that node.js was using 411MB of memory.

In the second test, I used @arcseldon's answer with JSONStream + event-stream. I modified the JSONPath query to select only what I needed. This time the memory never went higher than 82MB; however, the whole thing now took 70 seconds to complete!

User · Answer

To process a file line-by-line, you simply need to decouple the reading of the file and the code that acts upon that input. You can accomplish this by buffering your input until you hit a newline. Assuming we have one JSON object per line (basically, format B):

var fs = require('fs');

var stream = fs.createReadStream(filePath, {flags: 'r', encoding: 'utf-8'});
var buf = '';

stream.on('data', function (d) {
    buf += d.toString(); // when data is read, stash it in a string buffer
    pump(); // then process the buffer
});

function pump() {
    var pos;

    while ((pos = buf.indexOf('\n')) >= 0) { // keep going while there's a newline somewhere in the buffer
        if (pos == 0) { // if there's more than one newline in a row, the buffer will now start with a newline
            buf = buf.slice(1); // discard it
            continue; // so that the next iteration will start with data
        }
        processLine(buf.slice(0, pos)); // hand off the line
        buf = buf.slice(pos + 1); // and slice the processed data off the buffer
    }
}

function processLine(line) { // here's where we do something with a line

    if (line[line.length - 1] == '\r') line = line.substr(0, line.length - 1); // discard CR (0x0D)

    if (line.length > 0) { // ignore empty lines
        var obj = JSON.parse(line); // parse the JSON
        console.log(obj); // do something with the data here!
    }
}

Each time the file stream receives data from the file system, it's stashed in a buffer, and then pump is called.

If there's no newline in the buffer, pump simply returns without doing anything. More data (and potentially a newline) will be added to the buffer the next time the stream gets data, and then we'll have a complete object.

If there is a newline, pump slices off the buffer from the beginning to the newline and hands it off to processLine. It then checks again if there's another newline in the buffer (the while loop). In this way, we can process all of the lines that were read in the current chunk.

Finally, processLine is called once per input line. If present, it strips off the carriage return character (to avoid issues with line endings, LF vs CRLF), and then calls JSON.parse on the line. At this point, you can do whatever you need to with your object.

Note that JSON.parse is strict about what it accepts as input; you must quote your identifiers and string values with double quotes. In other words, {name:'thing1'} will throw an error; you must use {"name":"thing1"}.

Because only the current chunk plus any incomplete trailing line is held in memory at a time, this will be extremely memory efficient. It will also be extremely fast. A quick test showed I processed 10,000 rows in under 15ms.
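One gap worth noting: the code above only processes a line once it sees the newline that ends it, so a final record with no trailing newline would stay in the buffer. Here is a minimal sketch of the same buffering idea with an explicit flush on the stream's end event. The names createLineSplitter and importNdjson are made up for illustration, and it assumes one JSON object per line (format B).

```javascript
const fs = require('fs');

// Hypothetical helper: buffers text chunks and calls onLine for each
// complete, non-empty line, with any trailing CR removed.
function createLineSplitter(onLine) {
  let buf = '';

  function emit(line) {
    if (line.endsWith('\r')) line = line.slice(0, -1); // tolerate CRLF
    if (line.length > 0) onLine(line);                 // ignore empty lines
  }

  return {
    // Call for every chunk read from the stream.
    push(chunk) {
      buf += chunk;
      let pos;
      while ((pos = buf.indexOf('\n')) >= 0) {
        emit(buf.slice(0, pos));
        buf = buf.slice(pos + 1);
      }
    },
    // Call once at end of input to handle a last line without a newline.
    flush() {
      emit(buf);
      buf = '';
    },
  };
}

// Assumed usage: filePath points at a format B file.
function importNdjson(filePath, onObject) {
  const splitter = createLineSplitter((line) => onObject(JSON.parse(line)));
  const stream = fs.createReadStream(filePath, { encoding: 'utf8' });
  stream.on('data', (chunk) => splitter.push(chunk));
  stream.on('end', () => splitter.flush());
  return stream;
}
```

Keeping the splitter separate from the stream also makes it easy to exercise with hand-made chunks, for example by pushing half an object, then the rest, and checking what comes out.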

User · Answer

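I had a similar requirement: I needed to read a large JSON file in node.js, process the data in chunks, call an API, and save to MongoDB. inputFile.json looks like this:

{
  "customers": [
    { /* customer data */ },
    { /* customer data */ },
    { /* customer data */ } ...
  ]
}

I used JSONStream and event-stream to process the customers one at a time, pausing the stream until each save has finished:

var fs = require("fs");
var JSONStream = require("JSONStream");
var es = require("event-stream");

var fileStream = fs.createReadStream(filePath, { encoding: "utf8" });
fileStream.pipe(JSONStream.parse("customers.*")).pipe(
  es.through(
    function (data) {
      console.log("printing one customer object read from file ::");
      console.log(data);
      this.pause(); // stop reading until this customer has been saved
      processOneCustomer(data, this);
      return data;
    },
    function end() {
      console.log("stream reading ended");
      this.emit("end");
    }
  )
);

function processOneCustomer(data, es) {
  // DataModel stands for your own database model.
  DataModel.save(function (err, dataModel) {
    es.resume(); // continue with the next customer
  });
}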

User · Answer

As of October 2014, you can just do something like the following (using JSONStream): https://www.npmjs.org/package/JSONStream

var fs = require('fs'),
    JSONStream = require('JSONStream');

var getStream = function () {
    var jsonData = 'myData.json',
        stream = fs.createReadStream(jsonData, { encoding: 'utf8' }),
        parser = JSONStream.parse('*');
    return stream.pipe(parser);
};

getStream().pipe(MyTransformToDoWhateverProcessingAsNeeded).on('error', function (err) {
    // handle any errors
});

To demonstrate with a working example:

npm install JSONStream event-stream

data.json:

{
  "greeting": "hello world"
}

hello.js:

var fs = require('fs'),
    JSONStream = require('JSONStream'),
    es = require('event-stream');

var getStream = function () {
    var jsonData = 'data.json',
        stream = fs.createReadStream(jsonData, { encoding: 'utf8' }),
        parser = JSONStream.parse('*');
    return stream.pipe(parser);
};

getStream()
    .pipe(es.mapSync(function (data) {
        console.log(data);
    }));

$ node hello.js
// hello world

User · Answer

I think you need to use a database. MongoDB is a good choice in this case because it is JSON compatible.

UPDATE: You can use the mongoimport tool to import JSON data into MongoDB.

mongoimport --collection collection --file collection.json

User · Answer

I wrote a module that can do this, called BFJ. Specifically, the method bfj.match can be used to break up a large stream into discrete chunks of JSON:

const bfj = require('bfj');
const fs = require('fs');

const stream = fs.createReadStream(filePath);

bfj.match(stream, (key, value, depth) => depth === 0, { ndjson: true })
  .on('data', object => {
    // do whatever you need to do with object
  })
  .on('dataError', error => {
    // a syntax error was found in the JSON
  })
  .on('error', error => {
    // some kind of operational error occurred
  })
  .on('end', () => {
    // finished processing the stream
  });

Here, bfj.match returns a readable, object-mode stream that will receive the parsed data items, and is passed 3 arguments:

1. A readable stream containing the input JSON.
2. A predicate that indicates which items from the parsed JSON will be pushed to the result stream.
3. An options object indicating that the input is newline-delimited JSON (this is to process format B from the question; it's not required for format A).

Upon being called, bfj.match will parse JSON from the input stream depth-first, calling the predicate with each value to determine whether or not to push that item to the result stream. The predicate is passed three arguments:

1. The property key or array index (this will be undefined for top-level items).
2. The value itself.
3. The depth of the item in the JSON structure (zero for top-level items).

Of course, a more complex predicate can also be used as necessary according to requirements. You can also pass a string or a regular expression instead of a predicate function, if you want to perform simple matches against property keys.
