Parsing huge logfiles in Node.js - read in line-by-line

Question

I need to do some parsing of large (5-10 GB) logfiles in JavaScript/Node.js (I'm using Cube).

The logline looks something like:

    10:00:43.343423 I'm a friendly log message. There are 5 cats, and 7 dogs. We are in state "SUCCESS".

We need to read each line, do some parsing (e.g. strip out 5, 7 and SUCCESS), then pump this data into Cube (https://github.com/square/cube) using their JS client.

Firstly, what is the canonical way in Node to read in a file, line by line?

It seems to be a fairly common question online:

- http://www.quora.com/What-is-the-best-way-to-read-a-file-line-by-line-in-node-js
- Read a file one line at a time in node.js?

A lot of the answers seem to point to a bunch of third-party modules:

- https://github.com/nickewing/line-reader
- https://github.com/jahewson/node-byline
- https://github.com/pkrumins/node-lazy
- https://github.com/Gagle/Node-BufferedReader

However, this seems like a fairly basic task. Surely there's a simple way within the stdlib to read in a text file, line by line?

Secondly, I then need to process each line (e.g. convert the timestamp into a Date object, and extract useful fields).

What's the best way to do this, maximising throughput? Is there some way that won't block on either reading in each line, or on sending it to Cube?

Thirdly, I'm guessing that using string splits and the JS equivalent of contains (indexOf != -1?) will be a lot faster than regexes? Has anybody had much experience in parsing massive amounts of text data in Node.js?

Cheers, Victor

User · Answer

node-byline uses streams, so I would prefer that one for your huge files.

For your date conversions I would use moment.js.

For maximising your throughput you could think about using a software cluster. There are some nice modules which wrap the node-native cluster module quite well. I like cluster-master from isaacs. E.g. you could create a cluster of x workers which each process a file.

For benchmarking splits vs regexes use benchmark.js. I haven't tested it yet. benchmark.js is available as a node module.
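For a first rough look at splits/indexOf versus regexes without installing anything, something like the sketch below may help. It uses only `process.hrtime.bigint()` from Node itself, so the numbers are much cruder than what benchmark.js would report. The sample line is modelled on the one in the question (its exact punctuation is an assumption), and the names `parseWithRegex`, `parseWithIndexOf` and `timeIt` are made up for this example.

```javascript
// Sample line modelled on the question; the exact punctuation is an assumption.
const sampleLine =
  `10:00:43.343423 I'm a friendly log message. There are 5 cats, and 7 dogs. We are in state "SUCCESS".`;

const LINE_RE = /^(\S+) .*There are (\d+) cats, and (\d+) dogs\. We are in state "([^"]+)"/;

// Variant 1: one regular expression with capture groups.
function parseWithRegex(line) {
  const m = LINE_RE.exec(line);
  if (m === null) return null;
  return { time: m[1], cats: Number(m[2]), dogs: Number(m[3]), state: m[4] };
}

// Variant 2: plain indexOf/slice, walking left to right through the line.
function parseWithIndexOf(line) {
  const tsEnd = line.indexOf(' ');
  const catsKey = line.indexOf('There are ', tsEnd);
  const catsEnd = line.indexOf(' cats', catsKey);
  const dogsKey = line.indexOf('and ', catsEnd);
  const dogsEnd = line.indexOf(' dogs', dogsKey);
  const quoteOpen = line.indexOf('"', dogsEnd);
  const quoteClose = line.indexOf('"', quoteOpen + 1);
  // Any missing marker means the line does not have the expected shape.
  if ([tsEnd, catsKey, catsEnd, dogsKey, dogsEnd, quoteOpen, quoteClose].includes(-1)) {
    return null;
  }
  return {
    time: line.slice(0, tsEnd),
    cats: Number(line.slice(catsKey + 'There are '.length, catsEnd)),
    dogs: Number(line.slice(dogsKey + 'and '.length, dogsEnd)),
    state: line.slice(quoteOpen + 1, quoteClose),
  };
}

// Crude timer: runs `parse` on the same line `iterations` times, returns milliseconds.
function timeIt(parse, line, iterations) {
  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) parse(line);
  return Number(process.hrtime.bigint() - start) / 1e6;
}

console.log('regex   ms:', timeIt(parseWithRegex, sampleLine, 200000));
console.log('indexOf ms:', timeIt(parseWithIndexOf, sampleLine, 200000));
```

Which variant wins depends on the Node version and on the real shape of your lines, so it is worth running this against a few thousand lines from the actual log before deciding.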

User · Answer

    import * as csv from 'fast-csv';
    import * as fs from 'fs';

    interface Row {
      [s: string]: string;
    }

    type RowCallBack = (data: Row, index: number) => object;

    export class CSVReader {
      protected file: string;
      protected csvOptions = {
        delimiter: ',',
        headers: true,
        ignoreEmpty: true,
        trim: true
      };

      constructor(file: string, csvOptions = {}) {
        if (!fs.existsSync(file)) {
          throw new Error(`File ${file} not found.`);
        }
        this.file = file;
        this.csvOptions = Object.assign({}, this.csvOptions, csvOptions);
      }

      public read(callback: RowCallBack): Promise<Array<object>> {
        return new Promise<Array<object>>((resolve, reject) => {
          const readStream = fs.createReadStream(this.file);
          const results: Array<any> = [];
          let index = 0;
          const csvStream = csv.parse(this.csvOptions)
            .on('data', async (data: Row) => {
              index++;
              results.push(await callback(data, index));
            })
            .on('error', (err: Error) => {
              console.error(err.message);
              // Reject instead of throwing, so the caller's await sees the error.
              reject(err);
            })
            .on('end', () => {
              resolve(results);
            });
          readStream.pipe(csvStream);
        });
      }
    }

Usage:

    import { CSVReader } from '../src/helpers/CSVReader';

    (async () => {
      const reader = new CSVReader('./database/migrations/csv/users.csv');
      const users = await reader.read(async data => {
        return {
          username: data.username,
          name: data.name,
          email: data.email,
          cellPhone: data.cell_phone,
          homePhone: data.home_phone,
          roleId: data.role_id,
          description: data.description,
          state: data.state,
        };
      });
      console.log(users);
    })();

User · Answer

I used https://www.npmjs.com/package/line-by-line for reading more than 1,000,000 lines from a text file. In this case, RAM usage was about 50-60 MB.

    const LineByLineReader = require('line-by-line'),
        lr = new LineByLineReader('big_file.txt');

    lr.on('error', function (err) {
        // 'err' contains error object
    });

    lr.on('line', function (line) {
        // pause emitting of lines...
        lr.pause();

        // ...do your asynchronous line processing...
        setTimeout(function () {
            // ...and continue emitting lines.
            lr.resume();
        }, 100);
    });

    lr.on('end', function () {
        // All lines are read, file is closed now.
    });

User · Answer

You can use the inbuilt readline package (see the docs). I use stream to create a new output stream.

    var fs = require('fs'),
        readline = require('readline'),
        stream = require('stream');

    var instream = fs.createReadStream('/path/to/file');
    var outstream = new stream;
    outstream.readable = true;
    outstream.writable = true;

    var rl = readline.createInterface({
        input: instream,
        output: outstream,
        terminal: false
    });

    rl.on('line', function (line) {
        console.log(line);
        // Do your stuff ...
        // Then write to outstream (cubestuff stands for whatever you send to Cube)
        rl.write(cubestuff);
    });

Large files will take some time to process. Do tell if it works.

User · Answer

I had the same problem. After comparing several modules that seem to have this feature, I decided to do it myself. It's simpler than I thought.

gist: https://gist.github.com/deemstone/8279565

    var fetchBlock = lineByline(filepath, onEnd);
    fetchBlock(function (lines, start) { ... });  // lines {array}, start {int}: line number of lines[0]

It keeps the opened file in a closure. The returned fetchBlock() fetches a block from the file and splits it into an array of lines (it also handles the partial segment left over from the previous fetch).

I've set the block size to 1024 for each read operation. This may have bugs, but the code logic is obvious. Try it yourself.

User · Answer

I really liked @gerard's answer, which actually deserves to be the correct answer here. I made some improvements:

- Code is in a class (modular)
- Parsing is included
- The ability to resume is given to the outside, in case an asynchronous job is chained to reading the CSV, like inserting into a DB or an HTTP request
- Reading in chunk/batch sizes that the user can declare. I took care of encoding in the stream too, in case you have files in a different encoding.

Here's the code:

    'use strict'

    const fs = require('fs'),
        util = require('util'),
        stream = require('stream'),
        es = require('event-stream'),
        parse = require('csv-parse'),
        iconv = require('iconv-lite');

    class CSVReader {
      constructor(filename, batchSize, columns) {
        this.reader = fs.createReadStream(filename).pipe(iconv.decodeStream('utf8'))
        this.batchSize = batchSize || 1000
        this.lineNumber = 0
        this.data = []
        this.parseOptions = {delimiter: '\t', columns: true, escape: '/', relax: true}
      }

      read(callback) {
        this.reader
          .pipe(es.split())
          .pipe(es.mapSync(line => {
            ++this.lineNumber

            parse(line, this.parseOptions, (err, d) => {
              this.data.push(d[0])
            })

            // hand a full batch to the caller
            if (this.lineNumber % this.batchSize === 0) {
              callback(this.data)
            }
          })
          .on('error', function () {
              console.log('Error while reading file.')
          })
          .on('end', function () {
              console.log('Read entire file.')
          }))
      }

      continue() {
        this.data = []
        this.reader.resume()
      }
    }

    module.exports = CSVReader

So basically, here is how you will use it:

    let reader = new CSVReader('path_to_file.csv')
    reader.read(() => reader.continue())

I tested this with a 35 GB CSV file and it worked for me, and that's why I chose to build it on @gerard's answer. Feedback is welcome.
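If you only need the batching part and would rather not pull in extra modules, the same idea can be sketched with the built-in readline module and an async generator. This is a different technique from the event-stream/csv-parse class above, not a rewrite of it: it yields arrays of raw lines and does no CSV parsing. It assumes Node.js 12 or later, the name `readInBatches` is made up for this example, and `insertIntoDb` in the usage note is a hypothetical function.

```javascript
const fs = require('fs');
const readline = require('readline');

// Yields arrays of up to `batchSize` raw lines from `filePath`.
// The consumer decides when to ask for the next batch, e.g. after a DB insert.
async function* readInBatches(filePath, batchSize = 1000) {
  const rl = readline.createInterface({
    input: fs.createReadStream(filePath),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });

  let batch = [];
  for await (const line of rl) {
    batch.push(line);
    if (batch.length === batchSize) {
      yield batch;
      batch = [];
    }
  }
  // Final, possibly smaller, batch.
  if (batch.length > 0) yield batch;
}

module.exports = { readInBatches };
```

Usage would look roughly like `for await (const batch of readInBatches('path_to_file.csv', 500)) { await insertIntoDb(batch); }` inside an async function, so the next batch is only requested once the previous one has been handled.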

User · Answer

The Node.js documentation offers a very elegant example using the readline module.

Example: Read File Stream Line-by-Line

    const fs = require('fs');
    const readline = require('readline');

    const rl = readline.createInterface({
        input: fs.createReadStream('sample.txt'),
        crlfDelay: Infinity
    });

    rl.on('line', (line) => {
        console.log(`Line from file: ${line}`);
    });

Note: we use the crlfDelay option to recognize all instances of CR LF ('\r\n') as a single line break.

User · Answer

Based on this question's answer I implemented a class you can use to read a file synchronously line-by-line with fs.readSync(). You can make this "pause" and "resume" by using a Q promise (jQuery seems to require a DOM, so it can't be run with nodejs):

    var fs = require('fs');
    var Q = require('q');

    var lr = new LineReader(filenameToLoad);
    lr.open();

    var promise;
    // Declared as functions so they are defined before the first workOnLine() call.
    function workOnLine() {
        var line = lr.readNextLine();
        if (line === undefined) return; // no more lines
        promise = complexLineTransformation(line).then(
            function () { console.log('ok'); workOnLine(); },
            function () { console.log('error'); }
        );
    }
    workOnLine();

    function complexLineTransformation(line) {
        var deferred = Q.defer();
        // ... async call goes here, in callback: deferred.resolve('done ok'); or deferred.reject(new Error(error));
        return deferred.promise;
    }

    function LineReader(filename) {
        this.moreLinesAvailable = true;
        this.fd = undefined;
        this.bufferSize = 1024 * 1024;
        this.buffer = Buffer.alloc(this.bufferSize); // new Buffer() is deprecated
        this.leftOver = '';

        this.read = undefined;
        this.idxStart = undefined;
        this.idx = undefined;

        this.lineNumber = 0;

        this._bundleOfLines = [];

        this.open = function () {
            this.fd = fs.openSync(filename, 'r');
        };

        this.readNextLine = function () {
            if (this._bundleOfLines.length === 0) {
                this._readNextBundleOfLines();
            }
            this.lineNumber++;
            var lineToReturn = this._bundleOfLines[0];
            this._bundleOfLines.splice(0, 1); // remove first element (pos, howmany)
            return lineToReturn;
        };

        this.getLineNumber = function () {
            return this.lineNumber;
        };

        this._readNextBundleOfLines = function () {
            var line = "";
            while ((this.read = fs.readSync(this.fd, this.buffer, 0, this.bufferSize, null)) !== 0) { // read next bytes until end of file
                this.leftOver += this.buffer.toString('utf8', 0, this.read); // append to leftOver
                this.idxStart = 0;
                while ((this.idx = this.leftOver.indexOf("\n", this.idxStart)) !== -1) { // as long as there is a newline-char in leftOver
                    line = this.leftOver.substring(this.idxStart, this.idx);
                    this._bundleOfLines.push(line);
                    this.idxStart = this.idx + 1;
                }
                this.leftOver = this.leftOver.substring(this.idxStart);
                if (line !== "") {
                    break;
                }
            }
        };
    }

User · Answer

I have made a node module to read large files (text or JSON) asynchronously. Tested on large files.

    var fs = require('fs'),
        util = require('util'),
        stream = require('stream'),
        es = require('event-stream');

    module.exports = FileReader;

    function FileReader() {

    }

    FileReader.prototype.read = function (pathToFile, callback) {
        var returnTxt = '';
        var s = fs.createReadStream(pathToFile)
            .pipe(es.split())
            .pipe(es.mapSync(function (line) {

                // pause the readstream
                s.pause();

                // console.log('reading line: ' + line);
                returnTxt += line;

                // resume the readstream, possibly from a callback
                s.resume();
            })
            .on('error', function () {
                console.log('Error while reading file.');
            })
            .on('end', function () {
                console.log('Read entire file.');
                callback(returnTxt);
            })
        );
    };

    FileReader.prototype.readJSON = function (pathToFile, callback) {
        this.read(pathToFile, function (txt) {
            var parsed;
            // The try/catch sits inside the callback because parsing happens
            // later, when the stream ends, not while this.read() is being called.
            try {
                parsed = JSON.parse(txt);
            } catch (err) {
                throw new Error('json file is not valid! ' + err.stack);
            }
            callback(parsed);
        });
    };

Just save the file as file-reader.js, and use it like this:

    var FileReader = require('./file-reader');
    var fileReader = new FileReader();
    fileReader.readJSON(__dirname + '/largeFile.json', function (jsonObj) { /* callback logic here */ });

User · Answer

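Apart from reading the big file line by line, you can also read it chunk by chunk. For more, refer to this article.

    var fs = require('fs');

    var offset = 0;
    var chunkSize = 2048;
    var chunkBuffer = Buffer.alloc(chunkSize); // new Buffer() is deprecated
    var fp = fs.openSync('filepath', 'r');
    var bytesRead = 0;
    var lines = [];
    while ((bytesRead = fs.readSync(fp, chunkBuffer, 0, chunkSize, offset))) {
        offset += bytesRead;
        var str = chunkBuffer.slice(0, bytesRead).toString();
        var arr = str.split('\n');

        if (bytesRead === chunkSize) {
            // the last item of arr may not be a full line, leave it to the next chunk
            // (assumes every line is shorter than chunkSize and that a chunk
            // boundary never splits a multi-byte character)
            offset -= Buffer.byteLength(arr.pop());
        }
        lines.push(arr);
    }
    fs.closeSync(fp);
    console.log(lines);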
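A variation on the chunk-by-chunk idea is to keep the incomplete last line in a variable instead of re-reading it from disk, and to let `StringDecoder` deal with multi-byte characters that straddle two chunks. The following is only a sketch of that approach, not the method from the article mentioned above. The name `readLinesSync` and the 64 KB default chunk size are choices made for this example.

```javascript
const fs = require('fs');
const { StringDecoder } = require('string_decoder');

// Synchronously reads `filePath` in chunks and calls `onLine` for every line.
// Lines are split on '\n' only; a trailing '\r' is left in place.
function readLinesSync(filePath, onLine, chunkSize = 64 * 1024) {
  const fd = fs.openSync(filePath, 'r');
  const buffer = Buffer.alloc(chunkSize);
  const decoder = new StringDecoder('utf8');
  let leftover = ''; // incomplete last line carried over from the previous chunk
  let bytesRead;

  try {
    // position = null: read from the current file position, which advances on its own.
    while ((bytesRead = fs.readSync(fd, buffer, 0, chunkSize, null)) > 0) {
      // decoder.write() holds back bytes of a character that is cut off at the chunk end.
      const text = leftover + decoder.write(buffer.subarray(0, bytesRead));
      const parts = text.split('\n');
      leftover = parts.pop();
      for (const line of parts) onLine(line);
    }
    leftover += decoder.end();
    if (leftover.length > 0) onLine(leftover); // last line without trailing newline
  } finally {
    fs.closeSync(fd);
  }
}

module.exports = { readLinesSync };
```

Because the leftover simply grows until a newline shows up, a line longer than the chunk size is still returned in one piece. The trade-off is that the whole read is synchronous, so it blocks the event loop while it runs.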

User · Answer

I searched for a solution to parse very large files (GBs) line by line using a stream. All the third-party libraries and examples did not suit my needs, since they either did not process the files line by line (like 1, 2, 3, 4 ...) or read the entire file into memory.

The following solution can parse very large files line by line using stream & pipe. For testing I used a 2.1 GB file with 17,000,000 records. RAM usage did not exceed 60 MB.

First, install the event-stream package:

    npm install event-stream

Then:

    var fs = require('fs'),
        es = require('event-stream');

    var lineNr = 0;

    var s = fs.createReadStream('very-large-file.csv')
        .pipe(es.split())
        .pipe(es.mapSync(function (line) {

            // pause the readstream
            s.pause();

            lineNr += 1;

            // process line here and call s.resume() when rdy
            // function below was for logging memory usage (my own helper, not shown)
            logMemoryUsage(lineNr);

            // resume the readstream, possibly from a callback
            s.resume();
        })
        .on('error', function (err) {
            console.log('Error while reading file.', err);
        })
        .on('end', function () {
            console.log('Read entire file.');
        })
    );

Please let me know how it goes.
