How can I split a text file using PowerShell?

Question

I need to split a large (500 MB) text file (a log4net exception file) into manageable chunks; something like 100 files of 5 MB each would be fine. I would think this should be a walk in the park for PowerShell. How can I do it?

User · Accepted Answer

This is a somewhat easy task for PowerShell, complicated by the fact that the standard Get-Content cmdlet doesn't handle very large files too well. What I would suggest is to use the .NET StreamReader class to read the file line by line in your PowerShell script, and use the Add-Content cmdlet to write each line to a file with an ever-increasing index in the filename. Something like this:

    $upperBound = 50MB # calculated by Powershell
    $ext = "log"
    $rootName = "log_"

    $reader = new-object System.IO.StreamReader("C:\Exceptions.log")
    $count = 1
    $fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext)
    while(($line = $reader.ReadLine()) -ne $null)
    {
        Add-Content -path $fileName -value $line
        if((Get-ChildItem -path $fileName).Length -ge $upperBound)
        {
            ++$count
            $fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext)
        }
    }

    $reader.Close()
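Other answers on this page report that calling Add-Content once per line becomes very slow on large files. A minimal sketch of the same size-based idea, assuming a StreamWriter is kept open for each part instead, could look like this. The function name `Split-FileBySize`, its parameters, the output naming pattern and the UTF-8 encoding are all assumptions made for illustration; they are not part of the answer above.

```powershell
# Hypothetical helper: splits on line boundaries once a part reaches $UpperBound bytes.
function Split-FileBySize {
    param(
        [Parameter(Mandatory=$true)][string]$Path,
        [Parameter(Mandatory=$true)][string]$Destination,   # must be an existing directory
        [long]$UpperBound = 50MB
    )
    # .NET resolves relative paths against the process directory, so resolve them first
    $source  = (Resolve-Path -LiteralPath $Path).ProviderPath
    $destDir = (Resolve-Path -LiteralPath $Destination).ProviderPath
    $name = [IO.Path]::GetFileNameWithoutExtension($source)
    $ext  = [IO.Path]::GetExtension($source)
    $encoding = New-Object System.Text.UTF8Encoding($false)   # assumption: UTF-8, no BOM

    $reader = New-Object System.IO.StreamReader($source, $encoding)
    $writer = $null
    $count = 0
    $written = 0
    try {
        while ($null -ne ($line = $reader.ReadLine())) {
            if ($null -eq $writer) {
                # open the next part lazily, so no empty trailing file is created
                $count++
                $target = Join-Path $destDir ('{0}_{1}{2}' -f $name, $count, $ext)
                $writer = New-Object System.IO.StreamWriter($target, $false, $encoding)
                $written = 0
            }
            $writer.WriteLine($line)
            # track the size ourselves instead of asking the file system after every line
            $written += $encoding.GetByteCount($line) + $writer.NewLine.Length
            if ($written -ge $UpperBound) {
                $writer.Dispose()
                $writer = $null
            }
        }
    }
    finally {
        if ($writer) { $writer.Dispose() }
        $reader.Dispose()
    }
    $count   # number of parts written
}
```

It could be called as `Split-FileBySize -Path C:\Exceptions.log -Destination C:\Temp\parts -UpperBound 5MB`, where the destination folder is a made-up example. Each part ends on a complete line, so parts can exceed the bound by up to one line.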

User · Answer

My requirement was a bit different. I often work with comma-delimited and tab-delimited ASCII files where a single line is a single record of data. And they're really big, so I need to split them into manageable parts (whilst preserving the header row).

So, I reverted to my classic VBScript method and bashed together a small .vbs script that can be run on any Windows computer (it gets automatically executed by the WScript.exe script host engine on Windows).

The benefit of this method is that it uses Text Streams, so the underlying data isn't loaded into memory (or, at least, not all at once). The result is that it's exceptionally fast and it doesn't really need much memory to run. The test file I just split using this script on my i7 was about 1 GB in size, had about 12 million lines of text and was split into 25 part files (each with about 500k lines); the processing took about 2 minutes and it didn't go over 3 MB of memory used at any point.

The caveat here is that it relies on the text file having "lines" (meaning each record is delimited with a CRLF), as the Text Stream object uses the "ReadLine" function to process a single line at a time. But hey, if you're working with TSV or CSV files, it's perfect.

    Option Explicit

    Private Const INPUT_TEXT_FILE = "c:\bigtextfile.txt"
    Private Const REPEAT_HEADER_ROW = True
    Private Const LINES_PER_PART = 500000

    Dim oFileSystem, oInputFile, oOutputFile, iOutputFile, iLineCounter, sHeaderLine, sLine, sFileExt, sStart

    sStart = Now()

    ' Extension including the dot, e.g. ".txt"
    sFileExt = Right(INPUT_TEXT_FILE, Len(INPUT_TEXT_FILE) - InstrRev(INPUT_TEXT_FILE, ".") + 1)
    iLineCounter = 0
    iOutputFile = 1

    Set oFileSystem = CreateObject("Scripting.FileSystemObject")
    Set oInputFile = oFileSystem.OpenTextFile(INPUT_TEXT_FILE, 1, False)
    Set oOutputFile = oFileSystem.OpenTextFile(Replace(INPUT_TEXT_FILE, sFileExt, "_" & iOutputFile & sFileExt), 2, True)

    If REPEAT_HEADER_ROW Then
        iLineCounter = 1
        sHeaderLine = oInputFile.ReadLine()
        Call oOutputFile.WriteLine(sHeaderLine)
    End If

    Do While Not oInputFile.AtEndOfStream
        sLine = oInputFile.ReadLine()
        Call oOutputFile.WriteLine(sLine)
        iLineCounter = iLineCounter + 1
        If iLineCounter Mod LINES_PER_PART = 0 Then
            ' Start the next part file and repeat the header if requested
            iOutputFile = iOutputFile + 1
            Call oOutputFile.Close()
            Set oOutputFile = oFileSystem.OpenTextFile(Replace(INPUT_TEXT_FILE, sFileExt, "_" & iOutputFile & sFileExt), 2, True)
            If REPEAT_HEADER_ROW Then
                Call oOutputFile.WriteLine(sHeaderLine)
            End If
        End If
    Loop

    Call oInputFile.Close()
    Call oOutputFile.Close()
    Set oFileSystem = Nothing

    Call MsgBox("Done" & vbCrLf & "Lines Processed: " & iLineCounter & vbCrLf & "Part Files: " & iOutputFile & vbCrLf & "Start Time: " & sStart & vbCrLf & "Finish Time: " & Now())
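For readers who want the same header-preserving behaviour without leaving PowerShell, the streaming idea can be sketched roughly as follows. This is an illustration, not a port of the script above: the function name `Split-FileWithHeader`, its parameters and the `name_N.ext` output pattern are assumptions, and here `LinesPerPart` counts data rows only (the header is not counted).

```powershell
# Hypothetical helper: line-count split that repeats the first line in every part.
function Split-FileWithHeader {
    param(
        [Parameter(Mandatory=$true)][string]$Path,
        [Parameter(Mandatory=$true)][string]$Destination,   # must be an existing directory
        [int]$LinesPerPart = 500000
    )
    $source  = (Resolve-Path -LiteralPath $Path).ProviderPath
    $destDir = (Resolve-Path -LiteralPath $Destination).ProviderPath
    $name = [IO.Path]::GetFileNameWithoutExtension($source)
    $ext  = [IO.Path]::GetExtension($source)

    $reader = [IO.File]::OpenText($source)
    $writer = $null
    $part = 0
    $linesInPart = 0
    try {
        # the first line is treated as the header row
        $header = $reader.ReadLine()
        while ($null -ne ($line = $reader.ReadLine())) {
            if ($null -eq $writer) {
                # open the next part and start it with the header
                $part++
                $target = Join-Path $destDir ('{0}_{1}{2}' -f $name, $part, $ext)
                $writer = [IO.File]::CreateText($target)
                $writer.WriteLine($header)
                $linesInPart = 0
            }
            $writer.WriteLine($line)
            $linesInPart++
            if ($linesInPart -ge $LinesPerPart) {
                $writer.Dispose()
                $writer = $null
            }
        }
    }
    finally {
        if ($writer) { $writer.Dispose() }
        $reader.Dispose()
    }
    $part   # number of part files written
}
```

Like the VBScript version, only one line is held in memory at a time. A file that contains nothing but a header produces no part files in this sketch.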

User · Answer

I've made a little modification to split files based on the size of each part.

    ##############################################################################
    #.SYNOPSIS
    # Breaks a text file into multiple text files in a destination, where each
    # file has a maximum size.
    #
    #.DESCRIPTION
    # When working with files that have a header, it is often desirable to have
    # the header information repeated in all of the split files. Split-File
    # supports this functionality with the -rc (RepeatCount) parameter.
    #
    #.PARAMETER Path
    # Specifies the path to an item. Wildcards are permitted.
    #
    #.PARAMETER LiteralPath
    # Specifies the path to an item. Unlike Path, the value of LiteralPath is
    # used exactly as it is typed. No characters are interpreted as wildcards.
    # If the path includes escape characters, enclose it in single quotation marks.
    # Single quotation marks tell Windows PowerShell not to interpret any
    # characters as escape sequences.
    #
    #.PARAMETER Destination
    # (Or -d) The location in which to place the chunked output files.
    #
    #.PARAMETER Size
    # (Or -s) The maximum size of each file. Size must be expressed in MB.
    #
    #.PARAMETER RepeatCount
    # (Or -rc) Specifies the number of "header" lines from the input file that will
    # be repeated in each output file. Typically this is 0 or 1 but it can be any
    # number of lines.
    #
    #.EXAMPLE
    # Split-File bigfile.csv -s 20 -rc 1
    #
    #.LINK
    # Out-TempFile
    ##############################################################################
    function Split-File {

        [CmdletBinding(DefaultParameterSetName='Path')]
        param(

            [Parameter(ParameterSetName='Path', Position=1, Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]
            [String[]]$Path,

            [Alias("PSPath")]
            [Parameter(ParameterSetName='LiteralPath', Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
            [String[]]$LiteralPath,

            [Alias('s')]
            [Parameter(Position=2, Mandatory=$true)]
            [Int32]$Size,

            [Alias('d')]
            [Parameter(Position=3)]
            [String]$Destination='.',

            [Alias('rc')]
            [Parameter()]
            [Int32]$RepeatCount

        )

        process {

            # yeah! the cmdlet supports wildcards
            if ($LiteralPath) { $ResolveArgs = @{LiteralPath=$LiteralPath} }
            elseif ($Path) { $ResolveArgs = @{Path=$Path} }

            Resolve-Path @ResolveArgs | %{

                $InputName = [IO.Path]::GetFileNameWithoutExtension($_)
                $InputExt  = [IO.Path]::GetExtension($_)

                if ($RepeatCount) { $Header = Get-Content $_ -TotalCount:$RepeatCount }

                # get the input file in manageable chunks
                $Part = 1
                $buffer = ""
                Get-Content $_ -ReadCount:1 | %{

                    # buffer the line first, so the line that crosses the limit is kept
                    $buffer += $_ + "`r`n"

                    # test buffer size and dump data only if buffer is greater than size
                    if ($buffer.Length -gt ($Size * 1MB)) {

                        # make an output filename with a suffix
                        $OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt))

                        # The first part already starts with the header;
                        # on subsequent parts we have to write it ourselves
                        if ($RepeatCount -and $Part -gt 1) {
                            Set-Content $OutputFile $Header
                        }

                        # write this chunk to the output file
                        # (Add-Content appends the final line break itself)
                        Write-Host "Writing $OutputFile"
                        Add-Content $OutputFile $buffer.Substring(0, $buffer.Length - 2)
                        $Part += 1
                        $buffer = ""
                    }
                }

                # write whatever is left in the buffer as the last part
                if ($buffer) {
                    $OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt))
                    if ($RepeatCount -and $Part -gt 1) {
                        Set-Content $OutputFile $Header
                    }
                    Write-Host "Writing $OutputFile"
                    Add-Content $OutputFile $buffer.Substring(0, $buffer.Length - 2)
                }
            }
        }
    }

User · Answer

There's also this quick (and somewhat dirty) one-liner:

    $linecount=0; $i=0; Get-Content .\BIG_LOG_FILE.txt | %{ Add-Content OUT$i.log "$_"; $linecount++; if ($linecount -eq 3000) {$I++; $linecount=0 } }

You can tweak the number of lines per batch by changing the hard-coded 3000 value.

User · Answer

As the lines can be variable in logs, I thought it best to take a number-of-lines-per-file approach. The following code snippet processed a 4 million line log file in under 19 seconds (18.83.. seconds), splitting it into 500,000 line chunks:

    $sourceFile = "c:\myfolder\mylargeTextyFile.csv"
    $partNumber = 1
    $batchSize = 500000
    $pathAndFilename = "c:\myfolder\mylargeTextyFile part $partNumber file.csv"

    [System.Text.Encoding]$enc = [System.Text.Encoding]::GetEncoding(65001)  # utf8 this one

    $fs = New-Object System.IO.FileStream ($sourceFile, "OpenOrCreate", "Read", "ReadWrite", 8, "None")
    $streamIn = New-Object System.IO.StreamReader($fs, $enc)
    $streamout = new-object System.IO.StreamWriter $pathAndFilename

    $line = $streamIn.readline()
    $counter = 0
    while ($line -ne $null)
    {
        $streamout.writeline($line)
        $counter += 1
        if ($counter -eq $batchsize)
        {
            $partNumber += 1
            $counter = 0
            $streamOut.close()
            $pathAndFilename = "c:\myfolder\mylargeTextyFile part $partNumber file.csv"
            $streamout = new-object System.IO.StreamWriter $pathAndFilename
        }
        $line = $streamIn.readline()
    }
    $streamin.close()
    $streamout.close()

This can easily be turned into a function or script file with parameters to make it more versatile. It uses a StreamReader and StreamWriter to achieve its speed and tiny memory footprint.

User · Answer

Sounds like a job for the UNIX command split:

    split MyBigFile.csv

Just split my 55 GB csv file in 21k chunks in less than 10 minutes.

It's not native to PowerShell though, but comes with, for instance, the Git for Windows package: https://git-scm.com/download/win

User · Answer

Simple one-liner to split based on number of lines (100 in this case):

    $i=0; Get-Content .....log -ReadCount 100 | %{$i++; $_ | Out-File out_$i.txt}
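With -ReadCount, Get-Content sends the lines down the pipeline in arrays of up to that many lines, so each pipeline item can be written out as one part. A small sketch of the same idea wrapped in a function follows; the name `Split-FileByReadCount`, its parameters, the zero-padded file names and the explicit UTF-8 encoding are assumptions added for illustration. The encoding is set explicitly because Out-File in Windows PowerShell 5.1 writes UTF-16 by default, which doubles the size of ASCII logs.

```powershell
# Hypothetical wrapper around the -ReadCount approach.
function Split-FileByReadCount {
    param(
        [Parameter(Mandatory=$true)][string]$Path,
        [Parameter(Mandatory=$true)][string]$Destination,   # must be an existing directory
        [int]$LinesPerPart = 100
    )
    $i = 0
    Get-Content -LiteralPath $Path -ReadCount $LinesPerPart | ForEach-Object {
        # $_ is an array holding up to $LinesPerPart lines
        $i++
        $target = Join-Path $Destination ('out_{0:000}.txt' -f $i)
        Set-Content -LiteralPath $target -Value $_ -Encoding UTF8
    }
    $i   # number of parts written
}
```

The zero-padded index keeps the parts in order when sorted by name.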

User · Answer

Same as all the answers here, but using StreamReader/StreamWriter to split on new lines (line by line, instead of trying to read the whole file into memory at once). This approach can split big files in the fastest way I know of.

Note: I do very little error checking, so I can't guarantee it'll work smoothly for your case. It did for mine (1.7 GB TXT file of 4 million lines, split in 100,000 lines per file in 95 seconds).

    # split test
    $sw = new-object System.Diagnostics.Stopwatch
    $sw.Start()
    $filename = "C:\Users\Vincent\Desktop\test.txt"
    $rootName = "C:\Users\Vincent\Desktop\result"
    $ext = "txt"   # no leading dot: the format string below already adds one

    $linesperFile = 100000 # 100k
    $filecount = 1
    $reader = $null
    try {
        $reader = [io.file]::OpenText($filename)
        try {
            "Creating file number $filecount"
            $writer = [io.file]::CreateText("{0}{1}.{2}" -f ($rootName, $filecount.ToString("000"), $ext))
            $filecount++
            $linecount = 0

            while ($reader.EndOfStream -ne $true) {
                "Reading $linesperFile"
                while (($linecount -lt $linesperFile) -and ($reader.EndOfStream -ne $true)) {
                    $writer.WriteLine($reader.ReadLine())
                    $linecount++
                }

                if ($reader.EndOfStream -ne $true) {
                    "Closing file"
                    $writer.Dispose()

                    "Creating file number $filecount"
                    $writer = [io.file]::CreateText("{0}{1}.{2}" -f ($rootName, $filecount.ToString("000"), $ext))
                    $filecount++
                    $linecount = 0
                }
            }
        } finally {
            $writer.Dispose()
        }
    } finally {
        $reader.Dispose()
    }
    $sw.Stop()

    Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds"

Output splitting a 1.7 GB file:

    ...
    Creating file number 45
    Reading 100000
    Closing file
    Creating file number 46
    Reading 100000
    Closing file
    Creating file number 47
    Reading 100000
    Closing file
    Creating file number 48
    Reading 100000
    Split complete in  95.6308289 seconds

User · Answer

Do this:

FILE 1

    Get-Content C:\TEMP\DATA\split\splitme.txt | Select -First 5000 | out-File C:\temp\file1.txt -Encoding ASCII

FILE 2

    Get-Content C:\TEMP\DATA\split\splitme.txt | Select -Skip 5000 | Select -First 5000 | out-File C:\temp\file2.txt -Encoding ASCII

FILE 3

    Get-Content C:\TEMP\DATA\split\splitme.txt | Select -Skip 10000 | Select -First 5000 | out-File C:\temp\file3.txt -Encoding ASCII

etc.

User · Answer

Here is my solution to split a file called patch6.txt (about 32,000 lines) into separate files of 1000 lines each. It's not quick, but it does the job.

    $infile = "D:\Malcolm\Test\patch6.txt"
    $path = "D:\Malcolm\Test\"
    $lineCount = 0
    $fileCount = 1

    foreach ($computername in get-content $infile)
    {
        # ${path} keeps PowerShell from reading "$path_" as a variable name
        write $computername | out-file -Append "${path}_$fileCount.txt"
        $lineCount++

        # start a new file after every 1000 lines
        if ($lineCount -eq 1000)
        {
            $fileCount++
            $lineCount = 0
        }
    }

User · Answer

I found this question while trying to split multiple contacts in a single vCard VCF file into separate files. Here's what I did based on Lee's code. I had to look up how to create a new StreamReader object, and changed null to $null.

    $reader = new-object System.IO.StreamReader("C:\Contacts.vcf")
    $count = 1
    $filename = "C:\Contacts\{0}.vcf" -f ($count)

    while(($line = $reader.ReadLine()) -ne $null)
    {
        Add-Content -path $fileName -value $line

        if($line -eq "END:VCARD")
        {
            ++$count
            $filename = "C:\Contacts\{0}.vcf" -f ($count)
        }
    }

    $reader.Close()
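The same record-delimited split can be generalised to any end-of-record marker. The sketch below is only an illustration under a few assumptions: the function name `Split-FileOnMarker` and its parameters are hypothetical, a StreamWriter replaces Add-Content, and blank lines between records are skipped so that trailing empty lines do not produce an extra file.

```powershell
# Hypothetical helper: starts a new output file after each line equal to $EndMarker.
function Split-FileOnMarker {
    param(
        [Parameter(Mandatory=$true)][string]$Path,
        [Parameter(Mandatory=$true)][string]$Destination,   # must be an existing directory
        [string]$EndMarker = 'END:VCARD',
        [string]$Extension = '.vcf'
    )
    $source  = (Resolve-Path -LiteralPath $Path).ProviderPath
    $destDir = (Resolve-Path -LiteralPath $Destination).ProviderPath

    $reader = [IO.File]::OpenText($source)
    $writer = $null
    $count = 0
    try {
        while ($null -ne ($line = $reader.ReadLine())) {
            if ($null -eq $writer) {
                # ignore blank lines between records
                if ($line.Trim().Length -eq 0) { continue }
                $count++
                $writer = [IO.File]::CreateText((Join-Path $destDir ('{0}{1}' -f $count, $Extension)))
            }
            $writer.WriteLine($line)
            # -eq on strings is case-insensitive in PowerShell
            if ($line.Trim() -eq $EndMarker) {
                $writer.Dispose()
                $writer = $null
            }
        }
    }
    finally {
        if ($writer) { $writer.Dispose() }
        $reader.Dispose()
    }
    $count   # number of records written
}
```

For the vCard case above this might be called as `Split-FileOnMarker -Path C:\Contacts.vcf -Destination C:\Contacts`.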

User · Answer

A word of warning about some of the existing answers: they will run very slowly for very big files. For a 1.6 GB log file I gave up after a couple of hours, realising it would not finish before I returned to work the next day.

Two issues: the call to Add-Content opens, seeks and then closes the current destination file for every line in the source file. Reading a little of the source file each time and looking for the new lines will also slow things down, but my guess is that Add-Content is the main culprit.

The following variant produces slightly less pleasant output: it will split files in the middle of lines, but it splits my 1.6 GB log in less than a minute:

    $from = "C:\temp\large_log.txt"
    $rootName = "C:\temp\large_log_chunk"
    $ext = "txt"
    $upperBound = 100MB

    $fromFile = [io.file]::OpenRead($from)
    $buff = new-object byte[] $upperBound
    $count = $idx = 0
    try {
        do {
            "Reading $upperBound"
            $count = $fromFile.Read($buff, 0, $buff.Length)
            if ($count -gt 0) {
                $to = "{0}.{1}.{2}" -f ($rootName, $idx, $ext)
                $toFile = [io.file]::OpenWrite($to)
                try {
                    "Writing $count to $to"
                    $tofile.Write($buff, 0, $count)
                } finally {
                    $tofile.Close()
                }
            }
            $idx++
        } while ($count -gt 0)
    }
    finally {
        $fromFile.Close()
    }
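The script above allocates a buffer as large as one chunk. A variation, sketched here under assumptions of my own, copies each chunk through a small fixed buffer so that the chunk size is independent of memory use. The function name `Split-FileByBytes`, its parameters and the `name.NNN` output pattern are hypothetical; like the original, it cuts at byte offsets and so can split in the middle of a line.

```powershell
# Hypothetical helper: raw byte split using a small reusable buffer.
function Split-FileByBytes {
    param(
        [Parameter(Mandatory=$true)][string]$Path,
        [Parameter(Mandatory=$true)][string]$Destination,   # must be an existing directory
        [long]$ChunkSize = 100MB,
        [int]$BufferSize = 1MB
    )
    $source  = (Resolve-Path -LiteralPath $Path).ProviderPath
    $destDir = (Resolve-Path -LiteralPath $Destination).ProviderPath
    $name = [IO.Path]::GetFileName($source)

    $buffer = New-Object byte[] $BufferSize
    $inStream = [IO.File]::OpenRead($source)
    $outStream = $null
    $index = 0
    $writtenInChunk = 0
    try {
        while ($true) {
            # never read past the end of the current chunk
            $toRead = [int][Math]::Min([long]$buffer.Length, $ChunkSize - $writtenInChunk)
            $read = $inStream.Read($buffer, 0, $toRead)
            if ($read -le 0) { break }

            if ($null -eq $outStream) {
                $outStream = [IO.File]::Create((Join-Path $destDir ('{0}.{1:000}' -f $name, $index)))
                $index++
            }
            $outStream.Write($buffer, 0, $read)
            $writtenInChunk += $read

            if ($writtenInChunk -ge $ChunkSize) {
                $outStream.Dispose()
                $outStream = $null
                $writtenInChunk = 0
            }
        }
    }
    finally {
        if ($outStream) { $outStream.Dispose() }
        $inStream.Dispose()
    }
    $index   # number of chunks written
}
```

Because the chunks are exact byte ranges, concatenating them in order reproduces the original file.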

User · Answer

Many of these answers were too slow for my source files. My source files were SQL files between 10 MB and 800 MB that needed to be split into files of roughly equal line counts.

I found some of the previous answers which use Add-Content to be quite slow. Waiting many hours for a split to finish wasn't uncommon.

I didn't try Typhlosaurus's answer, but it looks to only do splits by file size, not line count.

The following has suited my purposes.

    $sw = new-object System.Diagnostics.Stopwatch
    $sw.Start()
    Write-Host "Reading source file..."
    $lines = [System.IO.File]::ReadAllLines("C:\Temp\SplitTest\source.sql")
    $totalLines = $lines.Length

    Write-Host "Total Lines :" $totalLines

    $skip = 0
    $count = 100000; # Number of lines per file

    # File counter, with sort friendly name
    $fileNumber = 1
    $fileNumberString = $filenumber.ToString("000")

    # -lt rather than -le: when $skip equals $totalLines there is nothing left to write
    while ($skip -lt $totalLines) {
        $upper = $skip + $count - 1
        if ($upper -gt ($lines.Length - 1)) {
            $upper = $lines.Length - 1
        }

        # Write the lines
        [System.IO.File]::WriteAllLines("C:\Temp\SplitTest\result$fileNumberString.txt", $lines[($skip..$upper)])

        # Increment counters
        $skip += $count
        $fileNumber++
        $fileNumberString = $filenumber.ToString("000")
    }

    $sw.Stop()

    Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds"

For a 54 MB file, I get the output:

    Reading source file...
    Total Lines : 910030
    Split complete in  1.7056578 seconds

I hope others looking for a simple, line-based splitting script that matches my requirements will find this useful.
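ReadAllLines holds the whole file in memory, which is fine at these sizes but may not be for multi-gigabyte inputs. A possible variation, assuming .NET 4 or later (PowerShell 3+) so that `[System.IO.File]::ReadLines` is available, enumerates the file lazily and keeps only one batch of lines in memory. The function name `Split-FileByLineBatch` and its parameters are hypothetical; the `resultNNN.txt` naming follows the script above.

```powershell
# Hypothetical helper: line-count split that buffers one batch at a time.
function Split-FileByLineBatch {
    param(
        [Parameter(Mandatory=$true)][string]$Path,
        [Parameter(Mandatory=$true)][string]$Destination,   # must be an existing directory
        [int]$Count = 100000                                # lines per file
    )
    $source  = (Resolve-Path -LiteralPath $Path).ProviderPath
    $destDir = (Resolve-Path -LiteralPath $Destination).ProviderPath

    $batch = New-Object 'System.Collections.Generic.List[string]'
    $fileNumber = 0
    # ReadLines yields one line at a time instead of loading the whole file
    foreach ($line in [System.IO.File]::ReadLines($source)) {
        $batch.Add($line)
        if ($batch.Count -ge $Count) {
            $fileNumber++
            $target = Join-Path $destDir ('result{0:000}.txt' -f $fileNumber)
            [System.IO.File]::WriteAllLines($target, $batch.ToArray())
            $batch.Clear()
        }
    }
    # write the final, possibly shorter, batch
    if ($batch.Count -gt 0) {
        $fileNumber++
        $target = Join-Path $destDir ('result{0:000}.txt' -f $fileNumber)
        [System.IO.File]::WriteAllLines($target, $batch.ToArray())
    }
    $fileNumber   # number of files written
}
```

Memory use is bounded by the batch size rather than by the size of the source file, at the cost of a little more code.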

User · Answer

I often need to do the same thing. The trick is getting the header repeated into each of the split chunks. I wrote the following cmdlet (PowerShell v2 CTP 3) and it does the trick.

    ##############################################################################
    #.SYNOPSIS
    # Breaks a text file into multiple text files in a destination, where each
    # file contains a maximum number of lines.
    #
    #.DESCRIPTION
    # When working with files that have a header, it is often desirable to have
    # the header information repeated in all of the split files. Split-File
    # supports this functionality with the -rc (RepeatCount) parameter.
    #
    #.PARAMETER Path
    # Specifies the path to an item. Wildcards are permitted.
    #
    #.PARAMETER LiteralPath
    # Specifies the path to an item. Unlike Path, the value of LiteralPath is
    # used exactly as it is typed. No characters are interpreted as wildcards.
    # If the path includes escape characters, enclose it in single quotation marks.
    # Single quotation marks tell Windows PowerShell not to interpret any
    # characters as escape sequences.
    #
    #.PARAMETER Destination
    # (Or -d) The location in which to place the chunked output files.
    #
    #.PARAMETER Count
    # (Or -c) The maximum number of lines in each file.
    #
    #.PARAMETER RepeatCount
    # (Or -rc) Specifies the number of "header" lines from the input file that will
    # be repeated in each output file. Typically this is 0 or 1 but it can be any
    # number of lines.
    #
    #.EXAMPLE
    # Split-File bigfile.csv 3000 -rc 1
    #
    #.LINK
    # Out-TempFile
    ##############################################################################
    function Split-File {

        [CmdletBinding(DefaultParameterSetName='Path')]
        param(

            [Parameter(ParameterSetName='Path', Position=1, Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]
            [String[]]$Path,

            [Alias("PSPath")]
            [Parameter(ParameterSetName='LiteralPath', Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
            [String[]]$LiteralPath,

            [Alias('c')]
            [Parameter(Position=2, Mandatory=$true)]
            [Int32]$Count,

            [Alias('d')]
            [Parameter(Position=3)]
            [String]$Destination='.',

            [Alias('rc')]
            [Parameter()]
            [Int32]$RepeatCount

        )

        process {

            # yeah! the cmdlet supports wildcards
            if ($LiteralPath) { $ResolveArgs = @{LiteralPath=$LiteralPath} }
            elseif ($Path) { $ResolveArgs = @{Path=$Path} }

            Resolve-Path @ResolveArgs | %{

                $InputName = [IO.Path]::GetFileNameWithoutExtension($_)
                $InputExt  = [IO.Path]::GetExtension($_)

                if ($RepeatCount) { $Header = Get-Content $_ -TotalCount:$RepeatCount }

                # get the input file in manageable chunks
                $Part = 1
                Get-Content $_ -ReadCount:$Count | %{

                    # make an output filename with a suffix
                    $OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt))

                    # In the first iteration the header will be
                    # copied to the output file as usual
                    # on subsequent iterations we have to do it
                    if ($RepeatCount -and $Part -gt 1) {
                        Set-Content $OutputFile $Header
                    }

                    # write this chunk to the output file
                    Write-Host "Writing $OutputFile"
                    Add-Content $OutputFile $_

                    $Part += 1
                }
            }
        }
    }
