[bash] How to split a large text file into smaller files with equal number of lines?

I've got a large (by number of lines) plain text file that I'd like to split into smaller files, also by number of lines. So if my file has around 2M lines, I'd like to split it up into 10 files that contain 200k lines, or 100 files that contain 20k lines (plus one file with the remainder; being evenly divisible doesn't matter).

I could do this fairly easily in Python but I'm wondering if there's any kind of ninja way to do this using bash and unix utils (as opposed to manually looping and counting / partitioning lines).

Tags: bash, file, unix

Answers:


Yes, there is a split command. It will split a file by lines or bytes.

$ split --help
Usage: split [OPTION]... [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'.  With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   use suffixes of length N (default 2)
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -d, --numeric-suffixes  use numeric suffixes instead of alphabetic
  -l, --lines=NUMBER      put NUMBER lines per output file
      --verbose           print a diagnostic just before each
                            output file is opened
      --help     display this help and exit
      --version  output version information and exit

SIZE may have a multiplier suffix:
b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024,
GB 1000*1000*1000, G 1024*1024*1024, and so on for T, P, E, Z, Y.
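
For example, the 2M-line file from the question could be split into 200,000-line pieces like this (a minimal sketch; mybigfile.txt and the chunk_ prefix are placeholder names, and -d/-a are the flags shown in the help above):

split -l 200000 mybigfile.txt                    # creates xaa, xab, xac, ...
split -l 200000 -d -a 3 mybigfile.txt chunk_     # creates chunk_000, chunk_001, ...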

HDFS getmerge small files, then split the result into a suitable size.

Splitting by bytes will break lines in the middle:

split -b 125m compact.file -d -a 3 compact_prefix

I wanted to getmerge and then split every output file into about 128 MB without breaking lines:

# Split into ~128 MB chunks; the size unit may be M or G, so test before use.
beginsize=`hdfs dfs -du -s -h /externaldata/$table_name/$date/ | awk '{ print $1}' `
sizeunit=`hdfs dfs -du -s -h /externaldata/$table_name/$date/ | awk '{ print $2}' `
if [ $sizeunit = "G" ];then
    res=$(printf "%.f" `echo "scale=5;$beginsize*8 "|bc`)    # 1 GB / 128 MB = 8 chunks per GB
else
    res=$(printf "%.f" `echo "scale=5;$beginsize/128 "|bc`)  # ceiling; ref: http://blog.csdn.net/naiveloafer/article/details/8783518
fi
echo $res
# Split into $res files with numeric suffixes.  ref: http://blog.csdn.net/microzone/article/details/52839598
compact_file_name=$compact_file"_"
echo "compact_file_name :"$compact_file_name
split -n l/$res $basedir/$compact_file -d -a 3 $basedir/${compact_file_name}
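
As a worked example with the numbers assumed above (128 MB target chunks): a 1.5 G input gives res = round(1.5 * 8) = 12, and a 640 M input gives res = round(640 / 128) = 5, so split -n l/$res then produces files of roughly 128 MB each without breaking any lines.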

Use split.

Split a file into fixed-size pieces; it creates output files containing consecutive sections of INPUT (standard input if none is given or INPUT is `-').

Syntax: split [options] [INPUT [PREFIX]]

http://ss64.com/bash/split.html


Split the file "file.txt" into files of 10,000 lines each:

split -l 10000 file.txt

How about the split command?

split -l 200000 mybigfile.txt

split (from GNU coreutils, since version 8.8 from 2010-12-22) includes the following parameter:

-n, --number=CHUNKS     generate CHUNKS output files; see explanation below

CHUNKS may be:
  N       split into N files based on size of input
  K/N     output Kth of N to stdout
  l/N     split into N files without splitting lines/records
  l/K/N   output Kth of N to stdout without splitting lines/records
  r/N     like 'l' but use round robin distribution
  r/K/N   likewise but only output Kth of N to stdout

Thus, split -n 4 input output. (the trailing dot is part of the prefix) will generate four files (output.a{a,b,c,d}) with the same number of bytes, but lines might be broken in the middle.

If we want to preserve full lines (i.e. split by lines), then this should work:

split -n l/4 input output.

Related answer: https://stackoverflow.com/a/19031247
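
The K/N and l/K/N forms write a single chunk to standard output instead of creating files, and r/N deals lines out round-robin. A couple of hedged examples (assuming GNU split >= 8.8, reusing the input/output. names from above):

split -n l/2/4 input             # print the 2nd of 4 line-aligned chunks to stdout
split -n r/4 input output.       # distribute lines round-robin into output.aa .. output.ad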


Use:

sed -n '1,100p' filename > output.txt

Here, 1 and 100 are the first and last line numbers of the range that gets captured into output.txt.
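
By itself this extracts a single chunk; a minimal sketch of a loop that writes consecutive 100-line ranges to numbered files (filename and the output_N.txt names are just placeholders):

chunk=100
total=$(wc -l < filename)
n=1
for start in $(seq 1 "$chunk" "$total"); do
    end=$((start + chunk - 1))
    sed -n "${start},${end}p" filename > "output_${n}.txt"
    n=$((n + 1))
done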


If you just want to split into files of x lines each, the answers above about split are fine. But I'm curious that no one paid attention to the requirements:

  • "without having to count them" -> using wc + cut
  • "having the remainder in extra file" -> split does by default

I can't do it without "wc + cut", but this is what I use:

split -l  $(expr `wc $filename | cut -d ' ' -f3` / $chunks) $filename

This can easily be added to your bashrc as a function, so you can invoke it by passing the filename and the number of chunks:

 split -l  $(expr `wc $1 | cut -d ' ' -f3` / $2) $1

If you want exactly x chunks with no remainder in an extra file, just adapt the formula by adding (chunks - 1) to the line count of each file. I use this approach because I usually want x files rather than x lines per file:

split -l  $(expr `wc $1 | cut -d ' ' -f3` / $2 + `expr $2 - 1`) $1

You can add that to a script and call it your "ninja way", because if nothing suits your needs, you can build it :-)
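
A hedged variant of the same idea: the wc | cut field index depends on how wc pads its columns, so reading the line count with wc -l and input redirection is less fragile. The splitinto name and the ceiling division are just one way to sketch it:

# Split FILE into at most CHUNKS files of whole lines (hypothetical helper).
splitinto() {
    local file=$1 chunks=$2
    local lines
    lines=$(wc -l < "$file")
    split -l $(( (lines + chunks - 1) / chunks )) "$file"
}

splitinto mybigfile.txt 10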


You can also use awk:

awk 'NR%200000==1{++c}{print > (c".txt")}' largefile
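
This writes 200,000-line chunks to 1.txt, 2.txt, and so on. A hedged variant with zero-padded names (part_000.txt and the n variable are just assumptions) that also closes each finished file, which helps when the number of chunks gets large:

awk -v n=200000 'NR % n == 1 { if (out) close(out); out = sprintf("part_%03d.txt", ++c) } { print > out }' largefile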
