Select random lines from a file

Question

In a Bash script, I want to pick out N random lines from an input file and write them to another file. How can this be done?

User · Answer

Well, according to a comment on the shuf answer, he shuffled 78 000 000 000 lines in under a minute.

Challenge accepted...

EDIT: I beat my own record

powershuf did it in 0.047 seconds

$ time ./powershuf.py -n 10 --file lines_78000000000.txt > /dev/null 
./powershuf.py -n 10 --file lines_78000000000.txt > /dev/null  0.02s user 0.01s system 80% cpu 0.047 total

It is so fast because I don't read the whole file: I just move the file pointer 10 times and print the line after the pointer.

Gitlab Repo
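For anyone curious how the "move the file pointer" trick could look, here is a minimal sketch. It is not the actual powershuf source; the function name `sample_by_seeking` and its parameters are made up for illustration. It assumes a regular, seekable file, and the sample is not perfectly uniform: a line is more likely to be picked when the line before it is long, and the same line can come up twice.

```python
import os
import random

def sample_by_seeking(path, n, rng=random):
    """Return n lines, each found by jumping to a random byte offset."""
    size = os.path.getsize(path)
    if size == 0:
        return []
    lines = []
    with open(path, "rb") as f:
        for _ in range(n):
            # Jump somewhere in the file without reading what comes before.
            f.seek(rng.randrange(size))
            f.readline()             # skip the rest of the line under the pointer
            line = f.readline()      # the line after the pointer
            if not line:             # landed in the last line: wrap to the first
                f.seek(0)
                line = f.readline()
            lines.append(line)
    return lines
```

The cost depends on n and the line lengths, not on the file size, which is why a file with billions of lines makes no difference.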

Old attempt

First I needed a file of 78.000.000.000 lines:

seq 1 78 | xargs -n 1 -P 16 -I% seq 1 1000 | xargs -n 1 -P 16 -I% echo "" > lines_78000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000.txt > lines_78000000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000000.txt > lines_78000000000.txt

This gives me a file with 78 billion newlines ;-)

Now for the shuf part:

$ time shuf -n 10 lines_78000000000.txt










shuf -n 10 lines_78000000000.txt  2171.20s user 22.17s system 99% cpu 36:35.80 total

The bottleneck was CPU: shuf does not use multiple threads, so it pinned 1 core at 100% while the other 15 sat idle.

Python is what I regularly use so that's what I'll use to make this faster:

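#!/usr/bin/env python3
import random

with open("lines_78000000000.txt", "rb") as f:
    # Pass 1: count the lines by counting newlines in 64 KiB chunks.
    count = 0
    while True:
        buffer = f.read(65536)
        if not buffer:
            break
        count += buffer.count(b"\n")

    # Pass 2: rewind and print the lines at 10 random line numbers.
    wanted = set(random.sample(range(count), min(10, count)))
    f.seek(0)
    for number, line in enumerate(f):
        if number in wanted:
            print(line.decode(), end="")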

This got me just under a minute:

$ time ./shuf.py         










./shuf.py  42.57s user 16.19s system 98% cpu 59.752 total

I did this on a Lenovo X1 Extreme 2nd gen with the i9 and a Samsung NVMe, which gives me plenty of read and write speed.

I know it can get faster but I'll leave some room to give others a try.

Line counter source: Luther Blissett

User · Answer

seq 1 100 | python3 -c 'print(__import__("random").choice(__import__("sys").stdin.readlines()))'

User · Answer

# Function to sample N lines randomly from a file
# Parameter $1: Name of the original file
# Parameter $2: N lines to be sampled
rand_line_sampler() {
    N_t=$(awk '{print $1}' $1 | wc -l) # Number of total lines

    N_t_m_d=$(( $N_t - $2 - 1 )) # Number of total lines minus desired number of lines

    N_d_m_1=$(( $2 - 1 )) # Number of desired lines minus 1

    # vector to have the 0 (fail) with size of N_t_m_d
    echo '0' > vector_0.temp
    for i in $(seq 1 1 $N_t_m_d); do
            echo "0" >> vector_0.temp
    done

    # vector to have the 1 (success) with size of desired number of lines
    echo '1' > vector_1.temp
    for i in $(seq 1 1 $N_d_m_1); do
            echo "1" >> vector_1.temp
    done

    cat vector_1.temp vector_0.temp | shuf > rand_vector.temp

    paste -d" " rand_vector.temp $1 |
    awk '$1 != 0 {$1=""; print}' |
    sed 's/^ *//' > sampled_file.txt # file with the sampled lines

    rm vector_0.temp vector_1.temp rand_vector.temp
}

rand_line_sampler "parameter_1" "parameter_2"
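The idea behind the function above (a shuffled vector of ones and zeros, pasted next to the file, keeping only the lines flagged 1) can be sketched in a few lines of Python. This is only an illustration of the same idea, not a translation of the shell code; `sample_with_flags` is a hypothetical name, and it assumes the lines fit in memory.

```python
import random

def sample_with_flags(lines, n, rng=random):
    """Keep exactly n lines by shuffling a vector of n ones and len(lines) - n zeros."""
    total = len(lines)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the number of lines")
    flags = [1] * n + [0] * (total - n)   # 1 = keep the line, 0 = drop it
    rng.shuffle(flags)                    # random positions for the ones
    # Pair each flag with its line and keep the flagged lines.
    return [line for flag, line in zip(flags, lines) if flag]
```

Like the shell version, this returns exactly N lines and keeps them in their original file order.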

User · Answer

My preferred option is very fast. I sampled a tab-delimited data file with 13 columns, 23.1M rows, 2.0GB uncompressed.

# randomly sample 5% of lines in file
# including header row, exclude blank lines, new seed

time \
awk 'BEGIN {srand()}
     !/^$/ { if (rand() <= .05 || FNR==1) print > "data-sample.txt"}' data.txt

# awk  tsv004  3.76s user 1.46s system 91% cpu 5.716 total
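The same per-line coin flip can be sketched in Python, in case that is easier to read than the awk one-liner. The name `bernoulli_sample` and its parameters are hypothetical. Note that this kind of sampling does not return an exact N: the number of lines kept is itself random, roughly p times the number of non-blank lines, plus the header.

```python
import random

def bernoulli_sample(lines, p=0.05, rng=random):
    """Yield the first line, plus each later non-blank line with probability p."""
    for number, line in enumerate(lines, start=1):
        if line.rstrip("\n") == "":        # skip blank lines, like !/^$/ in awk
            continue
        if number == 1 or rng.random() <= p:
            yield line
```

Because it is a generator working line by line, it can be fed an open file object and never needs the whole file in memory.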

User · Answer

Sort the file randomly and pick the first 100 lines:

sort -R input | head -n 100 > output
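For comparison, the "put everything in random order, then take the head" approach looks roughly like this in Python. This is a sketch with a hypothetical function name, and it assumes the whole input fits in memory. It shuffles line positions, so it is not an exact model of how sort -R orders its input.

```python
import random

def shuffle_and_head(lines, n, rng=random):
    """Shuffle a copy of the lines and return the first n of them."""
    shuffled = list(lines)     # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    return shuffled[:n]
```

The whole input gets reordered just to keep a handful of lines, which is wasteful on big files.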

User · Answer

Use shuf with the -n option as shown below to get N random lines:

shuf -n N input > output
