[bash] Select random lines from a file

In a Bash script, I want to pick out N random lines from input file and output to another file.

How can this be done?

This question is related to bash shell random text-processing

The answer is


seq 1 100 | python3 -c 'print(__import__("random").choice(__import__("sys").stdin.readlines()))'

# Function to sample N lines randomly from a file
# Parameter $1: Name of the original file
# Parameter $2: N lines to be sampled 
rand_line_sampler() {
    N_t=$(awk '{print $1}' $1 | wc -l) # Number of total lines

    N_t_m_d=$(( $N_t - $2 - 1 )) # Number oftotal lines minus desired number of lines

    N_d_m_1=$(( $2 - 1)) # Number of desired lines minus 1

    # vector to have the 0 (fail) with size of N_t_m_d 
    echo '0' > vector_0.temp
    for i in $(seq 1 1 $N_t_m_d); do
            echo "0" >> vector_0.temp
    done

    # vector to have the 1 (success) with size of desired number of lines
    echo '1' > vector_1.temp
    for i in $(seq 1 1 $N_d_m_1); do
            echo "1" >> vector_1.temp
    done

    cat vector_1.temp vector_0.temp | shuf > rand_vector.temp

    paste -d" " rand_vector.temp $1 |
    awk '$1 != 0 {$1=""; print}' |
    sed 's/^ *//' > sampled_file.txt # file with the sampled lines

    rm vector_0.temp vector_1.temp rand_vector.temp
}

rand_line_sampler "parameter_1" "parameter_2"

Sort the file randomly and pick first 100 lines:

$ sort -R input | head -n 100 >output

Well According to a comment on the shuf answer he shuffed 78 000 000 000 lines in under a minute.

Challenge accepted...

EDIT: I beat my own record

powershuf did it in 0.047 seconds

$ time ./powershuf.py -n 10 --file lines_78000000000.txt > /dev/null 
./powershuf.py -n 10 --file lines_78000000000.txt > /dev/null  0.02s user 0.01s system 80% cpu 0.047 total

The reason it is so fast, well I don't read the whole file and just move the file pointer 10 times and print the line after the pointer.

Gitlab Repo

Old attempt

First I needed a file of 78.000.000.000 lines:

seq 1 78 | xargs -n 1 -P 16 -I% seq 1 1000 | xargs -n 1 -P 16 -I% echo "" > lines_78000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000.txt > lines_78000000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000000.txt > lines_78000000000.txt

This gives me a a file with 78 Billion newlines ;-)

Now for the shuf part:

$ time shuf -n 10 lines_78000000000.txt










shuf -n 10 lines_78000000000.txt  2171.20s user 22.17s system 99% cpu 36:35.80 total

The bottleneck was CPU and not using multiple threads, it pinned 1 core at 100% the other 15 were not used.

Python is what I regularly use so that's what I'll use to make this faster:

#!/bin/python3
import random
f = open("lines_78000000000.txt", "rt")
count = 0
while 1:
  buffer = f.read(65536)
  if not buffer: break
  count += buffer.count('\n')

for i in range(10):
  f.readline(random.randint(1, count))

This got me just under a minute:

$ time ./shuf.py         










./shuf.py  42.57s user 16.19s system 98% cpu 59.752 total

I did this on a Lenovo X1 extreme 2nd gen with the i9 and Samsung NVMe which gives me plenty read and write speed.

I know it can get faster but I'll leave some room to give others a try.

Line counter source: Luther Blissett


My preferred option is very fast, I sampled a tab-delimited data file with 13 columns, 23.1M rows, 2.0GB uncompressed.

# randomly sample select 5% of lines in file
# including header row, exclude blank lines, new seed

time \
awk 'BEGIN  {srand()} 
     !/^$/  { if (rand() <= .05 || FNR==1) print > "data-sample.txt"}' data.txt

# awk  tsv004  3.76s user 1.46s system 91% cpu 5.716 total

Use shuf with the -n option as shown below, to get N random lines:

shuf -n N input > output

Examples related to bash

Comparing a variable with a string python not working when redirecting from bash script Zipping a file in bash fails How do I prevent Conda from activating the base environment by default? Get first line of a shell command's output Fixing a systemd service 203/EXEC failure (no such file or directory) /bin/sh: apt-get: not found VSCode Change Default Terminal Run bash command on jenkins pipeline How to check if the docker engine and a docker container are running? How to switch Python versions in Terminal?

Examples related to shell

Comparing a variable with a string python not working when redirecting from bash script Get first line of a shell command's output How to run shell script file using nodejs? Run bash command on jenkins pipeline Way to create multiline comments in Bash? How to do multiline shell script in Ansible How to check if a file exists in a shell script How to check if an environment variable exists and get its value? Curl to return http status code along with the response docker entrypoint running bash script gets "permission denied"

Examples related to random

How can I get a random number in Kotlin? scikit-learn random state in splitting dataset Random number between 0 and 1 in python In python, what is the difference between random.uniform() and random.random()? Generate random colors (RGB) Random state (Pseudo-random number) in Scikit learn How does one generate a random number in Apple's Swift language? How to generate a random string of a fixed length in Go? Generate 'n' unique random numbers within a range What does random.sample() method in python do?

Examples related to text-processing

Using multiple delimiters in awk Select random lines from a file How to add a new line of text to an existing file in Java? Add a prefix string to beginning of each line Remove empty lines in a text file via grep How to convert all text to lowercase in Vim How to replace ${} placeholders in a text file? How to use sed to replace only the first occurrence in a file? How can I extract a predetermined range of lines from a text file on Unix?