[python] Get random sample from list while maintaining ordering of items?

I have a sorted list, let say: (its not really just numbers, its a list of objects that are sorted with a complicated time consuming algorithm)

mylist = [ 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 ,9  , 10 ]

Is there some python function that will give me N of the items, but will keep the order?

Example:

randomList = getRandom(mylist,4)
# randomList = [ 3 , 6 ,7 , 9 ]
randomList = getRandom(mylist,4)
# randomList = [ 1 , 2 , 4 , 8 ]

etc...

This question is related to python list random sortedlist

The answer is


Following code will generate a random sample of size 4:

import random

sample_size = 4
sorted_sample = [
    mylist[i] for i in sorted(random.sample(range(len(mylist)), sample_size))
]

(note: with Python 2, better use xrange instead of range)

Explanation

random.sample(range(len(mylist)), sample_size)

generates a random sample of the indices of the original list.

These indices then get sorted to preserve the ordering of elements in the original list.

Finally, the list comprehension pulls out the actual elements from the original list, given the sampled indices.


random.sample implement it.

>>> random.sample([1, 2, 3, 4, 5],  3)   # Three samples without replacement
[4, 1, 5]

Maybe you can just generate the sample of indices and then collect the items from your list.

randIndex = random.sample(range(len(mylist)), sample_size)
randIndex.sort()
rand = [mylist[i] for i in randIndex]

Apparently random.sample was introduced in python 2.3

so for version under that, we can use shuffle (example for 4 items):

myRange =  range(0,len(mylist)) 
shuffle(myRange)
coupons = [ bestCoupons[i] for i in sorted(myRange[:4]) ]

Simple-to-code O(N + K*log(K)) way

Take a random sample without replacement of the indices, sort the indices, and take them from the original.

indices = random.sample(range(len(myList)), K)
[myList[i] for i in sorted(indices)]

Or more concisely:

[x[1] for x in sorted(random.sample(enumerate(myList),K))]

Optimized O(N)-time, O(1)-auxiliary-space way

You can alternatively use a math trick and iteratively go through myList from left to right, picking numbers with dynamically-changing probability (N-numbersPicked)/(total-numbersVisited). The advantage of this approach is that it's an O(N) algorithm since it doesn't involve sorting!

from __future__ import division

def orderedSampleWithoutReplacement(seq, k):
    if not 0<=k<=len(seq):
        raise ValueError('Required that 0 <= sample_size <= population_size')

    numbersPicked = 0
    for i,number in enumerate(seq):
        prob = (k-numbersPicked)/(len(seq)-i)
        if random.random() < prob:
            yield number
            numbersPicked += 1

Proof of concept and test that probabilities are correct:

Simulated with 1 trillion pseudorandom samples over the course of 5 hours:

>>> Counter(
        tuple(orderedSampleWithoutReplacement([0,1,2,3], 2))
        for _ in range(10**9)
    )
Counter({
    (0, 3): 166680161, 
    (1, 2): 166672608, 
    (0, 2): 166669915, 
    (2, 3): 166667390, 
    (1, 3): 166660630, 
    (0, 1): 166649296
})

Probabilities diverge from true probabilities by less a factor of 1.0001. Running this test again resulted in a different order meaning it isn't biased towards one ordering. Running the test with fewer samples for [0,1,2,3,4], k=3 and [0,1,2,3,4,5], k=4 had similar results.

edit: Not sure why people are voting up wrong comments or afraid to upvote... NO, there is nothing wrong with this method. =)

(Also a useful note from user tegan in the comments: If this is python2, you will want to use xrange, as usual, if you really care about extra space.)

edit: Proof: Considering the uniform distribution (without replacement) of picking a subset of k out of a population seq of size len(seq), we can consider a partition at an arbitrary point i into 'left' (0,1,...,i-1) and 'right' (i,i+1,...,len(seq)). Given that we picked numbersPicked from the left known subset, the remaining must come from the same uniform distribution on the right unknown subset, though the parameters are now different. In particular, the probability that seq[i] contains a chosen element is #remainingToChoose/#remainingToChooseFrom, or (k-numbersPicked)/(len(seq)-i), so we simulate that and recurse on the result. (This must terminate since if #remainingToChoose == #remainingToChooseFrom, then all remaining probabilities are 1.) This is similar to a probability tree that happens to be dynamically generated. Basically you can simulate a uniform probability distribution by conditioning on prior choices (as you grow the probability tree, you pick the probability of the current branch such that it is aposteriori the same as prior leaves, i.e. conditioned on prior choices; this will work because this probability is uniformly exactly N/k).

edit: Timothy Shields mentions Reservoir Sampling, which is the generalization of this method when len(seq) is unknown (such as with a generator expression). Specifically the one noted as "algorithm R" is O(N) and O(1) space if done in-place; it involves taking the first N element and slowly replacing them (a hint at an inductive proof is also given). There are also useful distributed variants and miscellaneous variants of reservoir sampling to be found on the wikipedia page.

edit: Here's another way to code it below in a more semantically obvious manner.

from __future__ import division
import random

def orderedSampleWithoutReplacement(seq, sampleSize):
    totalElems = len(seq)
    if not 0<=sampleSize<=totalElems:
        raise ValueError('Required that 0 <= sample_size <= population_size')

    picksRemaining = sampleSize
    for elemsSeen,element in enumerate(seq):
        elemsRemaining = totalElems - elemsSeen
        prob = picksRemaining/elemsRemaining
        if random.random() < prob:
            yield element
            picksRemaining -= 1

from collections import Counter         
Counter(
    tuple(orderedSampleWithoutReplacement([0,1,2,3], 2))
    for _ in range(10**5)

)


Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to list

Convert List to Pandas Dataframe Column Python find elements in one list that are not in the other Sorting a list with stream.sorted() in Java Python Loop: List Index Out of Range How to combine two lists in R How do I multiply each element in a list by a number? Save a list to a .txt file The most efficient way to remove first N elements in a list? TypeError: list indices must be integers or slices, not str Parse JSON String into List<string>

Examples related to random

How can I get a random number in Kotlin? scikit-learn random state in splitting dataset Random number between 0 and 1 in python In python, what is the difference between random.uniform() and random.random()? Generate random colors (RGB) Random state (Pseudo-random number) in Scikit learn How does one generate a random number in Apple's Swift language? How to generate a random string of a fixed length in Go? Generate 'n' unique random numbers within a range What does random.sample() method in python do?

Examples related to sortedlist

lists and arrays in VBA How to Sort Date in descending order From Arraylist Date in android? Get random sample from list while maintaining ordering of items? What's the difference between SortedList and SortedDictionary?