[algorithm] When is each sorting algorithm used?

What are the use cases when a particular sorting algorithm is preferred over others - merge sort vs QuickSort vs heapsort vs 'intro sort', etc?

Is there a recommended guide in using them based on the size, type of data structure, available memory and cache, and CPU performance?

This question is related to algorithm sorting

The answer is


The Wikipedia page on sorting algorithms has a great comparison chart.

http://en.wikipedia.org/wiki/Sorting_algorithm#Comparison_of_algorithms


First, a definition, since it's pretty important: A stable sort is one that's guaranteed not to reorder elements with identical keys.

Recommendations:

Quick sort: When you don't need a stable sort and average case performance matters more than worst case performance. A quick sort is O(N log N) on average, O(N^2) in the worst case. A good implementation uses O(log N) auxiliary storage in the form of stack space for recursion.

Merge sort: When you need a stable, O(N log N) sort, this is about your only option. The only downsides to it are that it uses O(N) auxiliary space and has a slightly larger constant than a quick sort. There are some in-place merge sorts, but AFAIK they are all either not stable or worse than O(N log N). Even the O(N log N) in place sorts have so much larger a constant than the plain old merge sort that they're more theoretical curiosities than useful algorithms.

Heap sort: When you don't need a stable sort and you care more about worst case performance than average case performance. It's guaranteed to be O(N log N), and uses O(1) auxiliary space, meaning that you won't unexpectedly run out of heap or stack space on very large inputs.

Introsort: This is a quick sort that switches to a heap sort after a certain recursion depth to get around quick sort's O(N^2) worst case. It's almost always better than a plain old quick sort, since you get the average case of a quick sort, with guaranteed O(N log N) performance. Probably the only reason to use a heap sort instead of this is in severely memory constrained systems where O(log N) stack space is practically significant.

Insertion sort: When N is guaranteed to be small, including as the base case of a quick sort or merge sort. While this is O(N^2), it has a very small constant and is a stable sort.

Bubble sort, selection sort: When you're doing something quick and dirty and for some reason you can't just use the standard library's sorting algorithm. The only advantage these have over insertion sort is being slightly easier to implement.


Non-comparison sorts: Under some fairly limited conditions it's possible to break the O(N log N) barrier and sort in O(N). Here are some cases where that's worth a try:

Counting sort: When you are sorting integers with a limited range.

Radix sort: When log(N) is significantly larger than K, where K is the number of radix digits.

Bucket sort: When you can guarantee that your input is approximately uniformly distributed.


What the provided links to comparisons/animations do not consider is when the amount of data exceed available memory --- at which point the number of passes over the data, i.e. I/O-costs, dominate the runtime. If you need to do that, read up on "external sorting" which usually cover variants of merge- and heap sorts.

http://corte.si/posts/code/visualisingsorting/index.html and http://corte.si/posts/code/timsort/index.html also have some cool images comparing various sorting algorithms.


Quicksort is usually the fastest on average, but It has some pretty nasty worst-case behaviors. So if you have to guarantee no bad data gives you O(N^2), you should avoid it.

Merge-sort uses extra memory, but is particularly suitable for external sorting (i.e. huge files that don't fit into memory).

Heap-sort can sort in-place and doesn't have the worst case quadratic behavior, but on average is slower than quicksort in most cases.

Where only integers in a restricted range are involved, you can use some kind of radix sort to make it very fast.

In 99% of the cases, you'll be fine with the library sorts, which are usually based on quicksort.


@dsimcha wrote: Counting sort: When you are sorting integers with a limited range

I would change that to:

Counting sort: When you sort positive integers (0 - Integer.MAX_VALUE-2 due to the pigeonhole).

You can always get the max and min values as an efficiency heuristic in linear time as well.
Also you need at least n extra space for the intermediate array and it is stable obviously.

/**
* Some VMs reserve some header words in an array.
* Attempts to allocate larger arrays may result in
* OutOfMemoryError: Requested array size exceeds VM limit
*/
private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

(even though it actually will allow to MAX_VALUE-2) see: Do Java arrays have a maximum size?

Also I would explain that radix sort complexity is O(wn) for n keys which are integers of word size w. Sometimes w is presented as a constant, which would make radix sort better (for sufficiently large n) than the best comparison-based sorting algorithms, which all perform O(n log n) comparisons to sort n keys. However, in general w cannot be considered a constant: if all n keys are distinct, then w has to be at least log n for a random-access machine to be able to store them in memory, which gives at best a time complexity O(n log n). (from wikipedia)