[.net] HashSet vs. List performance

It's clear that a search performance of the generic HashSet<T> class is higher than of the generic List<T> class. Just compare the hash-based key with the linear approach in the List<T> class.

However calculating a hash key may itself take some CPU cycles, so for a small amount of items the linear search can be a real alternative to the HashSet<T>.

My question: where is the break-even?

To simplify the scenario (and to be fair) let's assume that the List<T> class uses the element's Equals() method to identify an item.

This question is related to .net performance collections list hash

The answer is


It depends. If the exact answer really matters, do some profiling and find out. If you're sure you'll never have more than a certain number of elements in the set, go with a List. If the number is unbounded, use a HashSet.


Whether to use a HashSet<> or List<> comes down to how you need to access your collection. If you need to guarantee the order of items, use a List. If you don't, use a HashSet. Let Microsoft worry about the implementation of their hashing algorithms and objects.

A HashSet will access items without having to enumerate the collection (complexity of O(1) or near it), and because a List guarantees order, unlike a HashSet, some items will have to be enumerated (complexity of O(n)).


You're looking at this wrong. Yes a linear search of a List will beat a HashSet for a small number of items. But the performance difference usually doesn't matter for collections that small. It's generally the large collections you have to worry about, and that's where you think in terms of Big-O. However, if you've measured a real bottleneck on HashSet performance, then you can try to create a hybrid List/HashSet, but you'll do that by conducting lots of empirical performance tests - not asking questions on SO.


You can use a HybridDictionary which automaticly detects the breaking point, and accepts null-values, making it essentialy the same as a HashSet.


The breakeven will depend on the cost of computing the hash. Hash computations can be trivial, or not... :-) There is always the System.Collections.Specialized.HybridDictionary class to help you not have to worry about the breakeven point.


Just thought I'd chime in with some benchmarks for different scenarios to illustrate the previous answers:

  1. A few (12 - 20) small strings (length between 5 and 10 characters)
  2. Many (~10K) small strings
  3. A few long strings (length between 200 and 1000 characters)
  4. Many (~5K) long strings
  5. A few integers
  6. Many (~10K) integers

And for each scenario, looking up values which appear:

  1. In the beginning of the list ("start", index 0)
  2. Near the beginning of the list ("early", index 1)
  3. In the middle of the list ("middle", index count/2)
  4. Near the end of the list ("late", index count-2)
  5. At the end of the list ("end", index count-1)

Before each scenario I generated randomly sized lists of random strings, and then fed each list to a hashset. Each scenario ran 10,000 times, essentially:

(test pseudocode)

stopwatch.start
for X times
    exists = list.Contains(lookup);
stopwatch.stop

stopwatch.start
for X times
    exists = hashset.Contains(lookup);
stopwatch.stop

Sample Output

Tested on Windows 7, 12GB Ram, 64 bit, Xeon 2.8GHz

---------- Testing few small strings ------------
Sample items: (16 total)
vgnwaloqf diwfpxbv tdcdc grfch icsjwk
...

Benchmarks:
1: hashset: late -- 100.00 % -- [Elapsed: 0.0018398 sec]
2: hashset: middle -- 104.19 % -- [Elapsed: 0.0019169 sec]
3: hashset: end -- 108.21 % -- [Elapsed: 0.0019908 sec]
4: list: early -- 144.62 % -- [Elapsed: 0.0026607 sec]
5: hashset: start -- 174.32 % -- [Elapsed: 0.0032071 sec]
6: list: middle -- 187.72 % -- [Elapsed: 0.0034536 sec]
7: list: late -- 192.66 % -- [Elapsed: 0.0035446 sec]
8: list: end -- 215.42 % -- [Elapsed: 0.0039633 sec]
9: hashset: early -- 217.95 % -- [Elapsed: 0.0040098 sec]
10: list: start -- 576.55 % -- [Elapsed: 0.0106073 sec]


---------- Testing many small strings ------------
Sample items: (10346 total)
dmnowa yshtrxorj vthjk okrxegip vwpoltck
...

Benchmarks:
1: hashset: end -- 100.00 % -- [Elapsed: 0.0017443 sec]
2: hashset: late -- 102.91 % -- [Elapsed: 0.0017951 sec]
3: hashset: middle -- 106.23 % -- [Elapsed: 0.0018529 sec]
4: list: early -- 107.49 % -- [Elapsed: 0.0018749 sec]
5: list: start -- 126.23 % -- [Elapsed: 0.0022018 sec]
6: hashset: early -- 134.11 % -- [Elapsed: 0.0023393 sec]
7: hashset: start -- 372.09 % -- [Elapsed: 0.0064903 sec]
8: list: middle -- 48,593.79 % -- [Elapsed: 0.8476214 sec]
9: list: end -- 99,020.73 % -- [Elapsed: 1.7272186 sec]
10: list: late -- 99,089.36 % -- [Elapsed: 1.7284155 sec]


---------- Testing few long strings ------------
Sample items: (19 total)
hidfymjyjtffcjmlcaoivbylakmqgoiowbgxpyhnrreodxyleehkhsofjqenyrrtlphbcnvdrbqdvji...
...

Benchmarks:
1: list: early -- 100.00 % -- [Elapsed: 0.0018266 sec]
2: list: start -- 115.76 % -- [Elapsed: 0.0021144 sec]
3: list: middle -- 143.44 % -- [Elapsed: 0.0026201 sec]
4: list: late -- 190.05 % -- [Elapsed: 0.0034715 sec]
5: list: end -- 193.78 % -- [Elapsed: 0.0035395 sec]
6: hashset: early -- 215.00 % -- [Elapsed: 0.0039271 sec]
7: hashset: end -- 248.47 % -- [Elapsed: 0.0045386 sec]
8: hashset: start -- 298.04 % -- [Elapsed: 0.005444 sec]
9: hashset: middle -- 325.63 % -- [Elapsed: 0.005948 sec]
10: hashset: late -- 431.62 % -- [Elapsed: 0.0078839 sec]


---------- Testing many long strings ------------
Sample items: (5000 total)
yrpjccgxjbketcpmnvyqvghhlnjblhgimybdygumtijtrwaromwrajlsjhxoselbucqualmhbmwnvnpnm
...

Benchmarks:
1: list: early -- 100.00 % -- [Elapsed: 0.0016211 sec]
2: list: start -- 132.73 % -- [Elapsed: 0.0021517 sec]
3: hashset: start -- 231.26 % -- [Elapsed: 0.003749 sec]
4: hashset: end -- 368.74 % -- [Elapsed: 0.0059776 sec]
5: hashset: middle -- 385.50 % -- [Elapsed: 0.0062493 sec]
6: hashset: late -- 406.23 % -- [Elapsed: 0.0065854 sec]
7: hashset: early -- 421.34 % -- [Elapsed: 0.0068304 sec]
8: list: middle -- 18,619.12 % -- [Elapsed: 0.3018345 sec]
9: list: end -- 40,942.82 % -- [Elapsed: 0.663724 sec]
10: list: late -- 41,188.19 % -- [Elapsed: 0.6677017 sec]


---------- Testing few ints ------------
Sample items: (16 total)
7266092 60668895 159021363 216428460 28007724
...

Benchmarks:
1: hashset: early -- 100.00 % -- [Elapsed: 0.0016211 sec]
2: hashset: end -- 100.45 % -- [Elapsed: 0.0016284 sec]
3: list: early -- 101.83 % -- [Elapsed: 0.0016507 sec]
4: hashset: late -- 108.95 % -- [Elapsed: 0.0017662 sec]
5: hashset: middle -- 112.29 % -- [Elapsed: 0.0018204 sec]
6: hashset: start -- 120.33 % -- [Elapsed: 0.0019506 sec]
7: list: late -- 134.45 % -- [Elapsed: 0.0021795 sec]
8: list: start -- 136.43 % -- [Elapsed: 0.0022117 sec]
9: list: end -- 169.77 % -- [Elapsed: 0.0027522 sec]
10: list: middle -- 237.94 % -- [Elapsed: 0.0038573 sec]


---------- Testing many ints ------------
Sample items: (10357 total)
370826556 569127161 101235820 792075135 270823009
...

Benchmarks:
1: list: early -- 100.00 % -- [Elapsed: 0.0015132 sec]
2: hashset: end -- 101.79 % -- [Elapsed: 0.0015403 sec]
3: hashset: early -- 102.08 % -- [Elapsed: 0.0015446 sec]
4: hashset: middle -- 103.21 % -- [Elapsed: 0.0015618 sec]
5: hashset: late -- 104.26 % -- [Elapsed: 0.0015776 sec]
6: list: start -- 126.78 % -- [Elapsed: 0.0019184 sec]
7: hashset: start -- 130.91 % -- [Elapsed: 0.0019809 sec]
8: list: middle -- 16,497.89 % -- [Elapsed: 0.2496461 sec]
9: list: end -- 32,715.52 % -- [Elapsed: 0.4950512 sec]
10: list: late -- 33,698.87 % -- [Elapsed: 0.5099313 sec]

Depends on a lot of factors... List implementation, CPU architecture, JVM, loop semantics, complexity of equals method, etc... By the time the list gets big enough to effectively benchmark (1000+ elements), Hash-based binary lookups beat linear searches hands-down, and the difference only scales up from there.

Hope this helps!


One factor your not taking into account is the robustness of the GetHashcode() function. With a perfect hash function the HashSet will clearly have better searching performance. But as the hash function diminishes so will the HashSet search time.


It's essentially pointless to compare two structures for performance that behave differently. Use the structure that conveys the intent. Even if you say your List<T> wouldn't have duplicates and iteration order doesn't matter making it comparable to a HashSet<T>, its still a poor choice to use List<T> because its relatively less fault tolerant.

That said, I will inspect some other aspects of performance,

+------------+--------+-------------+-----------+----------+----------+-----------+
| Collection | Random | Containment | Insertion | Addition |  Removal | Memory    |
|            | access |             |           |          |          |           |
+------------+--------+-------------+-----------+----------+----------+-----------+
| List<T>    | O(1)   | O(n)        | O(n)      | O(1)*    | O(n)     | Lesser    |
| HashSet<T> | O(n)   | O(1)        | n/a       | O(1)     | O(1)     | Greater** |
+------------+--------+-------------+-----------+----------+----------+-----------+
  • Even though addition is O(1) in both cases, it will be relatively slower in HashSet since it involves cost of precomputing hash code before storing it.

  • The superior scalability of HashSet has a memory cost. Every entry is stored as a new object along with its hash code. This article might give you an idea.


Depends on what you're hashing. If your keys are integers you probably don't need very many items before the HashSet is faster. If you're keying it on a string then it will be slower, and depends on the input string.

Surely you could whip up a benchmark pretty easily?


The answer, as always, is "It depends". I assume from the tags you're talking about C#.

Your best bet is to determine

  1. A Set of data
  2. Usage requirements

and write some test cases.

It also depends on how you sort the list (if it's sorted at all), what kind of comparisons need to be made, how long the "Compare" operation takes for the particular object in the list, or even how you intend to use the collection.

Generally, the best one to choose isn't so much based on the size of data you're working with, but rather how you intend to access it. Do you have each piece of data associated with a particular string, or other data? A hash based collection would probably be best. Is the order of the data you're storing important, or are you going to need to access all of the data at the same time? A regular list may be better then.

Additional:

Of course, my above comments assume 'performance' means data access. Something else to consider: what are you looking for when you say "performance"? Is performance individual value look up? Is it management of large (10000, 100000 or more) value sets? Is it the performance of filling the data structure with data? Removing data? Accessing individual bits of data? Replacing values? Iterating over the values? Memory usage? Data copying speed? For example, If you access data by a string value, but your main performance requirement is minimal memory usage, you might have conflicting design issues.


Examples related to .net

You must add a reference to assembly 'netstandard, Version=2.0.0.0 How to use Bootstrap 4 in ASP.NET Core No authenticationScheme was specified, and there was no DefaultChallengeScheme found with default authentification and custom authorization .net Core 2.0 - Package was restored using .NetFramework 4.6.1 instead of target framework .netCore 2.0. The package may not be fully compatible Update .NET web service to use TLS 1.2 EF Core add-migration Build Failed What is the difference between .NET Core and .NET Standard Class Library project types? Visual Studio 2017 - Could not load file or assembly 'System.Runtime, Version=4.1.0.0' or one of its dependencies Nuget connection attempt failed "Unable to load the service index for source" Token based authentication in Web API without any user interface

Examples related to performance

Why is 2 * (i * i) faster than 2 * i * i in Java? What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism? How to check if a key exists in Json Object and get its value Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly? Most efficient way to map function over numpy array The most efficient way to remove first N elements in a list? Fastest way to get the first n elements of a List into an Array Why is "1000000000000000 in range(1000000000000001)" so fast in Python 3? pandas loc vs. iloc vs. at vs. iat? Android Recyclerview vs ListView with Viewholder

Examples related to collections

Kotlin's List missing "add", "remove", Map missing "put", etc? How to unset (remove) a collection element after fetching it? How can I get a List from some class properties with Java 8 Stream? Java 8 stream map to list of keys sorted by values How to convert String into Hashmap in java How can I turn a List of Lists into a List in Java 8? MongoDB Show all contents from all collections Get nth character of a string in Swift programming language Java 8 Distinct by property Is there a typescript List<> and/or Map<> class/library?

Examples related to list

Convert List to Pandas Dataframe Column Python find elements in one list that are not in the other Sorting a list with stream.sorted() in Java Python Loop: List Index Out of Range How to combine two lists in R How do I multiply each element in a list by a number? Save a list to a .txt file The most efficient way to remove first N elements in a list? TypeError: list indices must be integers or slices, not str Parse JSON String into List<string>

Examples related to hash

php mysqli_connect: authentication method unknown to the client [caching_sha2_password] What is Hash and Range Primary Key? How to create a laravel hashed password Hashing a file in Python PHP salt and hash SHA256 for login password Append key/value pair to hash with << in Ruby Are there any SHA-256 javascript implementations that are generally considered trustworthy? How do I generate a SALT in Java for Salted-Hash? What does hash do in python? Hashing with SHA1 Algorithm in C#