HashSet vs List performance

Question

It's clear that the search performance of the generic HashSet<T> class is higher than that of the generic List<T> class. Just compare the hash-based key with the linear approach in the List<T> class.

However, calculating a hash key may itself take some CPU cycles, so for a small number of items the linear search can be a real alternative to the HashSet<T>.

My question: where is the break-even?

To simplify the scenario (and to be fair), let's assume that the List<T> class uses the element's Equals() method to identify an item.

User · Accepted Answer

A lot of people are saying that once you get to the size where speed is actually a concern that HashSet<T> will always beat List<T>, but that depends on what you are doing.

Let's say you have a List<T> that will only ever have on average 5 items in it. Over a large number of cycles, if a single item is added or removed each cycle, you may well be better off using a List<T>.

I did a test for this on my machine, and, well, it has to be very very small to get an advantage from List<T>. For a list of short strings, the advantage went away after size 5, for objects after size 20.

1 item LIST strs time: 617ms
1 item HASHSET strs time: 1332ms

2 item LIST strs time: 781ms
2 item HASHSET strs time: 1354ms

3 item LIST strs time: 950ms
3 item HASHSET strs time: 1405ms

4 item LIST strs time: 1126ms
4 item HASHSET strs time: 1441ms

5 item LIST strs time: 1370ms
5 item HASHSET strs time: 1452ms

6 item LIST strs time: 1481ms
6 item HASHSET strs time: 1418ms

7 item LIST strs time: 1581ms
7 item HASHSET strs time: 1464ms

8 item LIST strs time: 1726ms
8 item HASHSET strs time: 1398ms

9 item LIST strs time: 1901ms
9 item HASHSET strs time: 1433ms

1 item LIST objs time: 614ms
1 item HASHSET objs time: 1993ms

4 item LIST objs time: 837ms
4 item HASHSET objs time: 1914ms

7 item LIST objs time: 1070ms
7 item HASHSET objs time: 1900ms

10 item LIST objs time: 1267ms
10 item HASHSET objs time: 1904ms

13 item LIST objs time: 1494ms
13 item HASHSET objs time: 1893ms

16 item LIST objs time: 1695ms
16 item HASHSET objs time: 1879ms

19 item LIST objs time: 1902ms
19 item HASHSET objs time: 1950ms

22 item LIST objs time: 2136ms
22 item HASHSET objs time: 1893ms

25 item LIST objs time: 2357ms
25 item HASHSET objs time: 1826ms

28 item LIST objs time: 2555ms
28 item HASHSET objs time: 1865ms

31 item LIST objs time: 2755ms
31 item HASHSET objs time: 1963ms

34 item LIST objs time: 3025ms
34 item HASHSET objs time: 1874ms

37 item LIST objs time: 3195ms
37 item HASHSET objs time: 1958ms

40 item LIST objs time: 3401ms
40 item HASHSET objs time: 1855ms

43 item LIST objs time: 3618ms
43 item HASHSET objs time: 1869ms

46 item LIST objs time: 3883ms
46 item HASHSET objs time: 2046ms

49 item LIST objs time: 4218ms
49 item HASHSET objs time: 1873ms

Here is that data displayed as a graph:

[Graph of the LIST and HASHSET timings above, by item count; image not available]

Here's the code:

using System;
using System.Collections.Generic;
using System.Diagnostics;

class Program
{
    static void Main(string[] args)
    {
        int times = 10000000;

        // Strings: remove and re-add one element per iteration, for sizes 1..9.
        for (int listSize = 1; listSize < 10; listSize++)
        {
            List<string> list = new List<string>();
            HashSet<string> hashset = new HashSet<string>();

            for (int i = 0; i < listSize; i++)
            {
                list.Add("string" + i.ToString());
                hashset.Add("string" + i.ToString());
            }

            // "string0" sits at index 0, so the first Remove is the list's worst case
            // for shifting; after that it is re-added at the end and found last.
            Stopwatch timer = Stopwatch.StartNew();
            for (int i = 0; i < times; i++)
            {
                list.Remove("string0");
                list.Add("string0");
            }
            timer.Stop();
            Console.WriteLine(listSize + " item LIST strs time: " + timer.ElapsedMilliseconds + "ms");

            timer = Stopwatch.StartNew();
            for (int i = 0; i < times; i++)
            {
                hashset.Remove("string0");
                hashset.Add("string0");
            }
            timer.Stop();
            Console.WriteLine(listSize + " item HASHSET strs time: " + timer.ElapsedMilliseconds + "ms");
            Console.WriteLine();
        }

        // Plain objects (reference equality, default hash code), sizes 1..49 in steps of 3.
        for (int listSize = 1; listSize < 50; listSize += 3)
        {
            List<object> list = new List<object>();
            HashSet<object> hashset = new HashSet<object>();

            for (int i = 0; i < listSize; i++)
            {
                // Add the same instance to both collections, so that the object
                // removed below is actually present in the HashSet as well.
                object item = new object();
                list.Add(item);
                hashset.Add(item);
            }

            object objToAddRem = list[0];

            Stopwatch timer = Stopwatch.StartNew();
            for (int i = 0; i < times; i++)
            {
                list.Remove(objToAddRem);
                list.Add(objToAddRem);
            }
            timer.Stop();
            Console.WriteLine(listSize + " item LIST objs time: " + timer.ElapsedMilliseconds + "ms");

            timer = Stopwatch.StartNew();
            for (int i = 0; i < times; i++)
            {
                hashset.Remove(objToAddRem);
                hashset.Add(objToAddRem);
            }
            timer.Stop();
            Console.WriteLine(listSize + " item HASHSET objs time: " + timer.ElapsedMilliseconds + "ms");
            Console.WriteLine();
        }

        Console.ReadLine();
    }
}

User · Answer

Depends on a lot of factors: List implementation, CPU architecture, JVM, loop semantics, complexity of the equals method, etc. By the time the list gets big enough to effectively benchmark (1000+ elements), hash-based lookups beat linear searches hands-down, and the difference only scales up from there.

Hope this helps!

User · Answer

The answer, as always, is "It depends". I assume from the tags you're talking about C#.

Your best bet is to determine:

- A set of data
- Usage requirements

and write some test cases.

It also depends on how you sort the list (if it's sorted at all), what kind of comparisons need to be made, how long the "Compare" operation takes for the particular object in the list, or even how you intend to use the collection.

Generally, the best one to choose isn't so much based on the size of the data you're working with, but rather on how you intend to access it. Do you have each piece of data associated with a particular string, or other data? A hash-based collection would probably be best. Is the order of the data you're storing important, or are you going to need to access all of the data at the same time? A regular list may be better then.

Additional:

Of course, my above comments assume "performance" means data access. Something else to consider: what are you looking for when you say "performance"? Is performance individual value lookup? Is it management of large (10000, 100000 or more) value sets? Is it the performance of filling the data structure with data? Removing data? Accessing individual bits of data? Replacing values? Iterating over the values? Memory usage? Data copying speed? For example, if you access data by a string value, but your main performance requirement is minimal memory usage, you might have conflicting design issues.

User · Answer

The breakeven will depend on the cost of computing the hash. Hash computations can be trivial, or not... :-) There is always the System.Collections.Specialized.HybridDictionary class to help you not have to worry about the breakeven point.
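As a rough illustration of that suggestion, here is a minimal sketch that uses HybridDictionary as a set by storing keys with a dummy null value. The class and method names (HybridDictionaryDemo, BuildSet) are made up for this example. HybridDictionary is non-generic, so keys are handled as object and value types would be boxed, which may well cancel out any gain for small collections.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.Specialized;

class HybridDictionaryDemo
{
    // Hypothetical helper: builds a set-like HybridDictionary from the given items.
    public static HybridDictionary BuildSet(IEnumerable<string> items)
    {
        // Backed by a ListDictionary while small; switches to a Hashtable as it grows.
        var seen = new HybridDictionary();
        foreach (string s in items)
        {
            // Add throws on duplicate keys, so check first. The value is unused.
            if (!seen.Contains(s))
                seen.Add(s, null);
        }
        return seen;
    }

    static void Main()
    {
        HybridDictionary seen = BuildSet(new[] { "alpha", "beta", "gamma", "beta" });

        Console.WriteLine(seen.Contains("beta"));   // True
        Console.WriteLine(seen.Contains("delta"));  // False
        Console.WriteLine(seen.Count);              // 3
    }
}
```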

User · Answer

Whether to use a HashSet<> or List<> comes down to how you need to access your collection. If you need to guarantee the order of items, use a List. If you don't, use a HashSet. Let Microsoft worry about the implementation of their hashing algorithms and objects.

A HashSet will access items without having to enumerate the collection (complexity of O(1) or near it), and because a List guarantees order, unlike a HashSet, some items will have to be enumerated (complexity of O(n)).

User · Answer

It depends. If the exact answer really matters, do some profiling and find out. If you're sure you'll never have more than a certain number of elements in the set, go with a List. If the number is unbounded, use a HashSet.

User · Answer

Just thought I'd chime in with some benchmarks for different scenarios to illustrate the previous answers:

- A few (12 - 20) small strings (length between 5 and 10 characters)
- Many (~10K) small strings
- A few long strings (length between 200 and 1000 characters)
- Many (~5K) long strings
- A few integers
- Many (~10K) integers

And for each scenario, looking up values which appear:

- In the beginning of the list ("start", index 0)
- Near the beginning of the list ("early", index 1)
- In the middle of the list ("middle", index count/2)
- Near the end of the list ("late", index count-2)
- At the end of the list ("end", index count-1)

Before each scenario I generated randomly sized lists of random strings, and then fed each list to a hashset. Each scenario ran 10,000 times, essentially:

(test pseudocode)

    stopwatch.start
    for X times
        exists = list.Contains(lookup);
    stopwatch.stop

    stopwatch.start
    for X times
        exists = hashset.Contains(lookup);
    stopwatch.stop

Sample Output

Tested on Windows 7, 12GB Ram, 64 bit, Xeon 2.8GHz

---------- Testing few small strings ------------
Sample items: (16 total)
vgnwaloqf diwfpxbv tdcdc grfch icsjwk ...

Benchmarks:
1: hashset: late -- 100.00 % -- [Elapsed: 0.0018398 sec]
2: hashset: middle -- 104.19 % -- [Elapsed: 0.0019169 sec]
3: hashset: end -- 108.21 % -- [Elapsed: 0.0019908 sec]
4: list: early -- 144.62 % -- [Elapsed: 0.0026607 sec]
5: hashset: start -- 174.32 % -- [Elapsed: 0.0032071 sec]
6: list: middle -- 187.72 % -- [Elapsed: 0.0034536 sec]
7: list: late -- 192.66 % -- [Elapsed: 0.0035446 sec]
8: list: end -- 215.42 % -- [Elapsed: 0.0039633 sec]
9: hashset: early -- 217.95 % -- [Elapsed: 0.0040098 sec]
10: list: start -- 576.55 % -- [Elapsed: 0.0106073 sec]

---------- Testing many small strings ------------
Sample items: (10346 total)
dmnowa yshtrxorj vthjk okrxegip vwpoltck ...

Benchmarks:
1: hashset: end -- 100.00 % -- [Elapsed: 0.0017443 sec]
2: hashset: late -- 102.91 % -- [Elapsed: 0.0017951 sec]
3: hashset: middle -- 106.23 % -- [Elapsed: 0.0018529 sec]
4: list: early -- 107.49 % -- [Elapsed: 0.0018749 sec]
5: list: start -- 126.23 % -- [Elapsed: 0.0022018 sec]
6: hashset: early -- 134.11 % -- [Elapsed: 0.0023393 sec]
7: hashset: start -- 372.09 % -- [Elapsed: 0.0064903 sec]
8: list: middle -- 48,593.79 % -- [Elapsed: 0.8476214 sec]
9: list: end -- 99,020.73 % -- [Elapsed: 1.7272186 sec]
10: list: late -- 99,089.36 % -- [Elapsed: 1.7284155 sec]

---------- Testing few long strings ------------
Sample items: (19 total)
hidfymjyjtffcjmlcaoivbylakmqgoiowbgxpyhnrreodxyleehkhsofjqenyrrtlphbcnvdrbqdvji...

Benchmarks:
1: list: early -- 100.00 % -- [Elapsed: 0.0018266 sec]
2: list: start -- 115.76 % -- [Elapsed: 0.0021144 sec]
3: list: middle -- 143.44 % -- [Elapsed: 0.0026201 sec]
4: list: late -- 190.05 % -- [Elapsed: 0.0034715 sec]
5: list: end -- 193.78 % -- [Elapsed: 0.0035395 sec]
6: hashset: early -- 215.00 % -- [Elapsed: 0.0039271 sec]
7: hashset: end -- 248.47 % -- [Elapsed: 0.0045386 sec]
8: hashset: start -- 298.04 % -- [Elapsed: 0.005444 sec]
9: hashset: middle -- 325.63 % -- [Elapsed: 0.005948 sec]
10: hashset: late -- 431.62 % -- [Elapsed: 0.0078839 sec]

---------- Testing many long strings ------------
Sample items: (5000 total)
yrpjccgxjbketcpmnvyqvghhlnjblhgimybdygumtijtrwaromwrajlsjhxoselbucqualmhbmwnvnpnm...

Benchmarks:
1: list: early -- 100.00 % -- [Elapsed: 0.0016211 sec]
2: list: start -- 132.73 % -- [Elapsed: 0.0021517 sec]
3: hashset: start -- 231.26 % -- [Elapsed: 0.003749 sec]
4: hashset: end -- 368.74 % -- [Elapsed: 0.0059776 sec]
5: hashset: middle -- 385.50 % -- [Elapsed: 0.0062493 sec]
6: hashset: late -- 406.23 % -- [Elapsed: 0.0065854 sec]
7: hashset: early -- 421.34 % -- [Elapsed: 0.0068304 sec]
8: list: middle -- 18,619.12 % -- [Elapsed: 0.3018345 sec]
9: list: end -- 40,942.82 % -- [Elapsed: 0.663724 sec]
10: list: late -- 41,188.19 % -- [Elapsed: 0.6677017 sec]

---------- Testing few ints ------------
Sample items: (16 total)
7266092 60668895 159021363 216428460 28007724 ...

Benchmarks:
1: hashset: early -- 100.00 % -- [Elapsed: 0.0016211 sec]
2: hashset: end -- 100.45 % -- [Elapsed: 0.0016284 sec]
3: list: early -- 101.83 % -- [Elapsed: 0.0016507 sec]
4: hashset: late -- 108.95 % -- [Elapsed: 0.0017662 sec]
5: hashset: middle -- 112.29 % -- [Elapsed: 0.0018204 sec]
6: hashset: start -- 120.33 % -- [Elapsed: 0.0019506 sec]
7: list: late -- 134.45 % -- [Elapsed: 0.0021795 sec]
8: list: start -- 136.43 % -- [Elapsed: 0.0022117 sec]
9: list: end -- 169.77 % -- [Elapsed: 0.0027522 sec]
10: list: middle -- 237.94 % -- [Elapsed: 0.0038573 sec]

---------- Testing many ints ------------
Sample items: (10357 total)
370826556 569127161 101235820 792075135 270823009 ...

Benchmarks:
1: list: early -- 100.00 % -- [Elapsed: 0.0015132 sec]
2: hashset: end -- 101.79 % -- [Elapsed: 0.0015403 sec]
3: hashset: early -- 102.08 % -- [Elapsed: 0.0015446 sec]
4: hashset: middle -- 103.21 % -- [Elapsed: 0.0015618 sec]
5: hashset: late -- 104.26 % -- [Elapsed: 0.0015776 sec]
6: list: start -- 126.78 % -- [Elapsed: 0.0019184 sec]
7: hashset: start -- 130.91 % -- [Elapsed: 0.0019809 sec]
8: list: middle -- 16,497.89 % -- [Elapsed: 0.2496461 sec]
9: list: end -- 32,715.52 % -- [Elapsed: 0.4950512 sec]
10: list: late -- 33,698.87 % -- [Elapsed: 0.5099313 sec]
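The test loop described in this answer could be written in C# roughly as follows. This is only a sketch of the "few small strings" scenario, not the answerer's actual harness: the helper names (RandomString, Time), the fixed seed, and the list size of 16 are assumptions. Calling Contains through a delegate adds the same small overhead to both collections.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

class ContainsBenchmark
{
    // Hypothetical helper: a random lowercase string of the given length.
    public static string RandomString(Random rng, int length)
    {
        var chars = new char[length];
        for (int i = 0; i < length; i++)
            chars[i] = (char)('a' + rng.Next(26));
        return new string(chars);
    }

    // Times `iterations` calls of contains(lookup) and returns the elapsed time.
    public static TimeSpan Time(Func<string, bool> contains, string lookup, int iterations)
    {
        bool exists = false;
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            exists = contains(lookup);
        sw.Stop();
        GC.KeepAlive(exists); // keep the result observably used
        return sw.Elapsed;
    }

    static void Main()
    {
        var rng = new Random(42);
        var list = new List<string>();
        for (int i = 0; i < 16; i++)
            list.Add(RandomString(rng, rng.Next(5, 11))); // lengths 5..10
        var hashset = new HashSet<string>(list);

        // Lookup positions as named in the answer above.
        var positions = new Dictionary<string, int>
        {
            { "start", 0 }, { "early", 1 }, { "middle", list.Count / 2 },
            { "late", list.Count - 2 }, { "end", list.Count - 1 }
        };

        const int iterations = 10000;
        foreach (KeyValuePair<string, int> p in positions)
        {
            string lookup = list[p.Value];
            Console.WriteLine("list: " + p.Key + " -- " + Time(list.Contains, lookup, iterations).TotalSeconds + " sec");
            Console.WriteLine("hashset: " + p.Key + " -- " + Time(hashset.Contains, lookup, iterations).TotalSeconds + " sec");
        }
    }
}
```

Timings this short are dominated by JIT warm-up and measurement noise, so a warm-up pass and several repetitions would be needed before reading much into the ordering.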

User · Answer

One factor you're not taking into account is the robustness of the GetHashCode() function. With a perfect hash function the HashSet will clearly have better searching performance. But as the quality of the hash function diminishes, so will the HashSet's search performance.
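To make this effect visible without relying on timings, the sketch below counts how many Equals calls a single failed lookup costs under a well-distributed hash and under a deliberately bad one that returns the same value for every key. CountingComparer and EqualsCallsForMiss are names invented for this illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer: counts Equals calls and can return a constant hash code.
class CountingComparer : IEqualityComparer<int>
{
    private readonly bool constantHash;
    public int EqualsCalls;

    public CountingComparer(bool constantHash) { this.constantHash = constantHash; }

    public bool Equals(int x, int y) { EqualsCalls++; return x == y; }

    // A constant hash code puts every element into the same bucket.
    public int GetHashCode(int x) { return constantHash ? 0 : x; }
}

class HashQualityDemo
{
    // Returns the number of Equals calls made by one lookup of a missing value.
    public static int EqualsCallsForMiss(bool constantHash, int count)
    {
        var comparer = new CountingComparer(constantHash);
        var set = new HashSet<int>(comparer);
        for (int i = 0; i < count; i++)
            set.Add(i);

        comparer.EqualsCalls = 0;   // ignore comparisons made while filling the set
        set.Contains(-1);           // -1 was never added
        return comparer.EqualsCalls;
    }

    static void Main()
    {
        Console.WriteLine("good hash: " + EqualsCallsForMiss(false, 1000) + " Equals calls");
        Console.WriteLine("bad hash:  " + EqualsCallsForMiss(true, 1000) + " Equals calls");
    }
}
```

With the constant hash, the lookup has to walk through the colliding entries one by one, so the HashSet behaves much like a linear search with extra overhead.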

User · Answer

It's essentially pointless to compare two structures for performance that behave differently. Use the structure that conveys the intent. Even if you say your List<T> wouldn't have duplicates and iteration order doesn't matter, making it comparable to a HashSet<T>, it's still a poor choice to use List<T> because it's relatively less fault tolerant.

That said, I will inspect some other aspects of performance:

+------------+--------+-------------+-----------+----------+---------+---------+
| Collection | Random | Containment | Insertion | Addition | Removal | Memory  |
|            | access |             |           |          |         |         |
+------------+--------+-------------+-----------+----------+---------+---------+
| List<T>    | O(1)   | O(n)        | O(n)      | O(1)     | O(n)    | Lesser  |
| HashSet<T> | O(n)   | O(1)        | n/a       | O(1)     | O(1)    | Greater |
+------------+--------+-------------+-----------+----------+---------+---------+

Even though addition is O(1) in both cases, it will be relatively slower in HashSet since it involves the cost of computing the hash code before storing the item.

The superior scalability of HashSet has a memory cost. Every entry is stored along with its hash code. This article might give you an idea.
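The behavioural differences behind that table can be seen in a few lines. This is a small illustrative sketch (the class name CollectionOpsDemo is made up); the comments name the table column each call corresponds to.

```csharp
using System;
using System.Collections.Generic;

class CollectionOpsDemo
{
    static void Main()
    {
        var list = new List<int> { 10, 20, 30 };
        var set = new HashSet<int> { 10, 20, 30 };

        // Random access: List<T> has an indexer, HashSet<T> does not.
        Console.WriteLine(list[1]);            // 20

        // Insertion at a position: shifts later elements; not available on HashSet<T>.
        list.Insert(1, 15);                    // list is now 10, 15, 20, 30

        // Addition: List<T> accepts duplicates, HashSet<T> rejects them.
        list.Add(10);
        bool added = set.Add(10);
        Console.WriteLine(list.Count);         // 5
        Console.WriteLine(added);              // False

        // Containment: linear scan on the list, hash lookup on the set.
        Console.WriteLine(list.Contains(30));  // True
        Console.WriteLine(set.Contains(30));   // True

        // Removal: the list searches for the first match, then shifts the tail.
        list.Remove(20);
        set.Remove(20);
        Console.WriteLine(list.Count);         // 4
        Console.WriteLine(set.Count);          // 2
    }
}
```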

User · Answer

You can use a HybridDictionary, which automatically detects the breaking point and accepts null values, making it essentially the same as a HashSet.

User · Answer

Depends on what you're hashing. If your keys are integers you probably don't need very many items before the HashSet is faster. If you're keying it on a string then it will be slower, and depends on the input string.

Surely you could whip up a benchmark pretty easily?

User · Answer

You're looking at this wrong. Yes, a linear search of a List will beat a HashSet for a small number of items. But the performance difference usually doesn't matter for collections that small. It's generally the large collections you have to worry about, and that's where you think in terms of Big-O. However, if you've measured a real bottleneck on HashSet performance, then you can try to create a hybrid List/HashSet, but you'll do that by conducting lots of empirical performance tests - not asking questions on SO.
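For what such a hybrid might look like, here is a minimal sketch under stated assumptions: the SmallSet<T> type and its threshold parameter are hypothetical, and the right threshold would have to come from measurements like the ones earlier in this thread. It keeps items in a List<T> while small and moves them into a HashSet<T> once the count exceeds the threshold. For simplicity it never switches back.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical hybrid set: linear search while small, hashing once it grows.
class SmallSet<T>
{
    private readonly int threshold;
    private readonly List<T> items = new List<T>();
    private HashSet<T> index; // null until the threshold is exceeded

    public SmallSet(int threshold) { this.threshold = threshold; }

    public int Count { get { return index != null ? index.Count : items.Count; } }

    public bool Contains(T item)
    {
        return index != null ? index.Contains(item) : items.Contains(item);
    }

    public bool Add(T item)
    {
        if (index != null) return index.Add(item);
        if (items.Contains(item)) return false;   // keep set semantics in list mode

        items.Add(item);
        if (items.Count > threshold)
        {
            // Switch representation once; this sketch never switches back.
            index = new HashSet<T>(items);
            items.Clear();
        }
        return true;
    }

    public bool Remove(T item)
    {
        return index != null ? index.Remove(item) : items.Remove(item);
    }
}

class SmallSetDemo
{
    static void Main()
    {
        var set = new SmallSet<int>(4);           // threshold chosen arbitrarily
        for (int i = 0; i < 10; i++) set.Add(i);

        Console.WriteLine(set.Count);             // 10
        Console.WriteLine(set.Contains(7));       // True
        Console.WriteLine(set.Add(7));            // False
    }
}
```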
