[multithreading] How many threads is too many?

I am writing a server, and I send each action of into a separate thread when the request is received. I do this because almost every request makes a database query. I am using a threadpool library to cut down on construction/destruction of threads.

My question is: what is a good cutoff point for I/O threads like these? I know it would just be a rough estimate, but are we talking hundreds? Thousands?

How would I go about figuring out what this cutoff would be?


EDIT:

Thank you all for your responses, it seems like I am just going to have to test it to find out my thread count ceiling. The question is though: how do I know I've hit that ceiling? What exactly should I measure?

This question is related to multithreading performance threadpool

The answer is


If your threads are performing any kind of resource-intensive work (CPU/Disk) then you'll rarely see benefits beyond one or two, and too many will kill performance very quickly.

The 'best-case' is that your later threads will stall while the first ones complete, or some will have low-overhead blocks on resources with low contention. Worst-case is that you start thrashing the cache/disk/network and your overall throughput drops through the floor.

A good solution is to place requests in a pool that are then dispatched to worker threads from a thread-pool (and yes, avoiding continuous thread creation/destruction is a great first step).

The number of active threads in this pool can then be tweaked and scaled based on the findings of your profiling, the hardware you are running on, and other things that may be occurring on the machine.


In most cases you should allow the thread pool to handle this. If you post some code or give more details it might be easier to see if there is some reason the default behavior of the thread pool would not be best.

You can find more information on how this should work here: http://en.wikipedia.org/wiki/Thread_pool_pattern


As Pax rightly said, measure, don't guess. That what I did for DNSwitness and the results were suprising: the ideal number of threads was much higher than I thought, something like 15,000 threads to get the fastest results.

Of course, it depends on many things, that's why you must measure yourself.

Complete measures (in French only) in Combien de fils d'exécution ?.


This question has been discussed quite thoroughly and I didn't get a chance to read all the responses. But here's few things to take into consideration while looking at the upper limit on number of simultaneous threads that can co-exist peacefully in a given system.

  1. Thread Stack Size : In Linux the default thread stack size is 8MB (you can use ulimit -a to find it out).
  2. Max Virtual memory that a given OS variant supports. Linux Kernel 2.4 supports a memory address space of 2 GB. with Kernel 2.6 , I a bit bigger (3GB )
  3. [1] shows the calculations for the max number of threads per given Max VM Supported. For 2.4 it turns out to be about 255 threads. for 2.6 the number is a bit larger.
  4. What kindda kernel scheduler you have . Comparing Linux 2.4 kernel scheduler with 2.6 , the later gives you a O(1) scheduling with no dependence upon number of tasks existing in a system while first one is more of a O(n). So also the SMP Capabilities of the kernel schedule also play a good role in max number of sustainable threads in a system.

Now you can tune your stack size to incorporate more threads but then you have to take into account the overheads of thread management(creation/destruction and scheduling). You can enforce CPU Affinity to a given process as well as to a given thread to tie them down to specific CPUs to avoid thread migration overheads between the CPUs and avoid cold cash issues.

Note that one can create thousands of threads at his/her wish , but when Linux runs out of VM it just randomly starts killing processes (thus threads). This is to keep the utility profile from being maxed out. (The utility function tells about system wide utility for a given amount of resources. With a constant resources in this case CPU Cycles and Memory, the utility curve flattens out with more and more number of tasks ).

I am sure windows kernel scheduler also does something of this sort to deal with over utilization of the resources

[1] http://adywicaksono.wordpress.com/2007/07/10/i-can-not-create-more-than-255-threads-on-linux-what-is-the-solutions/


ryeguy, I am currently developing a similar application and my threads number is set to 15. Unfortunately if I increase it at 20, it crashes. So, yes, I think the best way to handle this is to measure whether or not your current configuration allows more or less than a number X of threads.


As many threads as the CPU cores is what I've heard very often.


One thing to consider is how many cores exist on the machine that will be executing the code. That represents a hard limit on how many threads can be proceeding at any given time. However, if, as in your case, threads are expected to be frequently waiting for a database to execute a query, you will probably want to tune your threads based on how many concurrent queries the database can process.


As many threads as the CPU cores is what I've heard very often.


The "big iron" answer is generally one thread per limited resource -- processor (CPU bound), arm (I/O bound), etc -- but that only works if you can route the work to the correct thread for the resource to be accessed.

Where that's not possible, consider that you have fungible resources (CPUs) and non-fungible resources (arms). For CPUs it's not critical to assign each thread to a specific CPU (though it helps with cache management), but for arms, if you can't assign a thread to the arm, you get into queuing theory and what's optimal number to keep arms busy. Generally I'm thinking that if you can't route requests based on the arm used, then having 2-3 threads per arm is going to be about right.

A complication comes about when the unit of work passed to the thread doesn't execute a reasonably atomic unit of work. Eg, you may have the thread at one point access the disk, at another point wait on a network. This increases the number of "cracks" where additional threads can get in and do useful work, but it also increases the opportunity for additional threads to pollute each other's caches, etc, and bog the system down.

Of course, you must weigh all this against the "weight" of a thread. Unfortunately, most systems have very heavyweight threads (and what they call "lightweight threads" often aren't threads at all), so it's better to err on the low side.

What I've seen in practice is that very subtle differences can make an enormous difference in how many threads are optimal. In particular, cache issues and lock conflicts can greatly limit the amount of practical concurrency.


If your threads are performing any kind of resource-intensive work (CPU/Disk) then you'll rarely see benefits beyond one or two, and too many will kill performance very quickly.

The 'best-case' is that your later threads will stall while the first ones complete, or some will have low-overhead blocks on resources with low contention. Worst-case is that you start thrashing the cache/disk/network and your overall throughput drops through the floor.

A good solution is to place requests in a pool that are then dispatched to worker threads from a thread-pool (and yes, avoiding continuous thread creation/destruction is a great first step).

The number of active threads in this pool can then be tweaked and scaled based on the findings of your profiling, the hardware you are running on, and other things that may be occurring on the machine.


I think this is a bit of a dodge to your question, but why not fork them into processes? My understanding of networking (from the hazy days of yore, I don't really code networks at all) was that each incoming connection can be handled as a separate process, because then if someone does something nasty in your process, it doesn't nuke the entire program.


In most cases you should allow the thread pool to handle this. If you post some code or give more details it might be easier to see if there is some reason the default behavior of the thread pool would not be best.

You can find more information on how this should work here: http://en.wikipedia.org/wiki/Thread_pool_pattern


The "big iron" answer is generally one thread per limited resource -- processor (CPU bound), arm (I/O bound), etc -- but that only works if you can route the work to the correct thread for the resource to be accessed.

Where that's not possible, consider that you have fungible resources (CPUs) and non-fungible resources (arms). For CPUs it's not critical to assign each thread to a specific CPU (though it helps with cache management), but for arms, if you can't assign a thread to the arm, you get into queuing theory and what's optimal number to keep arms busy. Generally I'm thinking that if you can't route requests based on the arm used, then having 2-3 threads per arm is going to be about right.

A complication comes about when the unit of work passed to the thread doesn't execute a reasonably atomic unit of work. Eg, you may have the thread at one point access the disk, at another point wait on a network. This increases the number of "cracks" where additional threads can get in and do useful work, but it also increases the opportunity for additional threads to pollute each other's caches, etc, and bog the system down.

Of course, you must weigh all this against the "weight" of a thread. Unfortunately, most systems have very heavyweight threads (and what they call "lightweight threads" often aren't threads at all), so it's better to err on the low side.

What I've seen in practice is that very subtle differences can make an enormous difference in how many threads are optimal. In particular, cache issues and lock conflicts can greatly limit the amount of practical concurrency.


As Pax rightly said, measure, don't guess. That what I did for DNSwitness and the results were suprising: the ideal number of threads was much higher than I thought, something like 15,000 threads to get the fastest results.

Of course, it depends on many things, that's why you must measure yourself.

Complete measures (in French only) in Combien de fils d'exécution ?.


I think this is a bit of a dodge to your question, but why not fork them into processes? My understanding of networking (from the hazy days of yore, I don't really code networks at all) was that each incoming connection can be handled as a separate process, because then if someone does something nasty in your process, it doesn't nuke the entire program.


ryeguy, I am currently developing a similar application and my threads number is set to 15. Unfortunately if I increase it at 20, it crashes. So, yes, I think the best way to handle this is to measure whether or not your current configuration allows more or less than a number X of threads.


One thing to consider is how many cores exist on the machine that will be executing the code. That represents a hard limit on how many threads can be proceeding at any given time. However, if, as in your case, threads are expected to be frequently waiting for a database to execute a query, you will probably want to tune your threads based on how many concurrent queries the database can process.


I think this is a bit of a dodge to your question, but why not fork them into processes? My understanding of networking (from the hazy days of yore, I don't really code networks at all) was that each incoming connection can be handled as a separate process, because then if someone does something nasty in your process, it doesn't nuke the entire program.


As Pax rightly said, measure, don't guess. That what I did for DNSwitness and the results were suprising: the ideal number of threads was much higher than I thought, something like 15,000 threads to get the fastest results.

Of course, it depends on many things, that's why you must measure yourself.

Complete measures (in French only) in Combien de fils d'exécution ?.


This question has been discussed quite thoroughly and I didn't get a chance to read all the responses. But here's few things to take into consideration while looking at the upper limit on number of simultaneous threads that can co-exist peacefully in a given system.

  1. Thread Stack Size : In Linux the default thread stack size is 8MB (you can use ulimit -a to find it out).
  2. Max Virtual memory that a given OS variant supports. Linux Kernel 2.4 supports a memory address space of 2 GB. with Kernel 2.6 , I a bit bigger (3GB )
  3. [1] shows the calculations for the max number of threads per given Max VM Supported. For 2.4 it turns out to be about 255 threads. for 2.6 the number is a bit larger.
  4. What kindda kernel scheduler you have . Comparing Linux 2.4 kernel scheduler with 2.6 , the later gives you a O(1) scheduling with no dependence upon number of tasks existing in a system while first one is more of a O(n). So also the SMP Capabilities of the kernel schedule also play a good role in max number of sustainable threads in a system.

Now you can tune your stack size to incorporate more threads but then you have to take into account the overheads of thread management(creation/destruction and scheduling). You can enforce CPU Affinity to a given process as well as to a given thread to tie them down to specific CPUs to avoid thread migration overheads between the CPUs and avoid cold cash issues.

Note that one can create thousands of threads at his/her wish , but when Linux runs out of VM it just randomly starts killing processes (thus threads). This is to keep the utility profile from being maxed out. (The utility function tells about system wide utility for a given amount of resources. With a constant resources in this case CPU Cycles and Memory, the utility curve flattens out with more and more number of tasks ).

I am sure windows kernel scheduler also does something of this sort to deal with over utilization of the resources

[1] http://adywicaksono.wordpress.com/2007/07/10/i-can-not-create-more-than-255-threads-on-linux-what-is-the-solutions/


As many threads as the CPU cores is what I've heard very often.


One thing to consider is how many cores exist on the machine that will be executing the code. That represents a hard limit on how many threads can be proceeding at any given time. However, if, as in your case, threads are expected to be frequently waiting for a database to execute a query, you will probably want to tune your threads based on how many concurrent queries the database can process.


I've written a number of heavily multi-threaded apps. I generally allow the number of potential threads to be specified by a configuration file. When I've tuned for specific customers, I've set the number high enough that my utilization of the all the CPU cores was pretty high, but not so high that I ran into memory problems (these were 32-bit operating systems at the time).

Put differently, once you reach some bottleneck be it CPU, database throughput, disk throughput, etc, adding more threads won't increase the overall performance. But until you hit that point, add more threads!

Note that this assumes the system(s) in question are dedicated to your app, and you don't have to play nicely (avoid starving) other apps.


One thing you should keep in mind is that python (at least the C based version) uses what's called a global interpreter lock that can have a huge impact on performance on mult-core machines.

If you really need the most out of multithreaded python, you might want to consider using Jython or something.


If your threads are performing any kind of resource-intensive work (CPU/Disk) then you'll rarely see benefits beyond one or two, and too many will kill performance very quickly.

The 'best-case' is that your later threads will stall while the first ones complete, or some will have low-overhead blocks on resources with low contention. Worst-case is that you start thrashing the cache/disk/network and your overall throughput drops through the floor.

A good solution is to place requests in a pool that are then dispatched to worker threads from a thread-pool (and yes, avoiding continuous thread creation/destruction is a great first step).

The number of active threads in this pool can then be tweaked and scaled based on the findings of your profiling, the hardware you are running on, and other things that may be occurring on the machine.


One thing you should keep in mind is that python (at least the C based version) uses what's called a global interpreter lock that can have a huge impact on performance on mult-core machines.

If you really need the most out of multithreaded python, you might want to consider using Jython or something.


One thing to consider is how many cores exist on the machine that will be executing the code. That represents a hard limit on how many threads can be proceeding at any given time. However, if, as in your case, threads are expected to be frequently waiting for a database to execute a query, you will probably want to tune your threads based on how many concurrent queries the database can process.


ryeguy, I am currently developing a similar application and my threads number is set to 15. Unfortunately if I increase it at 20, it crashes. So, yes, I think the best way to handle this is to measure whether or not your current configuration allows more or less than a number X of threads.


I've written a number of heavily multi-threaded apps. I generally allow the number of potential threads to be specified by a configuration file. When I've tuned for specific customers, I've set the number high enough that my utilization of the all the CPU cores was pretty high, but not so high that I ran into memory problems (these were 32-bit operating systems at the time).

Put differently, once you reach some bottleneck be it CPU, database throughput, disk throughput, etc, adding more threads won't increase the overall performance. But until you hit that point, add more threads!

Note that this assumes the system(s) in question are dedicated to your app, and you don't have to play nicely (avoid starving) other apps.


ryeguy, I am currently developing a similar application and my threads number is set to 15. Unfortunately if I increase it at 20, it crashes. So, yes, I think the best way to handle this is to measure whether or not your current configuration allows more or less than a number X of threads.


As Pax rightly said, measure, don't guess. That what I did for DNSwitness and the results were suprising: the ideal number of threads was much higher than I thought, something like 15,000 threads to get the fastest results.

Of course, it depends on many things, that's why you must measure yourself.

Complete measures (in French only) in Combien de fils d'exécution ?.


As many threads as the CPU cores is what I've heard very often.


In most cases you should allow the thread pool to handle this. If you post some code or give more details it might be easier to see if there is some reason the default behavior of the thread pool would not be best.

You can find more information on how this should work here: http://en.wikipedia.org/wiki/Thread_pool_pattern


Examples related to multithreading

How can compare-and-swap be used for a wait-free mutual exclusion for any shared data structure? Waiting until the task finishes What is the difference between Task.Run() and Task.Factory.StartNew() Why is setState in reactjs Async instead of Sync? What exactly is std::atomic? Calling async method on button click WAITING at sun.misc.Unsafe.park(Native Method) How to use background thread in swift? What is the use of static synchronized method in java? Locking pattern for proper use of .NET MemoryCache

Examples related to performance

Why is 2 * (i * i) faster than 2 * i * i in Java? What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism? How to check if a key exists in Json Object and get its value Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly? Most efficient way to map function over numpy array The most efficient way to remove first N elements in a list? Fastest way to get the first n elements of a List into an Array Why is "1000000000000000 in range(1000000000000001)" so fast in Python 3? pandas loc vs. iloc vs. at vs. iat? Android Recyclerview vs ListView with Viewholder

Examples related to threadpool

Thread pooling in C++11 If my interface must return Task what is the best way to have a no-operation implementation? How can I shutdown Spring task executor/scheduler pools before all other beans in the web app are destroyed? How to increase number of threads in tomcat thread pool? Naming threads and thread-pools of ExecutorService How to get thread id from a thread pool? ExecutorService, how to wait for all tasks to finish How many threads is too many?