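When you run the development server, which is what app.run() starts, you get a single synchronous process, which means at most one request is processed at a time. (Flask 1.0 and later turn on threading in the development server by default, but it is still a single process meant only for development.)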
By putting Gunicorn in front of the app in its default configuration and simply increasing the number of --workers, you essentially get a number of processes (managed by Gunicorn) that each behave like the app.run() development server: 4 workers means 4 concurrent requests. This is because Gunicorn uses its included sync worker type by default.
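As a rough sketch of that default setup, a Gunicorn configuration file (which is itself plain Python) might look like the following. The file name, the bind address, and the myapp:app module path used below are assumptions for illustration only.

```python
# gunicorn.conf.py (hypothetical file name)

# Address Gunicorn listens on; assumed here for illustration.
bind = "127.0.0.1:8000"

# Four worker processes: with the sync worker type, that is
# four requests in flight at once, one per process.
workers = 4

# "sync" is already Gunicorn's default worker type; it is spelled
# out here only to make the behaviour explicit.
worker_class = "sync"
```

Assuming the Flask application object is called app and lives in a module named myapp (both hypothetical), this would be started with gunicorn -c gunicorn.conf.py myapp:app. Passing --workers 4 on the command line should have the same effect as the workers setting in the file.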
It is important to note that Gunicorn also includes asynchronous workers, namely eventlet and gevent (and also tornado, but that seems best used with the Tornado framework). By specifying one of these async workers with the --worker-class flag, you get Gunicorn managing a number of async processes, each of which manages its own concurrency. These processes use coroutines rather than threads. Within each process, still only one thing can happen at a time (one thread), but a request handler can be 'paused' while it waits on something external to finish (think database queries or network I/O), which lets another handler run in the meantime.
This means that if you're using one of Gunicorn's async workers, each worker can handle many more than a single request at a time. Just how many workers is best depends on the nature of your app, its environment, the hardware it runs on, and so on. More details can be found on Gunicorn's design page, and gevent's intro page has notes on how gevent works.
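For a starting point, Gunicorn's design notes suggest (2 x number of cores) + 1 workers. A sketch of an async configuration along those lines is below. The file name is hypothetical, the gevent package is assumed to be installed, and the right numbers for a given app still have to be found by load testing.

```python
# gunicorn.conf.py (hypothetical file name)
import multiprocessing

# Starting point from Gunicorn's design notes: (2 x cores) + 1.
# Treat it as a first guess to tune, not a rule.
workers = multiprocessing.cpu_count() * 2 + 1

# Use gevent-based async workers instead of the default sync type.
# This assumes the gevent package is installed alongside Gunicorn.
worker_class = "gevent"

# Upper bound on simultaneous clients per worker for the async
# worker types; 1000 is Gunicorn's documented default.
worker_connections = 1000
```

The same choice can be made on the command line with --worker-class gevent.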