As you run your spark app on top of HDFS, according to Sandy Ryza
I’ve noticed that the HDFS client has trouble with tons of concurrent threads. A rough guess is that at most five tasks per executor can achieve full write throughput, so it’s good to keep the number of cores per executor below that number.
So I believe that your first configuration is slower than third one is because of bad HDFS I/O throughput