The above behavior is happening because of Branch prediction.
To understand branch prediction one must first understand Instruction Pipeline:
Any instruction is broken into a sequence of steps so that different steps can be executed concurrently in parallel. This technique is known as instruction pipeline and this is used to increase throughput in modern processors. To understand this better please see this example on Wikipedia.
Generally, modern processors have quite long pipelines, but for ease let's consider these 4 steps only.
4-stage pipeline in general for 2 instructions.
Moving back to the above question let's consider the following instructions:
A) if (data[c] >= 128)
/\
/ \
/ \
true / \ false
/ \
/ \
/ \
/ \
B) sum += data[c]; C) for loop or print().
Without branch prediction, the following would occur:
To execute instruction B or instruction C the processor will have to wait till the instruction A doesn't reach till EX stage in the pipeline, as the decision to go to instruction B or instruction C depends on the result of instruction A. So the pipeline will look like this.
when if condition returns true:
When if condition returns false:
As a result of waiting for the result of instruction A, the total CPU cycles spent in the above case (without branch prediction; for both true and false) is 7.
So what is branch prediction?
Branch predictor will try to guess which way a branch (an if-then-else structure) will go before this is known for sure. It will not wait for the instruction A to reach the EX stage of the pipeline, but it will guess the decision and go to that instruction (B or C in case of our example).
In case of a correct guess, the pipeline looks something like this:
If it is later detected that the guess was wrong then the partially executed instructions are discarded and the pipeline starts over with the correct branch, incurring a delay. The time that is wasted in case of a branch misprediction is equal to the number of stages in the pipeline from the fetch stage to the execute stage. Modern microprocessors tend to have quite long pipelines so that the misprediction delay is between 10 and 20 clock cycles. The longer the pipeline the greater the need for a good branch predictor.
In the OP's code, the first time when the conditional, the branch predictor does not have any information to base up prediction, so the first time it will randomly choose the next instruction. Later in the for loop, it can base the prediction on the history. For an array sorted in ascending order, there are three possibilities:
Let us assume that the predictor will always assume the true branch on the first run.
So in the first case, it will always take the true branch since historically all its predictions are correct. In the 2nd case, initially it will predict wrong, but after a few iterations, it will predict correctly. In the 3rd case, it will initially predict correctly till the elements are less than 128. After which it will fail for some time and the correct itself when it sees branch prediction failure in history.
In all these cases the failure will be too less in number and as a result, only a few times it will need to discard the partially executed instructions and start over with the correct branch, resulting in fewer CPU cycles.
But in case of a random unsorted array, the prediction will need to discard the partially executed instructions and start over with the correct branch most of the time and result in more CPU cycles compared to the sorted array.