For more performance: A simple change is observing that after n = 3n+1, n will be even, so you can divide by 2 immediately. And n won't be 1, so you don't need to test for it. So you could save a few if statements and write:
while (n % 2 == 0) n /= 2;
if (n > 1) for (;;) {
n = (3*n + 1) / 2;
if (n % 2 == 0) {
do n /= 2; while (n % 2 == 0);
if (n == 1) break;
}
}
Here's a big win: If you look at the lowest 8 bits of n, all the steps until you divided by 2 eight times are completely determined by those eight bits. For example, if the last eight bits are 0x01, that is in binary your number is ???? 0000 0001 then the next steps are:
3n+1 -> ???? 0000 0100
/ 2 -> ???? ?000 0010
/ 2 -> ???? ??00 0001
3n+1 -> ???? ??00 0100
/ 2 -> ???? ???0 0010
/ 2 -> ???? ???? 0001
3n+1 -> ???? ???? 0100
/ 2 -> ???? ???? ?010
/ 2 -> ???? ???? ??01
3n+1 -> ???? ???? ??00
/ 2 -> ???? ???? ???0
/ 2 -> ???? ???? ????
So all these steps can be predicted, and 256k + 1 is replaced with 81k + 1. Something similar will happen for all combinations. So you can make a loop with a big switch statement:
k = n / 256;
m = n % 256;
switch (m) {
case 0: n = 1 * k + 0; break;
case 1: n = 81 * k + 1; break;
case 2: n = 81 * k + 1; break;
...
case 155: n = 729 * k + 425; break;
...
}
Run the loop until n = 128, because at that point n could become 1 with fewer than eight divisions by 2, and doing eight or more steps at a time would make you miss the point where you reach 1 for the first time. Then continue the "normal" loop - or have a table prepared that tells you how many more steps are need to reach 1.
PS. I strongly suspect Peter Cordes' suggestion would make it even faster. There will be no conditional branches at all except one, and that one will be predicted correctly except when the loop actually ends. So the code would be something like
static const unsigned int multipliers [256] = { ... }
static const unsigned int adders [256] = { ... }
while (n > 128) {
size_t lastBits = n % 256;
n = (n >> 8) * multipliers [lastBits] + adders [lastBits];
}
In practice, you would measure whether processing the last 9, 10, 11, 12 bits of n at a time would be faster. For each bit, the number of entries in the table would double, and I excect a slowdown when the tables don't fit into L1 cache anymore.
PPS. If you need the number of operations: In each iteration we do exactly eight divisions by two, and a variable number of (3n + 1) operations, so an obvious method to count the operations would be another array. But we can actually calculate the number of steps (based on number of iterations of the loop).
We could redefine the problem slightly: Replace n with (3n + 1) / 2 if odd, and replace n with n / 2 if even. Then every iteration will do exactly 8 steps, but you could consider that cheating :-) So assume there were r operations n <- 3n+1 and s operations n <- n/2. The result will be quite exactly n' = n * 3^r / 2^s, because n <- 3n+1 means n <- 3n * (1 + 1/3n). Taking the logarithm we find r = (s + log2 (n' / n)) / log2 (3).
If we do the loop until n = 1,000,000 and have a precomputed table how many iterations are needed from any start point n = 1,000,000 then calculating r as above, rounded to the nearest integer, will give the right result unless s is truly large.