On a rather unrelated note: more performance hacks!
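When traversing the sequence, we can only get 3 possible cases in the 2-neighborhood of the current element N (shown first):
(1) [even] [odd]
(2) [odd] [even]
(3) [even] [even]
The fourth combination, [odd] [odd], cannot occur, because 3N + 1 is even whenever N is odd.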
To leap past these 2 elements means to compute (N >> 1) + N + 1, ((N << 1) + N + 1) >> 1 and N >> 2, respectively.
Let's prove that for both cases (1) and (2) it is possible to use the first formula, (N >> 1) + N + 1.
Case (1) is obvious. Case (2) implies (N & 1) == 1, so if we assume (without loss of generality) that N is 2 bits long and its bits are ba from most- to least-significant, then a = 1, and the following holds:
(N << 1) + N + 1:     (N >> 1) + N + 1:

        b10                    b1
         b1                     b
       +  1                   + 1
       ----                   ---
       bBb0                   bBb
where B = !b. Right-shifting the first result gives us exactly what we want.
Q.E.D.: (N & 1) == 1 ⇒ (N >> 1) + N + 1 == ((N << 1) + N + 1) >> 1.
As proven, we can traverse the sequence 2 elements at a time, using a single ternary operation. Another 2× time reduction.
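If you'd rather not take the bit-level argument on faith, a brute-force check is easy to write. The following is a minimal sketch, not part of the original answer: the helper names `collatz_step`, `collatz_step2` and `count_mismatches` are made up for illustration. It compares the one-ternary shortcut against two ordinary steps, and checks the odd-case identity directly.

```c
#include <assert.h>
#include <stdint.h>

/* One ordinary Collatz step: 3n + 1 for odd n, n / 2 for even n. */
static uint64_t collatz_step(uint64_t n) {
    return (n & 1) ? 3 * n + 1 : n >> 1;
}

/* Two steps at once, using the single ternary from the text. */
static uint64_t collatz_step2(uint64_t n) {
    return (n & 3) ? (n >> 1) + n + 1 : n >> 2;
}

/* Counts the n in [1, limit) for which the shortcut disagrees with two
   ordinary steps, or for which the odd-case identity fails.
   (Hypothetical helper, for illustration only.) */
static uint64_t count_mismatches(uint64_t limit) {
    /* Keep 3 * n + 1 well inside 64 bits. */
    assert(limit <= UINT64_MAX / 4);
    uint64_t bad = 0;
    for (uint64_t n = 1; n < limit; n++) {
        if (collatz_step2(n) != collatz_step(collatz_step(n)))
            bad++;
        if ((n & 1) && (n >> 1) + n + 1 != ((n << 1) + n + 1) >> 1)
            bad++;
    }
    return bad;
}
```

A call such as `count_mismatches(1u << 16)` should come back as 0 if the proof holds over that range.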
The resulting algorithm looks like this:
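#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Returns the tested start value with the longest path and stores
   that path's step count in *path. */
uint64_t sequence(uint64_t size, uint64_t *path) {
    uint64_t n, i, c, maxi = 0, maxc = 0;
    /* Only odd start values are tested, counting down from (size - 1) | 1. */
    for (n = i = (size - 1) | 1; i > 2; n = i -= 2) {
        c = 2;
        /* Advance two elements per iteration. */
        while ((n = ((n & 3) ? (n >> 1) + n + 1 : (n >> 2))) > 2)
            c += 2;
        if (n == 2)
            c++;
        if (c > maxc) {
            maxi = i;
            maxc = c;
        }
    }
    *path = maxc;
    return maxi;
}

int main(void) {
    uint64_t maxi, maxc;
    maxi = sequence(1000000, &maxc);
    /* PRIu64 is the portable printf format for uint64_t. */
    printf("%" PRIu64 ", %" PRIu64 "\n", maxi, maxc);
    return 0;
}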
Here we compare n > 2 because the process may stop at 2 instead of 1 if the total length of the sequence is odd.
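To see the `n > 2` stop condition and the final `c++` in isolation, here is a small sketch that pulls the inner loop out into its own function and compares it with a plain one-step-at-a-time counter. The names `steps_naive`, `steps_paired` and `count_disagreements` are invented for this example. It assumes start values greater than 2, the same restriction the outer loop above enforces with `i > 2`.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: number of ordinary Collatz steps needed to reach 1. */
static uint64_t steps_naive(uint64_t n) {
    uint64_t c = 0;
    while (n != 1) {
        n = (n & 1) ? 3 * n + 1 : n >> 1;
        c++;
    }
    return c;
}

/* Same count, two steps per iteration, mirroring the inner loop of
   sequence(). Only meaningful for n > 2. */
static uint64_t steps_paired(uint64_t n) {
    assert(n > 2);
    uint64_t c = 2;
    while ((n = (n & 3) ? (n >> 1) + n + 1 : n >> 2) > 2)
        c += 2;
    if (n == 2)   /* odd total: one last halving step is still owed */
        c++;
    return c;
}

/* Counts the n in [3, limit) where the two counters disagree.
   (Hypothetical helper, for illustration only.) */
static uint64_t count_disagreements(uint64_t limit) {
    uint64_t bad = 0;
    for (uint64_t n = 3; n < limit; n++)
        if (steps_paired(n) != steps_naive(n))
            bad++;
    return bad;
}
```

For instance, starting from 3 the path is 3, 10, 5, 16, 8, 4, 2, 1: the paired loop lands on 2 after six steps, and the extra `c++` accounts for the seventh.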
Let's translate this into assembly!
MOV RCX, 1000000;
DEC RCX;
AND RCX, -2;
XOR RAX, RAX;
MOV RBX, RAX;
@main:
XOR RSI, RSI;
LEA RDI, [RCX + 1];
@loop:
ADD RSI, 2;
LEA RDX, [RDI + RDI*2 + 2];
SHR RDX, 1;
SHRD RDI, RDI, 2; rotates RDI right by 2. ROR RDI, 2 would give the same value, but ROR does not update SF, which CMOVS below needs.
CMOVL RDI, RDX; Note that SHRD leaves OF undefined for counts > 1, so this doesn't work on all CPUs.
CMOVS RDI, RDX;
CMP RDI, 2;
JA @loop;
LEA RDX, [RSI + 1];
CMOVE RSI, RDX;
CMP RAX, RSI;
CMOVB RAX, RSI;
CMOVB RBX, RCX;
SUB RCX, 2;
JA @main;
MOV RDI, RCX;
ADD RCX, 10;
PUSH RDI;
PUSH RCX;
@itoa:
XOR RDX, RDX;
DIV RCX;
ADD RDX, '0';
PUSH RDX;
TEST RAX, RAX;
JNE @itoa;
PUSH RCX;
LEA RAX, [RBX + 1];
TEST RBX, RBX;
MOV RBX, RDI;
JNE @itoa;
POP RCX;
INC RDI;
MOV RDX, RDI;
@outp:
MOV RSI, RSP;
MOV RAX, RDI;
SYSCALL;
POP RAX;
TEST RAX, RAX;
JNE @outp;
LEA RAX, [RDI + 59];
DEC RDI;
SYSCALL;
Use these commands to compile:
nasm -f elf64 file.asm
ld -o file file.o
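For readers who find the flag tricks in `@loop` hard to follow, here is a rough C model of one pass through that loop. It is a sketch of my reading of the listing, not a translation the author provided: `ror2`, `asm_step2` and `count_model_mismatches` are made-up names, and the two conditional moves are replaced with explicit bit tests, which sidesteps the undefined-OF issue noted in the listing.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate right by 2: the value SHRD RDI, RDI, 2 leaves in RDI. */
static uint64_t ror2(uint64_t n) {
    return (n >> 2) | (n << 62);
}

/* Model of one pass through @loop, with flags replaced by bit tests. */
static uint64_t asm_step2(uint64_t n) {
    uint64_t cand = (n + n * 2 + 2) >> 1;  /* LEA RDX, [RDI + RDI*2 + 2]; SHR RDX, 1 */
    uint64_t rot = ror2(n);                /* SHRD RDI, RDI, 2 */
    int bit1 = (int)(rot >> 63);           /* old bit 1 is now the sign bit (SF) */
    int bit0 = (int)((rot >> 62) & 1);     /* old bit 0 sits just below it */
    /* If either low bit was set, take the 3n/2-style candidate;
       otherwise rot is simply n >> 2. */
    return (bit0 | bit1) ? cand : rot;
}

/* Counts the n in [1, limit) where the model and the C ternary differ.
   (Hypothetical helper, for illustration only.) */
static uint64_t count_model_mismatches(uint64_t limit) {
    assert(limit <= UINT64_MAX / 4);       /* keep 3 * n + 2 inside 64 bits */
    uint64_t bad = 0;
    for (uint64_t n = 1; n < limit; n++) {
        uint64_t want = (n & 3) ? (n >> 1) + n + 1 : n >> 2;
        if (asm_step2(n) != want)
            bad++;
    }
    return bad;
}
```

The point of the model is that `(3n + 2) >> 1` equals `(n >> 1) + n + 1`, and that rotating by 2 both produces `n >> 2` when the low two bits are clear and exposes those two bits at the top of the register, where a sign test can see one of them.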
See the C and an improved/bugfixed version of the asm by Peter Cordes on Godbolt. (editor's note: Sorry for putting my stuff in your answer, but my answer hit the 30k char limit from Godbolt links + text!)