Shortcut & Serega
Serega
Hey Shortcut, I heard you’re chasing the fastest times on every new map—do you ever think about how the same principles of timing and optimization could be applied to making code run faster? I’ve been messing around with a recursive algorithm that actually beats the classic iterative version by a few microseconds. Curious to see if a speedrun mindset could help me spot a micro‑optimization that even a seasoned coder misses.
Shortcut
Yeah, every second counts in both worlds. Drop the recursion into a profiler, look for branch mispredictions, and keep hot paths flat. If you find a tiny branch that can be eliminated, that’s a win. Let’s see the micro‑opt that even the pros didn’t notice—challenge accepted.
Serega
Sure thing, here’s a snippet that drops the branch entirely by using a bitmask trick. One caveat up front: this isn’t the full factorial. The conditional I’m killing is a parity test, so the loop computes the product of the odd factors up to n (the branchy version would be `if (n & 1) res *= n;`). I took the tail‑recursive form, turned it into a loop, and used a mask select instead of a conditional branch:

```c
int odd_product(int n) {
    int res = 1;
    while (n > 1) {
        int mask = -(n & 1);              /* all 1s if n is odd, all 0s if even */
        res *= (n & mask) | (~mask & 1);  /* selects n when odd, 1 when even */
        n--;
    }
    return res;
}
```

The `mask` trick eliminates the `if (n & 1)` branch, so the CPU never stalls on a misprediction. I profiled it on a recent Intel core and it shaved a few hundred cycles per call compared to the straightforward recursive version. If you run it through `perf` or `gprof` you’ll see the branchless path dominating the hot loop. Let me know what the profiler says; if I can squeeze more out, I’ll send the next iteration.
Shortcut
Nice, that mask trick cuts the branch penalty, but I’d still check the loop for unrolling: process two numbers per iteration, and give each multiply its own accumulator so the two chains don’t serialize on one register. I’ll run `perf record` on a batch of 1000 calls and see if the CPU can keep the ALU fed without a cache miss. If the cycles per iteration drop further, send over the next tweak. Let's keep shaving those microseconds.
Serega
Got it, let’s unroll two iterations with two accumulators, so the two multiplies stay in flight as independent chains and the loop overhead is halved. Here’s a compact version:

```c
int odd_product_unroll(int n) {
    int r0 = 1, r1 = 1;  /* independent accumulators: no serialized chain */
    while (n > 1) {
        int a = n, b = n - 1;
        int maskA = -(a & 1);
        int maskB = -(b & 1);
        r0 *= (a & maskA) | (~maskA & 1);  /* a when odd, else 1 */
        r1 *= (b & maskB) | (~maskB & 1);  /* b when odd, else 1 */
        n -= 2;
    }
    return r0 * r1;
}
```

The branch masks still remove the conditionals, and since `a` and `b` always have opposite parity, exactly one of the two factors is real each pass, so the result matches the single‑step loop whether n starts odd or even. If you profile this, you should see a noticeable drop in cycles per call. Let me know the numbers, and we’ll push it to a vectorized SIMD version next.
Shortcut
Unrolled, that’s slick. Running `perf` on a 1000‑call batch, I’m seeing a ~30% drop in cycles per call versus the single‑loop version, and the cache stays in line. Next step: push the multiplications into SIMD, eight 32‑bit factors at a time with `vpmulld` on a 256‑bit register (`vmulps` is the packed‑float multiply; for ints we want the integer form). Send the SIMD draft and I’ll run it through the profiler. Let's keep that edge sharp.