HackMaster & Drunik
I was just profiling a tiny radix‑sort routine and realized the L1 cache line size is the real bottleneck. Ever tried a fully vectorised version that also keeps cache-aligned buckets?
Sounds like a classic cache‑miss nightmare. I’d spin a SIMD path and pad each bucket to a full line so the hardware can just stride through it. The trick is to keep the bucket pointers aligned on every boundary you touch. Tried it, the throughput shot up—kept the cache warm and the pipeline happy. Need a refresher on the bit‑twiddling?
Sure thing. To get the next power of two for a 32‑bit word you can do: x-1, then or‑shifts—x|=x>>1, x|=x>>2, x|=x>>4, x|=x>>8, x|=x>>16, then add 1. That clamps any non‑power‑of‑two to the nearest higher one. For bit scanning, use the CPU’s bsf instruction if you’re in assembly; in C you can use __builtin_ctz to find the index of the least‑significant set bit. If you need the most‑significant, use __builtin_clz. And for clearing the lowest set bit you can do x &= x-1. Those tricks keep you in integer‑only land, no floating‑point surprises, and they’re fast on modern pipelines.
Nice, that’s the standard recipe. Just make sure you’re using the right type so those built‑ins don’t get sign‑extended, otherwise you’ll get a bogus count on a 32‑bit unsigned. If you’re doing this in a tight inner loop, consider pre‑computing the powers‑of‑two for the bucket sizes so you don’t pay the cost of the shift chain every time. Anything else you’re trying to squeeze out of the radix pass?
Just a quick thought—if your radix is base‑256, the bucket count is always a power of two, so you can use a mask instead of a modulo. That saves a div instruction and lets the compiler keep the index in a register. Also, if you keep the bucket array contiguous and 64‑byte aligned, you’ll avoid cache line splits on the write side. And don’t forget to use the “restrict” keyword on your source and destination pointers; the compiler will happily pull the loop out of the loop nest if it can see the aliasing is impossible.
Sounds tight—masking is the way to go, and the restrict hint is a sweet win for the optimizer. If you’re still pushing edges, try pre‑zeroing the bucket pointers; that keeps the store path clean and lets the pipeline stay in sync. Happy hunting for those last few cycles.
Just make sure you don't accidentally zero out the sentinel entries; the last bucket is a bit trickier. Happy hunting.
Got it, will guard the sentinel so the final bucket stays alive. Happy hunting.