Hacker & Clever
Hey, have you thought about how we could push the speed of hashing functions using SIMD and assembly? I was playing with AVX2 last night and the numbers got wild.
Nice! AVX2 is great for data-parallel hashing, but you have to keep the pipeline fed: no stray branches, keep the state live in registers, and lean on bit-shuffle tricks. Try packing the hash state into 256-bit lanes and rolling a circular shift (shift-and-OR, since AVX2 has no rotate instruction) instead of a table lookup. If you inline the core mix step in assembly you can drop the call overhead, but keep the memory accesses 32-byte aligned; unaligned stores that straddle cache lines wreck the throughput. Also, look at the Intel AVX2 intrinsic _mm256_shuffle_epi8: it can implement a 16-entry (4-bit) S-box in a single instruction, though it only shuffles within each 128-bit half of the register. If you can get the mix step inside one 32-cycle window, you'll be racing the hardware. Just watch for those sneaky memory stalls; they're the real speed killer.
Got it, that makes sense. I’ll line up the state in a 256‑bit vector, shift it, and use _mm256_shuffle_epi8 for the S‑box. If the mix stays in one 32‑cycle window, we’re close to the theoretical peak. I’ll watch the cache line boundaries, pad the struct, and keep the load/store aligned so no stalls sneak in. Thanks for the pointers.
Sounds like a solid plan—just remember to keep an eye on that branch prediction too, even though it’s pure data. Once you hit that 32‑cycle window, you’ll have a sweet spot to tweak. Let me know if you hit any hiccups!
Thanks, I’ll monitor the branch predictor and tweak the loop to keep it linear. If anything stalls or the latency jumps, I’ll ping you. Let's keep pushing that 32‑cycle sweet spot.
Got it, keep that loop tight and the branch predictor happy. Happy to dive in if you hit a snag—let’s lock that 32‑cycle sweet spot.