Crab & Developer
Hey Crab, I’ve been working on optimizing a radix sort for large datasets and I’d love to hear your thoughts on the most efficient approach for fixed‑width integer keys.
Nice project: radix sort can beat quicksort on big data if you keep the passes tight. For fixed‑width integers, use a fixed number of passes equal to the word size divided by the digit width. Keep the digit width a power of two so each digit comes out with a mask and a shift instead of division and modulo. Use a counting array of 2^digit_width slots; if that’s too big, halve the digit width and take two narrower passes instead. Pre‑allocate the count and output buffers once so the passes don’t pay for reallocations. Finally, add a small‑bucket shortcut: in an MSD‑style recursion, when a bucket holds only a handful of elements, finish it with insertion sort to cut the constant factor. That keeps the working set cache‑friendly and the overhead low.
Great points, Crab. I’ll profile the bit‑masking path first; division is a killer on modern CPUs. For the small‑bucket shortcut, I’ll make sure the threshold is tuned: 32 elements is usually a safe cutoff, but I’ll try 64 to see whether the longer runs play nicer with cache lines. I’ll also double‑check that the count array stays in L1; if it spills, the whole thing slows down. Then I’ll run a benchmark against a well‑tuned quicksort and see. Thanks for the roadmap.
Sounds solid. Keep the count array at 256 or 512 slots (512 four‑byte counters is only 2 KB) so it stays comfortably in L1. For the 64‑element threshold, monitor branch mispredictions; sometimes a smaller cutoff gives smoother pipelining. Good luck with the benchmark; the real win shows up when the data size dwarfs the overhead of the extra passes. Keep the profiling tight.
Nice, I’ll lock the count array at 512 to be safe. I’ll add a small profiler that tracks branch mispredictions for the 64‑item cutoff; if the pipeline stalls, I’ll drop it to 32. And I’ll set up a quick script to push the data size up to 10 GB so the extra passes don’t dominate. Thanks for the heads‑up.
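A minimal sketch of such a benchmark harness, with `std::sort` standing in for the routine under test (the radix implementation would be swapped in at the marked call site); `make_keys` and `time_sort` are illustrative names, and the sketch runs at a small n rather than the full 10 GB:

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Fill a buffer with reproducible random 32-bit keys.
std::vector<uint32_t> make_keys(std::size_t n, uint64_t seed = 42) {
    std::mt19937_64 rng(seed);
    std::vector<uint32_t> keys(n);
    for (auto& k : keys) k = static_cast<uint32_t>(rng());
    return keys;
}

// Time one sort over the buffer; returns elapsed seconds.
double time_sort(std::vector<uint32_t>& keys) {
    auto t0 = std::chrono::steady_clock::now();
    std::sort(keys.begin(), keys.end());  // <-- swap in the radix sort here
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Scaling `n` toward 2.5 billion keys covers the 10 GB target for 4‑byte keys; the fixed seed keeps both contenders sorting identical data.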
Looks like a plan. Just watch that the 512‑slot count array doesn’t push the working set into L2 on the slowest core. If the mispredict counts rise, revert to 32, and maybe keep a small lookup table to avoid the branch entirely. Good luck with the 10 GB run.
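One way to read the lookup‑table idea is to index a tiny dispatch table with the cutoff comparison, so the hot path does an indexed call instead of a conditional jump; the handlers below are stub counters standing in for the real insertion sort and radix pass, and all the names are illustrative:

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCutoff = 32;

static int small_calls = 0;  // stands in for the insertion-sort path
static int large_calls = 0;  // stands in for the next radix pass

static void handle_small(uint32_t*, std::size_t) { ++small_calls; }
static void handle_large(uint32_t*, std::size_t) { ++large_calls; }

using Handler = void (*)(uint32_t*, std::size_t);
static const Handler kDispatch[2] = { handle_large, handle_small };

// (n <= kCutoff) evaluates to 0 or 1: no conditional jump,
// just an indexed call through the two-entry table.
inline void dispatch_bucket(uint32_t* a, std::size_t n) {
    kDispatch[n <= kCutoff](a, n);
}
```

Whether this beats a well‑predicted branch is exactly what the misprediction counters should decide; indirect calls carry their own prediction costs.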
Will check the L2 pressure first, then if the branch mispreds climb I’ll roll back to 32 and add a tiny table lookup. 10 GB run next.