BlondeTechie & Zeyna
Hey, I was thinking about building a tiny, ultra‑fast in‑memory key‑value store that runs on a single thread with latency under a millisecond—any thoughts on how to squeeze that out?
You’ll want to keep everything in one contiguous block of memory so lookups never chase pointers across the heap. Use an open‑addressing hash table with a power‑of‑two size and a cheap, fast hash like Murmur or FNV‑1a, and make the key and value fixed‑size so you can index straight into the array. Pre‑allocate the whole table up front and never resize or rehash – that removes latency spikes and keeps access patterns predictable. Store the keys and values in tightly packed structs with no pointers inside them; that means you can walk the table with a simple loop and the hardware prefetcher can pull in the next slot automatically. Keep the load factor low, say 0.5, so probe sequences stay short. Finally, keep all the state in a single struct small enough that the hot path stays resident in L1; with a single thread you can avoid atomics entirely and get lookups well under a millisecond.
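A minimal sketch of that layout, assuming fixed 16‑byte keys and values and a pre‑allocated power‑of‑two table with FNV‑1a and linear probing (all names here are illustrative, not from an existing library):

```c
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 1024          /* power of two, pre-allocated once */
#define KEY_LEN 16
#define VAL_LEN 16

typedef struct {
    uint8_t used;
    char key[KEY_LEN];
    char val[VAL_LEN];
} slot_t;

static slot_t table[TABLE_SIZE]; /* one contiguous block, zero-initialized */

/* FNV-1a: cheap and good enough for short fixed-size keys */
static uint64_t fnv1a(const char *key) {
    uint64_t h = 14695981039346656037ULL;
    for (int i = 0; i < KEY_LEN; i++) {
        h ^= (uint8_t)key[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* returns 1 on success, 0 if the table is full */
int kv_put(const char *key, const char *val) {
    uint64_t i = fnv1a(key) & (TABLE_SIZE - 1);  /* mask, not modulo */
    for (int probes = 0; probes < TABLE_SIZE; probes++) {
        slot_t *s = &table[i];
        if (!s->used || memcmp(s->key, key, KEY_LEN) == 0) {
            s->used = 1;
            memcpy(s->key, key, KEY_LEN);
            memcpy(s->val, val, VAL_LEN);
            return 1;
        }
        i = (i + 1) & (TABLE_SIZE - 1);          /* linear probe */
    }
    return 0;
}

/* returns a pointer to the value, or NULL if absent */
const char *kv_get(const char *key) {
    uint64_t i = fnv1a(key) & (TABLE_SIZE - 1);
    for (int probes = 0; probes < TABLE_SIZE; probes++) {
        slot_t *s = &table[i];
        if (!s->used) return 0;                  /* empty slot: not present */
        if (memcmp(s->key, key, KEY_LEN) == 0) return s->val;
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return 0;
}
```

Note the power‑of‑two size lets the modulo become a single AND, and because slots are fixed‑size and pointer‑free, a probe walk is just sequential array reads.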
Nice plan, but if you ever need to shrink the table you’ll have to rebuild the whole thing, and that’s a huge cost. Also, keep an eye on cache line alignment – a key that straddles two cache lines costs an extra miss on every lookup. Maybe try double‑hashing instead of pure linear probing to reduce clustering.
Yeah, a forced rehash is exactly the latency spike you want to avoid. A good trick is a two‑stage design: keep a main table that never resizes and a smaller overflow table that you rebuild only when it gets full, so you never pay for a full rebuild on the hot path. Double‑hashing is a solid choice; derive the second hash from the first with a small odd multiplier to keep it fast – an odd step is coprime with a power‑of‑two table size, so the probe sequence still visits every slot. For alignment, pad the key struct to 64 bytes so each entry lands on its own cache line, or mark the struct with __attribute__((aligned(64))) if your compiler supports it. That should keep the miss penalty to a minimum.
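Something like this, as a sketch of the double‑hash probe plus the aligned entry (TABLE_SIZE, entry_t, and probe_step are made‑up names for illustration):

```c
#include <stdint.h>

#define TABLE_SIZE 1024  /* power of two */

/* Align each entry to its own 64-byte cache line so it never straddles two. */
typedef struct __attribute__((aligned(64))) {
    uint64_t key;
    uint64_t val;
    uint8_t  used;
} entry_t;

static entry_t table[TABLE_SIZE];

/* Derive the probe step from the high bits of the first hash. Forcing the
   low bit with | 1 makes the step odd, hence coprime with the power-of-two
   table size, so the probe sequence covers every slot. */
static uint64_t probe_step(uint64_t h) {
    return (((h >> 32) * 0x9E3779B9ULL) | 1) & (TABLE_SIZE - 1);
}

/* Return the slot index for `key` given its hash, or -1 if the table is full. */
static int find_slot(uint64_t key, uint64_t h) {
    uint64_t i = h & (TABLE_SIZE - 1);
    uint64_t step = probe_step(h);
    for (int n = 0; n < TABLE_SIZE; n++) {
        if (!table[i].used || table[i].key == key)
            return (int)i;
        i = (i + step) & (TABLE_SIZE - 1);      /* double-hash probe */
    }
    return -1;
}
```

The mask keeps the low bit, so the step stays odd after truncation; that’s the whole coprimality trick in one AND.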
Sounds solid, but the 64‑byte padding is overkill for a typical key; if the key is only 16 bytes you’re wasting three quarters of every entry. Maybe just align the whole struct to 32 bytes instead – it still never straddles a cache line, and it halves the memory. Also, think about how you’ll evict entries from the overflow table: if you just keep appending, you’ll run out of space fast. A small LRU or time‑based eviction could keep the second table from growing out of control.
32 bytes is a good compromise—just make the struct aligned to 32 so it fits one cache line, no waste. For the overflow, a tiny LRU ring buffer will keep the size bounded without a big overhead. Or a simple timestamp per entry and a periodic sweep to drop the oldest when you hit a threshold. That way you keep the second table small and still hit your sub‑millisecond goal.
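The ring‑buffer idea could look like this, as a sketch; note it’s the cheap drop‑oldest (FIFO) approximation of LRU, since true LRU would also reorder entries on access (names are illustrative):

```c
#include <stdint.h>

#define OVERFLOW_CAP 64  /* fixed bound on the second table */

typedef struct {
    uint64_t key;
    uint64_t val;
    uint8_t  used;
} of_entry_t;

static of_entry_t ring[OVERFLOW_CAP];
static uint32_t   head = 0;  /* next slot to overwrite */

/* Insert always succeeds: when the ring is full, the oldest entry is
   silently evicted, so the table can never grow past OVERFLOW_CAP. */
void overflow_put(uint64_t key, uint64_t val) {
    ring[head].key  = key;
    ring[head].val  = val;
    ring[head].used = 1;
    head = (head + 1) % OVERFLOW_CAP;
}

/* Linear scan is fine at this size: 64 small entries span a few cache lines. */
int overflow_get(uint64_t key, uint64_t *val_out) {
    for (uint32_t i = 0; i < OVERFLOW_CAP; i++) {
        if (ring[i].used && ring[i].key == key) {
            *val_out = ring[i].val;
            return 1;
        }
    }
    return 0;
}
```

Because eviction is just an index increment, there’s no sweep at all in this variant; the timestamp‑plus‑sweep scheme only pays off if you need true recency ordering.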
Nice, just remember the sweep itself touches every entry, so a big table will thrash the cache – keep the threshold tight and the sweep fast.