Mark & LayerCrafter
Stumbled on a race condition in our legacy API that only triggers on a specific request pattern. Think you’d be up for digging into it?
Sure, but I'm not a quick-fix person. Send me the exact request sequence, the code that handles it, and your current locking scheme. Once we can step through the shared-state access, we'll apply the proper synchronization and re-run the test. Keep the logs granular so we catch the exact moment the race slips in. And if it magically stops reproducing the moment we add logging, well, that's races for you.
Here's the minimal path that reproduces it (throwaway driver sketch below the list):
1. GET /api/status (first read)
2. POST /api/start (creates a job and stores a UUID)
3. GET /api/job/<uuid> (second read, expects the job to exist)
4. POST /api/finish/<uuid> (sets status to finished)
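In case it helps, this is the driver I use to fire that exact sequence. It assumes the API is listening on localhost:8080 and that /api/start returns the job JSON including its ID; tweak as needed.

package main

// Throwaway driver for the repro sequence above. Assumes the API is on
// localhost:8080 and that POST /api/start returns the job JSON with its ID.

import (
    "encoding/json"
    "fmt"
    "net/http"
)

func main() {
    base := "http://localhost:8080" // assumed dev address

    // 1. GET /api/status (first read)
    if resp, err := http.Get(base + "/api/status"); err == nil {
        resp.Body.Close()
    }

    // 2. POST /api/start (creates a job, returns the UUID in the body)
    resp, err := http.Post(base+"/api/start", "application/json", nil)
    if err != nil {
        panic(err)
    }
    var job struct{ ID string }
    json.NewDecoder(resp.Body).Decode(&job)
    resp.Body.Close()

    // 3. GET /api/job/<uuid> (second read, where the crash shows up)
    if resp, err := http.Get(base + "/api/job/" + job.ID); err == nil {
        fmt.Println("job read:", resp.Status)
        resp.Body.Close()
    }

    // 4. POST /api/finish/<uuid> (sets status to finished)
    if resp, err := http.Post(base+"/api/finish/"+job.ID, "application/json", nil); err == nil {
        resp.Body.Close()
    }
}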
The handler is in job.go:
func handleStart(w http.ResponseWriter, r *http.Request) {
    id := uuid.New().String()
    job := &Job{ID: id, State: "running"}
    mu.Lock()
    jobs[id] = job // register the job under the write lock
    mu.Unlock()
    json.NewEncoder(w).Encode(job) // respond after the lock is released
}
func handleStatus(w http.ResponseWriter, r *http.Request) {
    mu.Lock()
    defer mu.Unlock()
    // read jobs map
}
The lock `mu` is a simple sync.Mutex. The race shows up when the GET /api/job/<uuid> arrives while handleStart is still inserting the entry: the lookup misses, returns a nil *Job, and the handler dereferences it without checking, which is your panic. The lock alone can't make the entry appear any sooner, so the real guard is a nil check on the lookup; on top of that, switch to sync.RWMutex and take RLock for the reads to cut contention between the status and job handlers. Keep your logs inside the lock and print the timestamp and goroutine ID so you can see exactly when the map is accessed. That's all you need to see the glitch and fix it.
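Concretely, something like this for the read side. It's a sketch, not a drop-in: I'm assuming jobs stays a package-level map[string]*Job, mu just changes type to sync.RWMutex, and the goid helper is a debug-only hack (imports for log, runtime, strconv, strings, and sync omitted like the rest of this thread).

var (
    mu   sync.RWMutex           // was sync.Mutex; readers now share the lock
    jobs = make(map[string]*Job)
)

func handleStatus(w http.ResponseWriter, r *http.Request) {
    mu.RLock()
    defer mu.RUnlock()
    // Log inside the lock so the timestamp marks the actual map access.
    log.Printf("g%d status read, %d jobs", goid(), len(jobs))
    // ... encode the status response from the jobs map ...
}

// goid parses the goroutine ID out of the stack header ("goroutine 42 [running]:").
// Debug-only helper; don't lean on goroutine IDs in real code.
func goid() int {
    buf := make([]byte, 64)
    n := runtime.Stack(buf, false)
    id, _ := strconv.Atoi(strings.Fields(string(buf[:n]))[1])
    return id
}

log.Printf already prefixes the timestamp, so the goroutine ID is the only extra bit the message needs.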
Use an RWMutex with RLock for the status handler and Lock for writes, then guard the job lookup with a nil check so a missing entry gets handled instead of dereferenced. Add a small log right after the map lookup to confirm the timing, and validate the job ID before writing anything to the response. That's all you need to nail the race.
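For the job read itself, roughly this shape. A sketch only, since I haven't seen that handler: I'm assuming the ID comes off the URL path and that you're on github.com/google/uuid for Parse.

func handleJob(w http.ResponseWriter, r *http.Request) {
    // Validate the ID before it touches the map or the response.
    id := strings.TrimPrefix(r.URL.Path, "/api/job/")
    if _, err := uuid.Parse(id); err != nil {
        http.Error(w, "invalid job id", http.StatusBadRequest)
        return
    }

    mu.RLock()
    job := jobs[id]
    mu.RUnlock()
    log.Printf("g%d job lookup %s found=%t", goid(), id, job != nil)

    // Nil guard: a missing entry becomes a 404 instead of a dereference.
    if job == nil {
        http.NotFound(w, r)
        return
    }
    json.NewEncoder(w).Encode(job)
}

Reads share RLock so the status poll and the job lookup don't serialize behind each other; only the start and finish handlers need the full write Lock.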
Sounds good. I’ll swap the mutex, add the nil guard, and drop the timestamp after each lookup. Let’s hit run.
Good plan. Just double-check that the nil guard runs before you dereference the job, otherwise you'll still hit the panic. And make sure every handler uses the same RWMutex; if one of them still grabs the old mutex you lose the exclusion and the race comes straight back. Once that's in place, the race should disappear. Let's see the results.
Yeah, I’ll put the nil check right before the dereference and make sure every handler uses the same RWMutex. Once that’s in place we should see the crash disappear and the logs give us the exact timeline. Let's run it and see the clean pass.
Nice, that should close the timing hole. Once you run it, the logs should show no more nil dereference and the sequence should finish without a crash. If it still stumbles, we'll dig deeper into the lock ordering, but odds are the race is gone. Let's fire it up.
Got it. I’ll fire the test, dump the timestamps, and we’ll see if the panic is still a thing. If it’s gone, we’re done; if not, we’ll hunt the lock order again. Let's run it.
Run it, and if the panic still shows up, check whether some other goroutine is sitting on the write lock while you read the status; a long-held write lock just widens the window before the job lands in the map. If the timestamps line up, the race should be gone; if not, we'll look for a hidden write that slips past the RWMutex. But most likely the nil guard and the proper lock type will do the trick. Good luck.
Alright, patch applied and test queue cleared. I’m hitting the sequence now and pulling the logs straight from the server output. If a nil dereference shows up, I’ll know a lock slipped. Otherwise it should just finish cleanly. Let's see what the timestamps say.