Sharing a postmortem of an architecture migration that took me too long to do, in case anyone's running long-running jobs inside HTTP request handlers.

The setup

I run a backend pipeline that does multi-step work: input parsing, several external API calls in sequence, a scoring step, then a synthesis step. End-to-end runtime ranges from 5 to 35 seconds depending on cache state and the number of external sources involved.

For the first few months, I was naive. Request comes in, handler runs the full pipeline, response goes out. Worked fine in dev. Worked fine for the first dozen users.

Where it broke

Two things hit at once.

First, my reverse proxy (Nginx) and my Node runtime had different timeout settings. Node was set to 60s because the pipeline could occasionally hit 35. Nginx was at 30s by default. Cue silent 502s right when a job was about to finish. The user gets an error, the work completes anyway, and you spend a week chasing what looks like a backend bug but is actually a layer mismatch.

Second, when concurrency went up (a batch test with around 50 parallel requests), the runtime started locking up. Connections held open, the event loop choked, new requests timed out. I lost roughly 4% of requests in that batch.

The fix

Moved to a queue-based architecture: BullMQ on top of Redis. The flow now looks like this:

- API receives the request, validates it, drops a job in Redis, and returns a job ID immediately (under 100ms).
- Frontend polls a status endpoint or subscribes via SSE.
- A separate worker process pulls jobs from the queue, runs the pipeline, and writes results back to the database.
- User fetches the final result by job ID.

Same business logic, completely different runtime profile.

What changed

- 502 errors disappeared overnight. Not reduced, gone. The HTTP layer is now decoupled from job duration entirely.
- Concurrency is bounded by worker count, not by HTTP request count. I can scale workers independently. If a job takes 90 seconds, it doesn't block the API.
- Retries became trivial. BullMQ has exponential backoff out of the box. A flaky external API call no longer breaks the user experience; the job just retries.
- Observability got better. Each job has a clear lifecycle (waiting, active, completed, failed) and I can replay failed jobs on demand.

What I should have done from day one

Built it on a queue from the start. The "I'll migrate later when I scale" instinct cost me about three weeks of firefighting. The migration itself took two days. The denial took longer than the work.

If you're running anything where a single user request triggers more than 5 seconds of backend work, especially with external API calls in the chain, decouple it now. The pattern is well understood, the libraries are mature (BullMQ for Node, Celery for Python, RQ for lighter Python use), and you'll thank yourself the first time you hit real load.

The catch

You're trading simplicity for resilience. A queue adds operational surface: Redis to monitor, workers to deploy, DLQs to manage. For a hobby project with 5 users, sync handlers are fine. For anything you'd hate to debug at 2am under load, queues aren't optional.

Happy to answer specifics on the BullMQ config, Nginx tuning, or the SSE side if anyone's mid-migration. Rough sketches of each are at the end of the post.
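On the Nginx tuning side, the core problem is that whichever layer has the shorter timeout wins: Nginx gave up at 30s while Node was still working toward 60s, so the client saw a 502 while the handler kept running. A rough sketch of aligning the two layers, assuming Express behind an Nginx reverse proxy; the 90s/85s values are illustrative, not a recommendation:

```ts
// Nginx side (site config): let the proxy wait at least as long as the app will.
//
//   location /api/ {
//     proxy_pass         http://127.0.0.1:3000;
//     proxy_read_timeout 90s;
//     proxy_send_timeout 90s;
//   }
//
// Node side: set the server timeouts explicitly instead of relying on defaults,
// and keep them just under the proxy's so the app, not the proxy, reports the failure.
import express from "express";
import http from "http";

const app = express();
// ... routes ...

const server = http.createServer(app);
server.setTimeout(85_000);       // per-socket inactivity timeout (ms)
server.requestTimeout = 85_000;  // hard cap on an individual request (ms)
server.listen(3000);
```

Aligning the numbers only papers over the real problem, though; the queue migration is what actually fixed it.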
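For the BullMQ config, the producer half is the small part. A minimal sketch of the validate / enqueue / return-a-job-ID endpoint, assuming Express and a local Redis; the queue name, route, and payload shape are placeholders rather than my exact setup:

```ts
import express from "express";
import { Queue } from "bullmq";

const connection = { host: "127.0.0.1", port: 6379 };

// One queue for the whole pipeline; a separate worker process consumes it.
const pipelineQueue = new Queue("pipeline", { connection });

const app = express();
app.use(express.json());

app.post("/api/jobs", async (req, res) => {
  // Validate up front so bad input never reaches the queue.
  const { input } = req.body ?? {};
  if (typeof input !== "string" || input.length === 0) {
    return res.status(400).json({ error: "input is required" });
  }

  // Enqueue and return immediately; the heavy work happens in the worker.
  const job = await pipelineQueue.add(
    "run-pipeline",
    { input },
    { attempts: 3, backoff: { type: "exponential", delay: 2000 } }
  );

  return res.status(202).json({ jobId: job.id });
});

app.listen(3000);
```

The 202 plus a job ID is the whole HTTP contract now; nothing slow runs on the request path.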

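The worker half, with the same caveat that this is a sketch: runPipeline stands in for the parse / external-calls / scoring / synthesis steps and saveResult for the database write. The retries come from the attempts/backoff options set at enqueue time, and concurrency is capped per worker instead of per HTTP request:

```ts
import { Worker, QueueEvents } from "bullmq";

const connection = { host: "127.0.0.1", port: 6379 };

// Stand-in for the real multi-step pipeline (parsing, external calls, scoring, synthesis).
async function runPipeline(input: string): Promise<unknown> {
  return { ok: true, input };
}

// Stand-in for persisting the result so the API can serve it by job ID.
async function saveResult(jobId: string, result: unknown): Promise<void> {}

const worker = new Worker(
  "pipeline",
  async (job) => {
    const result = await runPipeline(job.data.input);
    await saveResult(job.id!, result);
    return result; // also stored as the job's return value in Redis
  },
  { connection, concurrency: 4 } // this, not the HTTP layer, bounds concurrency
);

worker.on("completed", (job) => console.log(`job ${job.id} completed`));
worker.on("failed", (job, err) => console.error(`job ${job?.id} failed: ${err.message}`));

// Queue-level events can be consumed from any process; handy for dashboards or SSE fan-out.
const events = new QueueEvents("pipeline", { connection });
events.on("failed", ({ jobId, failedReason }) => console.error(jobId, failedReason));
```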

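And the status / SSE side. The polling endpoint is basically a lookup of the job's state plus its stored result (in my case the real result comes from the database the worker writes to, but the job's return value works for a sketch). The SSE variant below just streams the same check on a one-second interval, which is the simplest thing that works; QueueEvents can push instead if you want to avoid the interval. Same assumptions as the sketches above:

```ts
import express from "express";
import { Queue } from "bullmq";

const connection = { host: "127.0.0.1", port: 6379 };
const pipelineQueue = new Queue("pipeline", { connection });
const app = express();

// Polling variant: the frontend hits this until the state is completed or failed.
app.get("/api/jobs/:id", async (req, res) => {
  const job = await pipelineQueue.getJob(req.params.id);
  if (!job) return res.status(404).json({ error: "unknown job" });

  const state = await job.getState(); // waiting | active | completed | failed | delayed | ...
  return res.json({
    jobId: job.id,
    state,
    result: state === "completed" ? job.returnvalue : undefined,
    reason: state === "failed" ? job.failedReason : undefined,
  });
});

// SSE variant: push state changes instead of making the client poll.
app.get("/api/jobs/:id/events", async (req, res) => {
  res.set({
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
  });
  res.flushHeaders();

  const timer = setInterval(async () => {
    const job = await pipelineQueue.getJob(req.params.id);
    const state = job ? await job.getState() : "unknown";
    res.write(`data: ${JSON.stringify({ state })}\n\n`);
    if (state === "completed" || state === "failed" || state === "unknown") {
      clearInterval(timer);
      res.end();
    }
  }, 1000);

  req.on("close", () => clearInterval(timer));
});

app.listen(3001);
```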

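Last one, on the "DLQs to manage" point from the catch section: the manual version of replaying failed jobs is just walking BullMQ's failed set. Sketch only; in practice you'd want paging and some guard against retry loops:

```ts
import { Queue } from "bullmq";

const pipelineQueue = new Queue("pipeline", {
  connection: { host: "127.0.0.1", port: 6379 },
});

// Re-run everything currently sitting in the failed set.
async function replayFailed(): Promise<void> {
  const failed = await pipelineQueue.getFailed(); // first page of failed jobs
  for (const job of failed) {
    console.log(`retrying job ${job.id}: ${job.failedReason}`);
    await job.retry(); // moves the job back to the waiting state
  }
}

replayFailed().catch(console.error);
```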