Stop Solving Problems. Start Anticipating Them.
A practical guide to looking around corners: at work, in your career, and in your life.
Years ago, my team was responsible for a critical service at Amazon that handled millions of requests per minute. The service was seeing intermittent errors from a downstream API it depended on. The fix seemed obvious: I added a simple retry mechanism to the code. If a request failed, the service would immediately try again, up to three times in rapid succession.
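To make the mechanism concrete, here is a minimal sketch of that kind of naive, immediate retry. It is an illustration, not the code my team actually shipped: the function name call_with_immediate_retry and the flaky_downstream stand-in are hypothetical, and "up to three times" is read here as three retries after the first attempt (four attempts total). The key detail is that nothing waits between attempts.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Assumption: "up to three times" means three retries after the first attempt.
MAX_RETRIES = 3


def call_with_immediate_retry(call: Callable[[], T], max_retries: int = MAX_RETRIES) -> T:
    """Naive retry: on failure, try again right away, with no delay at all."""
    last_error: Optional[Exception] = None
    for _ in range(1 + max_retries):
        try:
            return call()
        except Exception as err:  # hypothetical policy: every error counts as retryable
            last_error = err
            # No sleep or backoff here: the next attempt fires immediately.
    assert last_error is not None
    raise last_error


# Hypothetical stand-in for a downstream API that fails intermittently.
calls = {"n": 0}


def flaky_downstream() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("intermittent failure")
    return "ok"


print(call_with_immediate_retry(flaky_downstream))  # prints "ok"
print(calls["n"])  # prints 3: two failures, then a success
```

Against occasional, isolated failures this looks like a perfect fix: the caller never sees the error, and the dashboards stay clean.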
And it worked. The intermittent errors on our dashboards vanished. Problem solved. Everyone was happy.
Until a few months later, when the downstream API had an extended outage. When their service finally started to recover, it was immediately overwhelmed and fell over again. The culprit was us. Our service, along with others, unleashed a "retry storm": a massive, concentrated flood of requests that hammered their recovering systems back into oblivion. My "quick fix" for the small problem had created a catastrophic failure condition for a much bigger one.
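The arithmetic behind a retry storm is simple enough to sketch. Assuming each attempt fails independently with probability p and a client retries up to r times, attempt k only happens if the previous k attempts all failed, so the expected number of downstream calls per logical request is 1 + p + p² + … + p^r. The request rate below is a hypothetical round number for illustration, not a figure from the real service.

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Expected downstream calls per logical request under immediate retries.

    Attempt k (0-indexed) is made only if the previous k attempts all failed,
    so with independent failures the expectation is sum of p**k for k = 0..r.
    """
    return sum(failure_rate ** k for k in range(max_retries + 1))


def offered_load(base_rpm: float, failure_rate: float, max_retries: int) -> float:
    """Requests per minute actually sent downstream, retries included."""
    return base_rpm * expected_attempts(failure_rate, max_retries)


BASE_RPM = 1_000_000  # hypothetical baseline, for illustration only

for p in (0.01, 0.5, 1.0):
    print(f"failure rate {p:.0%}: {offered_load(BASE_RPM, p, 3):,.0f} requests/min")
```

At a 1% failure rate the extra traffic is barely noticeable. During an outage, though, failures are not independent: nearly every call fails, p approaches 1, and each logical request becomes four downstream calls. That is 4,000,000 requests per minute instead of 1,000,000, aimed at a service at the exact moment it is least able to absorb them, and every other client doing the same thing multiplies it further.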
We had solved the first-order problem, but we had failed to see the second-order…