The Two Generals’ Problem
You can’t guarantee, with 100% certainty, that both sides agree on a state change across an unreliable network. Acknowledgments and retries only raise the probability of success, because the acknowledgment itself can be lost. This is why idempotent operations are the bedrock of reliable systems: if you don’t know whether a command worked, you must be able to send it again with no effect beyond that of the first successful attempt.
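One common way to make a retry safe is to attach an idempotency key to each logical operation, so the receiver can recognize a resend. The following is a minimal sketch under stated assumptions, not a production design: `LedgerServer`, `LostMessage`, `unreliable_deposit`, and `deposit_with_retries` are hypothetical names, the "network" is simulated with a random number generator, and the deduplication table lives in memory.

```python
import uuid


class LedgerServer:
    """Hypothetical server that applies each request at most once per key."""

    def __init__(self):
        self.balance = 0
        self._seen = {}  # idempotency key -> result of the first application

    def deposit(self, key, amount):
        # A repeated key returns the stored result instead of re-applying.
        if key not in self._seen:
            self.balance += amount
            self._seen[key] = self.balance
        return self._seen[key]


class LostMessage(Exception):
    """Raised when either the request or its acknowledgment is dropped."""


def unreliable_deposit(server, key, amount, rng, loss_rate=0.5):
    # The request may be lost before it reaches the server...
    if rng.random() < loss_rate:
        raise LostMessage("request lost")
    result = server.deposit(key, amount)
    # ...or the acknowledgment may be lost after the server applied it.
    if rng.random() < loss_rate:
        raise LostMessage("ack lost")
    return result


def deposit_with_retries(server, amount, rng, max_attempts=50):
    key = str(uuid.uuid4())  # one key shared by every attempt of this operation
    for _ in range(max_attempts):
        try:
            return unreliable_deposit(server, key, amount, rng)
        except LostMessage:
            continue  # the sender cannot tell which half failed, so it resends
    raise TimeoutError("gave up; outcome unknown")
```

Passing `random.Random()` as `rng` simulates a lossy link. Note that the final `TimeoutError` still leaves the outcome unknown to the sender, which is the Two Generals’ limit in practice: the key makes resending harmless, but it cannot make the result certain.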
Backpressure
If a consumer can’t keep up, the producer must slow down or the system will fail. Don’t hide the failure; surface it. In a microservices architecture, unhandled overload usually shows up as growing queues and timeouts, and the client retries that follow can escalate into a retry storm, a form of “thundering herd”. Using bounded queues and explicit rejection (HTTP 429 or 503) is often better than letting requests sit in a queue until they time out.
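A bounded queue with explicit rejection might look like the sketch below. It uses Python’s standard `queue.Queue` with a `maxsize`; the `BoundedIntake` class and the choice of 202 for an accepted request are illustrative assumptions, not part of any particular framework.

```python
import queue


class BoundedIntake:
    """Hypothetical request intake backed by a fixed-capacity queue."""

    def __init__(self, capacity):
        self._queue = queue.Queue(maxsize=capacity)

    def submit(self, request):
        """Return an HTTP-style status: 202 if queued, 429 if the queue is full."""
        try:
            # put_nowait raises queue.Full instead of blocking the producer.
            self._queue.put_nowait(request)
        except queue.Full:
            return 429  # reject immediately so the caller can back off
        return 202

    def process_one(self):
        """Consumer side: take one request, or return None when idle."""
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            return None


intake = BoundedIntake(capacity=2)
statuses = [intake.submit(n) for n in range(3)]  # third request is rejected
```

The rejected caller learns about the overload right away, rather than after a timeout, and the queue never holds more work than the consumer was sized to handle.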
Failure Domains
Assume every component will fail. Design so that a failure in one region or one service doesn’t cascade. This means graceful degradation. If the search service is down, the user should still be able to view their profile. If the database has dropped to read-only mode, the UI should say so instead of just spinning.
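The search-and-profile case can be sketched as a page builder that treats one dependency as required and the other as optional. Everything here is hypothetical: `ServiceUnavailable`, `render_home`, and the two fetch callables stand in for whatever client code a real system would use.

```python
class ServiceUnavailable(Exception):
    """Hypothetical error raised by a client when its backend is down."""


def render_home(user_id, fetch_profile, fetch_search_suggestions):
    """Build a page model; the profile is required, search is optional."""
    page = {"profile": fetch_profile(user_id), "search": None, "notices": []}
    try:
        page["search"] = fetch_search_suggestions(user_id)
    except ServiceUnavailable:
        # Contain the failure: keep the page and tell the user what is missing.
        page["notices"].append("Search is temporarily unavailable.")
    return page


def profile_ok(user_id):
    return {"id": user_id, "name": "example"}


def search_down(user_id):
    raise ServiceUnavailable("search backend unreachable")


page = render_home(42, profile_ok, search_down)
```

Here the profile still renders and the notice surfaces the degraded state. This sketch only catches `ServiceUnavailable`; a real client would also need to map timeouts and connection errors onto the same path.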
Fallacies of Distributed Computing
- The network is reliable.
- Latency is zero.
- Bandwidth is infinite.
- The network is secure.
- Topology doesn’t change.
- There is one administrator.
- Transport cost is zero.
- The network is homogeneous.
Never design or optimize as if any of these were true. Code that quietly relies on them is a common source of “impossible” production bugs.