It’s simple, really: services call other services and take actions based on the responses they get back. Sometimes that action is a success, sometimes it’s a failure. But whether it is a success or a failure depends on whether the interaction meets certain requirements. In particular, the response must be predictable, understandable and reasonable for the given situation. This matters because the service reading the response needs to make appropriate decisions and not propagate garbage results. When a service gets a response it does not understand, it may act on that garbage response anyway, and those actions can have dangerous side effects on your service and your application.
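To make the idea concrete, here is a minimal sketch of a caller that checks a response before acting on it. Everything in it is an assumption made for illustration: the inventory-style payload, the `sku` and `quantity` fields, the `MAX_REASONABLE_QUANTITY` bound and the function names are hypothetical and not taken from any real service.

```python
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical sanity bound: anything above this is treated as garbage.
MAX_REASONABLE_QUANTITY = 10_000


@dataclass(frozen=True)
class StockLevel:
    sku: str
    quantity: int


def parse_stock_response(payload: Any, expected_sku: str) -> Optional[StockLevel]:
    """Return a StockLevel only if the payload passes every check, else None."""
    # Predictable: the response has the shape we expect.
    if not isinstance(payload, Mapping):
        return None
    sku = payload.get("sku")
    quantity = payload.get("quantity")

    # Understandable: fields have the right types and answer our question.
    if not isinstance(sku, str) or sku != expected_sku:
        return None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(quantity, bool) or not isinstance(quantity, int):
        return None

    # Reasonable: the value makes sense for the situation.
    if not 0 <= quantity <= MAX_REASONABLE_QUANTITY:
        return None
    return StockLevel(sku=sku, quantity=quantity)


def can_fulfil(payload: Any, sku: str, wanted: int) -> bool:
    """Decide whether to accept an order, treating a garbage response as a failure."""
    stock = parse_stock_response(payload, sku)
    if stock is None:
        return False  # do not act on a response we cannot trust
    return stock.quantity >= wanted
```

The design choice here is that a response failing any check is handled as a plain failure, so a negative quantity or a malformed body never reaches the decision logic.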
The New Stack
Bringing down an entire application is easy. All it takes is the failure of a single service and the entire set of services that make up the application can come crashing down like a house of cards. Just one minor error from a non-critical service can be disastrous to the entire application. There are, of course, many ways to prevent dependent services from failing. However, adding extra resiliency in non-critical services also adds complexity and cost, and sometimes it is not needed. Read the entire article today in The New Stack.
Major bug? Human error? Neither. The AWS S3 outage last week was more like a minor bug in an otherwise solid availability plan executed by AWS. Read my article at The New Stack.