Reliability and Resilience

The Difference Between Reliable and Resilient Software

And why it matters

Digital product objectives and priorities differ from company to company, but one thing remains constant: everyone wants their apps to work. As a user, you want to be able to do things easily and conveniently. As a product owner or service provider, you want your customers to be able to do what they need to do, when they want to do it. That requires applications that are both reliable and resilient. Resiliency isn’t just a buzzword anymore; it’s a key component of reliability.

The Institute of Electrical and Electronics Engineers (IEEE) Reliability Society defines reliability as “the probability of failure-free software operation for a specified period of time in a specified environment.” A reliable app functions just as its designer intended whenever and wherever a customer is connected. But that doesn’t mean every component of the app has to be flawless all of the time, which brings us to the difference between reliability and resiliency.
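To make the IEEE definition concrete: if T is the time until the software’s first failure, reliability over a period t is the probability R(t) = P(T > t). A common simplifying assumption, which is not part of the definition itself, is a constant failure rate λ, giving R(t) = exp(−λt). The minimal Python sketch below computes that under this assumption; the function name `reliability` and the example failure rate are hypothetical.

```python
import math


def reliability(failure_rate_per_hour: float, hours: float) -> float:
    """Probability of failure-free operation for `hours` hours.

    Assumes a constant failure rate (exponential model): R(t) = exp(-lambda * t).
    """
    if failure_rate_per_hour < 0 or hours < 0:
        raise ValueError("failure rate and duration must be non-negative")
    return math.exp(-failure_rate_per_hour * hours)


# Hypothetical example: on average one failure per 1,000 hours, over a 100-hour window.
print(round(reliability(0.001, 100), 3))  # 0.905
```

Under this assumption, an app that fails on average once every 1,000 hours has roughly a 90% chance of running failure-free for 100 hours, and the longer the specified period, the lower that probability gets.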

Reliability: The target software designers have always aimed for, namely perfect operation all the time. Reliability is the planned outcome.

Resiliency: The ability of an app to recover from certain types of failure and still remain functional from the customer’s perspective. Resilience is how you achieve the outcome. Resiliency can also be called recoverability.

All applications carry the risk that a single feature or function will fail and cause cascading effects on the application’s overall functionality or availability. For example, an update to the cloud-based address book used in an app could cause the address book to fail while the remainder of the app performs as designed. Resiliency, in this case, could be achieved by rendering a cached static version of the address book if the real-time, data-driven version fails.
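The cached fallback described above might look like the following minimal Python sketch. It assumes the live address book is fetched through a caller-supplied function (`fetch_live` is hypothetical, standing in for whatever cloud API the app actually uses) and that the last successful result is kept as the static fallback copy.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)

Contact = dict[str, str]


class AddressBook:
    """Serves live contacts when possible, otherwise the last good snapshot."""

    def __init__(self, fetch_live: Callable[[], list[Contact]], snapshot: list[Contact]):
        self._fetch_live = fetch_live    # hypothetical call to the cloud address book
        self._snapshot = list(snapshot)  # cached static version used as the fallback

    def contacts(self) -> tuple[list[Contact], bool]:
        """Return (contacts, is_live)."""
        try:
            fresh = self._fetch_live()
        except Exception as exc:  # the failure stays inside this feature
            logger.warning("Live address book unavailable, serving cache: %s", exc)
            return list(self._snapshot), False
        self._snapshot = list(fresh)  # refresh the fallback copy on every success
        return fresh, True


def broken_service() -> list[Contact]:
    raise ConnectionError("address book service is down")


book = AddressBook(broken_service, [{"name": "Ada", "email": "ada@example.com"}])
contacts, is_live = book.contacts()
print(is_live, contacts[0]["name"])  # False Ada
```

Returning a flag alongside the data lets the interface tell the customer they are seeing a cached copy, so the failure stays visible without breaking the rest of the page.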

Resilience in this context means that failures must be compartmentalized: the failure of one feature should not cause the failure of other features. When one feature, like the address book, is temporarily unavailable, the rest of the application keeps running. Once the failure has been contained, a recovery routine activates and restarts the failing component. These steps need to be automatic, immediate, and reliable. When the component’s functionality has been restored, normal collaboration with other components can resume.
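One way to sketch this contain-and-restart behavior is a small supervisor, loosely modeled on the supervisor pattern, that renders each feature in isolation, substitutes a placeholder when one fails, and automatically re-creates the failing component. The `Supervisor` class, the feature factories, and the placeholder text are all hypothetical; this is a minimal illustration, not a production design.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)

Handler = Callable[[], str]      # renders one feature
Factory = Callable[[], Handler]  # (re)creates a feature's handler

FALLBACK = "temporarily unavailable"


class Supervisor:
    def __init__(self, factories: dict[str, Factory]):
        self._factories = factories
        self._handlers: dict[str, Handler] = {}
        self.restarts = {name: 0 for name in factories}
        for name in factories:
            self._start(name)

    def _start(self, name: str) -> None:
        try:
            self._handlers[name] = self._factories[name]()
        except Exception as exc:
            # Leave the feature down; the next render will try again.
            logger.error("Could not start %s: %s", name, exc)
            self._handlers.pop(name, None)

    def render(self) -> dict[str, str]:
        """Render every feature; one failure never stops the others."""
        page: dict[str, str] = {}
        for name in self._factories:
            handler: Optional[Handler] = self._handlers.get(name)
            try:
                if handler is None:
                    raise RuntimeError("component is not running")
                page[name] = handler()
            except Exception as exc:
                logger.warning("%s failed (%s); restarting it", name, exc)
                page[name] = FALLBACK  # contain the failure to this feature
                self.restarts[name] += 1
                self._start(name)      # automatic, immediate restart
        return page


# Hypothetical address book: the first instance fails, the restarted one works.
instances = {"count": 0}


def make_address_book() -> Handler:
    instances["count"] += 1
    generation = instances["count"]

    def handler() -> str:
        if generation == 1:
            raise ConnectionError("address book backend down")
        return "3 contacts"

    return handler


app = Supervisor({"address_book": make_address_book, "checkout": lambda: (lambda: "ready")})
print(app.render())  # {'address_book': 'temporarily unavailable', 'checkout': 'ready'}
print(app.render())  # {'address_book': '3 contacts', 'checkout': 'ready'}
```

Because each feature is called inside its own try block, checkout keeps rendering while the address book is down, and the restarted address book rejoins the page on the next render.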

Software is reliable if its mission-critical features are able to recover from failure.

CabForward believes strongly in writing software that doesn’t suck. We focus on resilience in our digital product planning cycle because we accept that failure is a fundamental part of the programming model. We build software to be rugged so that it can continue to operate under adverse or hostile conditions. Why? Because if your app quits working, it is your customers who get frustrated, and that could adversely affect your livelihood.

Read More:
Rugged Software Rugged Manifesto
