Building resilience into applications

Storagebod has written that:

“[When new applications are deployed,] often the first contact that the infrastructure team will have is when an application is delivered to be integrated into the infrastructure and they try to get the application to meet its NFRs, SLAs etc.


Turning to the infrastructure to fix application problems, design flaws and oversights [in application design] should become the back-stop; yes, we will still use infrastructure to fix many problems but less often and with a greater understanding of why and what the implications are.”

I agree that it would be nice to see applications and developers bear more of the burden of ensuring they are recoverable from operational recovery (OR) and disaster recovery (DR) perspectives. It’s worth noting, though, that in the not-so-distant past applications that simply had to work – that were ‘carrier grade’, so to speak – would be developed on operating systems that had the necessary software ‘infrastructure’ to deliver on those NFRs. This raises the question of why we don’t see all applications built in this way. There are a number of reasons, but I’d argue that the primary one is simply that application development is more expensive than infrastructure.

Development of a new application or platform costs a lot of money. Whatever the complexity involved in developing non-functional requirements (NFRs) for things like availability, the pain involved in determining the functional requirements (or features) is far greater. Outside of a small number of edge-cases such as core software components of telecoms networks or manufacturing facilities, it does not make sense to build operational or disaster failure tolerance into the code of an application. Application developers (both internally in companies and in the wider ISV environment) focus on the functionality that delivers value to the business or will contribute to selling their product, not on replication, block-level data validation or data-recovery.

Even where developers are interested in building in resilience, they have been faced with a lack of software ‘infrastructure’ to support them. Many of those highly resilient applications in telcos, manufacturers etc. were built on operating systems or used components that provided services such as shared-everything clustering and highly resilient, sharable file systems. OpenVMS, which pioneered many of these services, will forever be a niche product – albeit one that supports many extremely critical functions – because of the costs in terms of cash and flexibility that are paid when applications are developed on it. Building in resilience makes development (both initial and ongoing) and maintenance of applications more complex and expensive. It also means that the developer has to take responsibility for guaranteeing recoverability, and who in their right mind would want to do that ;)?

Today, Oracle and MS are building a new variety of this software ‘infrastructure’ into their products, but it’s only being used in a small proportion of developments. Even given the possibility that you might save some money on storage replication, people don’t seem to be using these services all that readily, for the reason, I suspect, that developers (or management overhead) are still more expensive than those replication licenses.

The only way to change that situation would be (as SB notes) to make delivering resilience at the application layer simple, repeatable and manageable. That’s very much easier said than done, though, and the twenty-odd years of development of infrastructure resilience services are testament to that. There’s one place where the problem is being addressed, though, and that’s out there in the cloud…

Personally, I think there is often a wider issue of integration between application and infrastructure teams that leads to situations where organisations focus on data rather than application or service recoverability. That boils down to process and in some cases (over-)specialisation, but it’s a question for another day, I think.

