Governance in the cloud

One of the IT department’s less official roles has been as gatekeeper to an organisation’s infrastructure. The cost and time-to-market constraints that internal IT sometimes imposes can lead to applications being cancelled, or even never being proposed in the first place. By allowing the business to side-step the IT department, though, cloud computing enables departments and individuals within organisations to get new applications up and running quickly, with investment largely focussed on development.

Where internal IT is imposing unreasonable delays and costs, this is going to be great for businesses. There are some major caveats to add, though. In particular, a lot of the governance and ‘red tape’ that internal IT seems to impose is actually about protecting an organisation’s data. By checking that things like backup and recovery have been considered and planned for, IT ensures that an organisation’s data, reputation and ultimately its business are protected. Where those checks are bypassed, it is fair to expect that the ‘boring’ aspects of application development and deployment will not get the attention they really require. The litany of data-loss horror stories never seems to abate. Cloud computing service providers may provide the tools to implement effective backups, but that won’t guarantee that developers will use them.

To be clear, the threat here is not that organisations will use cloud computing, which will be a great addition to the IT tool-box. The threat is the same as that posed by applications running on servers sitting under people’s desks; it is the same threat as that posed by data that leaks out on portable drives. The threat is that broken governance can lead to no governance, and that organisations will be compromised as a result.

The solution is for internal IT and their management to build cloud computing into their governance and release management models. In much the same way as for suppliers of physical infrastructure, organisations need to choose their suppliers and build standards for development and deployment. By doing this, they can ensure that all applications, whether hosted internally or in the cloud, are checked for compliance with data protection, availability and security requirements.

There’s something else to say here, though, and that’s to remember quite how much due diligence vendors of physical infrastructure are put through before purchase decisions are made. Ultimately, even an SLA isn’t really enough unless you are convinced that the organisation to which you are entrusting your data is able to follow through on its promises. I wonder what the cloud services RFP equivalent of a double disk pull will be?

Building resilience into applications

Storagebod has written that:

“[When new applications are deployed,] often the first contact that the infrastructure team will have is when an application is delivered to be integrated into the infrastructure and they try to get the application to meet its NFRs, SLAs etc.

[…..]

Turning to the infrastructure to fix application problems, design flaws and oversights [in application design] should become the back-stop; yes, we will still use infrastructure to fix many problems but less often and with a greater understanding of why and what the implications are.”

I agree that it would be nice to see applications and developers bear more of the burden of ensuring they are recoverable from operational recovery (OR) and disaster recovery (DR) perspectives. It’s worth noting, though, that in the not-so-distant past, applications that simply had to work – that were ‘carrier grade’, so to speak – would be developed on operating systems that had the necessary software ‘infrastructure’ to deliver on those NFRs. That raises the question of why we don’t see all applications built in this way. There are a number of reasons, but I’d argue that the primary one is simply that application development is more expensive than infrastructure.

Development of a new application or platform costs a lot of money. Whatever the complexity involved in delivering non-functional requirements (NFRs) for things like availability, the pain involved in determining the functional requirements (or features) is far greater. Outside of a small number of edge cases, such as the core software components of telecoms networks or manufacturing facilities, it does not make sense to build operational or disaster failure tolerance into the code of an application. Application developers (both internally in companies and in the wider ISV environment) focus on the functionality that delivers value to the business or will contribute to selling their product, not on replication, block-level data validation or data recovery.
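To make that burden concrete, here is a minimal sketch (in Python, with plain in-memory dictionaries standing in for real back-ends; the class and method names are purely illustrative, not any particular product’s API) of the sort of replication and validation logic an application team would have to own, test and support if it took those responsibilities on itself:

    # Hypothetical sketch: dict-like objects stand in for real storage back-ends.
    import hashlib

    class ReplicatedStore:
        def __init__(self, primary, replicas):
            self.primary = primary      # dict-like primary store
            self.replicas = replicas    # list of dict-like replica stores

        def write(self, key, data):
            # The application, not the storage layer, now decides what 'durable' means.
            checksum = hashlib.sha256(data).hexdigest()
            record = {"data": data, "checksum": checksum}
            self.primary[key] = record
            for replica in self.replicas:
                replica[key] = record

        def read(self, key):
            # Validate each copy and fall back to a replica on corruption:
            # more code paths to design, test and maintain in every application.
            for store in [self.primary, *self.replicas]:
                record = store.get(key)
                if record and hashlib.sha256(record["data"]).hexdigest() == record["checksum"]:
                    return record["data"]
            raise IOError("no valid copy of %r found" % key)

    # Usage, with dicts in place of real stores.
    store = ReplicatedStore(primary={}, replicas=[{}, {}])
    store.write("order-42", b"payload")
    print(store.read("order-42"))

Even this toy version ignores the genuinely hard parts – partial failures mid-write, consistency between copies, recovery after an outage – which is exactly the work that developers would rather leave to the infrastructure.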

Even where developers are interested in building in resilience, they have been faced with a lack of software ‘infrastructure’ to support them. Many of those highly resilient applications in telcos, manufacturers and the like were built on operating systems, or used components, that provided services such as shared-everything clustering and highly resilient, shareable file systems. OpenVMS, which pioneered many of these services, will forever be a niche product – albeit one that supports many extremely critical functions – because of the costs, in terms of both cash and flexibility, that are paid when applications are developed on it. Building in resilience makes development (both initial and ongoing) and maintenance of applications more complex and expensive. It also means that the developer has to take responsibility for guaranteeing recoverability, and who in their right mind would want to do that? ;)

Today, Oracle and Microsoft are building a new variety of this software ‘infrastructure’ into their products, but it is only being used in a small proportion of developments. Even given the possibility that you might save some money on storage replication, people don’t seem to be using these services all that readily – for the reason, I suspect, that developers (or the management overhead around them) are still more expensive than those replication licences.
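As a purely illustrative sketch (the endpoints are made up, and connect() stands in for whichever client library the database vendor actually supplies), even leaning on database-level resilience tends to leave the application with some failover awareness of its own:

    # Hypothetical sketch: 'endpoints' are invented addresses and connect() is a
    # stand-in for the vendor's real client library. The point is only that
    # 'built-in' resilience still asks the application to be failover-aware.
    import time

    def connect_with_failover(connect, endpoints, retries=3, delay=2.0):
        """Try each endpoint in order, retrying the whole list a few times."""
        last_error = None
        for _ in range(retries):
            for dsn in endpoints:
                try:
                    return connect(dsn)            # returns a live connection
                except ConnectionError as exc:     # swap in the driver's real error type
                    last_error = exc
            time.sleep(delay)                      # back off before the next pass
        raise last_error if last_error else ConnectionError("no endpoints reachable")

    # conn = connect_with_failover(some_driver.connect,
    #                              ["db-primary.example.com", "db-standby.example.com"])

Small as that plumbing is, it still has to be written, reviewed and supported somewhere, which is part of why the replication licence so often wins.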

The only way to change that situation would be, as Storagebod notes, to make delivering resilience at the application layer simple, repeatable and manageable. That’s very much easier said than done, and the twenty-odd years spent developing infrastructure resilience services are testament to that. There is one place where the problem is being addressed, though, and that’s out there in the cloud…

Personally, I think there is often a wider issue of integration between application and infrastructure teams that leads organisations to focus on data recoverability rather than application or service recoverability. That boils down to process and, in some cases, (over-)specialisation – but it’s a question for another day.