Disaster Recovery in the Cloud: Key points to discuss with your DR Service vendor

Configuring Disaster Recovery (DR) is not a thrilling exercise for IT. However, given how heavily the modern enterprise relies on online systems, the organization's ability to keep operating through a serious incident should be a top priority. A managed service provider is often the most practical option. A typical DR architecture involves a primary production site and a remote secondary recovery site, with the two closely interdependent. Ideally, well-defined Service-Level Agreements (SLAs) will spell out the important details of the actual recovery process. To negotiate the best terms for your business, consider the following key points, which form the backbone of the SLA documentation.

Infrastructure Availability

Needless to say, if your cloud-based secondary site is promoted to primary, the success of the recovery depends on the availability of the cloud infrastructure. The uptime SLA is usually expressed as a percentage, e.g. 99.99%. For many applications an uptime SLA of 99% may not be adequate, because it allows for roughly 3.65 days of downtime per year, while 100% may simply be impossible to guarantee. The SLA may also spell out different uptime levels for different parts of the day (e.g. 99.99% during work hours, and 99.95% during nights and weekends).
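To make the percentages concrete, the short sketch below converts an uptime SLA into the downtime it permits over a year. It assumes a 365-day year and that the percentage applies to the whole period; the function name is illustrative only, and a real SLA may measure uptime per month or exclude planned maintenance.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def allowed_downtime_hours(uptime_percent, period_hours=HOURS_PER_YEAR):
    """Return the downtime, in hours, that an uptime percentage permits."""
    return period_hours * (1 - uptime_percent / 100)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.95, 99.99):
        hours = allowed_downtime_hours(pct)
        print(f"{pct}% uptime allows {hours:.2f} hours "
              f"({hours * 60:.0f} minutes) of downtime per year")
```

Under these assumptions, 99% allows about 87.6 hours of downtime per year, while 99.99% allows under one hour.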

Replication Service

Another critical SLA for disaster recovery in the cloud is the guarantee of continuous replication to the secondary site. Even if the cloud infrastructure is set up with adequate availability, any failure during the replication process can compromise the integrity of the standby backup systems and hinder recovery in a DR event.

Recovery Team Response

The Recovery Team Response SLA specifies how quickly the vendor’s specialist team will be available to bring the backup site online in case of an incident. Clearly, if the infrastructure hardware is in place and all data is fully replicated, a delayed response by the Disaster Recovery as a Service (DRaaS) team would be the weakest link in the recovery chain. An SLA is required to ensure that the managed DRaaS vendor will be there in case of a DR event.

Recovery Time Objective (RTO)

The Recovery Time Objective, or RTO, specifies the target maximum elapsed time between the DR event and the moment of recovery. DRaaS vendors vary greatly when it comes to this critical business continuity metric. The RTO shapes the customer’s expectations of the recovery, so it is important to understand clearly how the vendor defines RTO and under what service conditions it applies. Depending on the specific client architecture, the vendor must perform many recovery activities before the customer’s systems resume normal operation, and the SLA must specify exactly what level of normal operation to expect.
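As a rough illustration of how an RTO can be checked after the fact, the sketch below compares the elapsed time between a DR event and recovery against an agreed objective. The timestamps and the four-hour RTO are hypothetical values chosen for the example, and what counts as "recovered" is exactly the point the SLA needs to define.

```python
from datetime import datetime, timedelta


def rto_met(event_time, recovered_time, rto):
    """Return (met, elapsed): whether recovery finished within the RTO."""
    elapsed = recovered_time - event_time
    return elapsed <= rto, elapsed


if __name__ == "__main__":
    # Hypothetical incident: values are for illustration only.
    event = datetime(2024, 3, 1, 9, 0)
    recovered = datetime(2024, 3, 1, 12, 30)
    agreed_rto = timedelta(hours=4)

    met, elapsed = rto_met(event, recovered, agreed_rto)
    print(f"Elapsed: {elapsed}, RTO met: {met}")
```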

SLA Service Credits

Vendors typically offer SLA service credits if they fail to fulfil their commitments. This works for both sides – customers get some sort of compensation, while vendors get to limit their liability and avoid compensating customers in full for a breach of contract. Customers are usually required to file a claim for these credits in order to receive them, often within a specific time limit.
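The sketch below shows one way a credit schedule and a claim deadline might fit together. Every number in it is hypothetical: the credit tiers, the 30-day claim window, and the function names are assumptions for illustration, not terms from any particular vendor's SLA.

```python
from datetime import date, timedelta

# Hypothetical schedule: (minimum measured uptime %, credit as % of monthly fee),
# ordered from the highest uptime floor to the lowest.
CREDIT_TIERS = [(99.99, 0), (99.9, 10), (99.0, 25), (0.0, 50)]

# Hypothetical time limit for filing a claim after the incident.
CLAIM_WINDOW = timedelta(days=30)


def credit_percent(measured_uptime):
    """Return the credit percentage for the first tier the uptime reaches."""
    for floor, credit in CREDIT_TIERS:
        if measured_uptime >= floor:
            return credit
    return CREDIT_TIERS[-1][1]


def claim_is_timely(incident_date, claim_date):
    """Return True if the claim was filed within the claim window."""
    return timedelta(0) <= claim_date - incident_date <= CLAIM_WINDOW


if __name__ == "__main__":
    print(credit_percent(99.95))
    print(claim_is_timely(date(2024, 1, 1), date(2024, 2, 15)))
```

The point of the second function is that a valid credit can still be lost if the claim is filed after the window closes.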