
DR Drill On Cloud For ISO 27001

Updated: Jul 30, 2024

DR On Cloud

Disaster recovery is the process of recovering from unforeseen, unplanned events that impact the business. Various factors influence a disaster recovery plan; the most critical among them are the physical distance and the latency between the DC (primary datacenter) and the DRC (disaster recovery center).

DC and DRC sites must not reside in the same disaster zone. For example, as per ISO 27001, the DC and DRC must be at least 40 km away from each other. Some availability zones on clouds might not abide by this rule. If you have a DC/DRC requirement, be sure to validate the physical distance between your cloud provider's availability zones.
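To validate the distance yourself, a rough check is to compute the great-circle distance between the two sites. The sketch below uses the haversine formula; the coordinates, the function names and the 40 km default are assumptions for illustration, and exact datacenter locations are often not published, so treat the result as an estimate.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def far_enough(dc, drc, min_km=40.0):
    """True if the two (lat, lon) sites are at least min_km apart."""
    return haversine_km(*dc, *drc) >= min_km


# Hypothetical site coordinates (lat, lon); replace with your own.
dc_site = (19.08, 72.88)
drc_site = (18.52, 73.86)
print(round(haversine_km(*dc_site, *drc_site)), "km apart,",
      "OK" if far_enough(dc_site, drc_site) else "too close")
```

Distance alone does not prove that two sites are in different disaster zones, so this check complements the provider's own documentation rather than replacing it.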


How do clouds comply with DR requirements?

Surprising as it may be, your favourite cloud provider may not be DC/DRC compliant everywhere. Providers are usually compliant in countries where they operate multiple regions, but in many regions, mostly the newly launched ones, their focus is on high availability rather than disaster recovery.

That is why DR drills are mandatory at least once a year and must be conducted to test and prove the RTO/RPO numbers.

Regularly tested BCP and DR plans on evenly distributed, fully independent sites need to be recorded and certified by auditors, especially for applications dealing with essential services like banking and healthcare.

This also builds confidence in in-house processes.


How to measure DR compliance on Cloud?


  • RTO (Recovery Time Objective) -> How fast can it recover.

  • RPO (Recovery Point Objective) -> How much can it recover.


Both are measured in units of time (mins, hrs) and will tell you if the cloud you are hosting on is DR compliant.

You get DR certified only when you prove what you define as RTO and RPO. Let me explain.

If you have defined your RTO/RPO as 1 hr/10 mins, you must ensure you can recover from any disaster within 1 hr and that you can recover the data up to 10 mins before the time of the disaster.

RTO/RPO must be documented and in line with your business availability and SLA requirements.

For example, an SLA of 99.99% means a yearly downtime of 52m 35s. So your RTO becomes approximately 1 hr, which means that in the event of a disaster you must be capable of recovering within 1 hr.
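The downtime figure can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming a year of 365.25 days; the function names are made up for illustration.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumes an average year of 365.25 days


def allowed_downtime_minutes(sla_percent):
    """Yearly downtime budget, in minutes, for an availability SLA in percent."""
    return (100.0 - sla_percent) / 100.0 * MINUTES_PER_YEAR


def format_minutes(minutes):
    """Render minutes as 'Xm Ys', truncating fractions of a second."""
    total_seconds = int(minutes * 60)
    m, s = divmod(total_seconds, 60)
    return f"{m}m {s}s"


for sla in (99.9, 99.95, 99.99):
    print(sla, "->", format_minutes(allowed_downtime_minutes(sla)))
```

For 99.99% this gives the 52m 35s quoted above. Note that the budget covers the whole year, so a single disaster that uses up the full RTO leaves almost nothing for other outages.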

Likewise, an RPO of, let's say, 5 mins means you must recover the data up to 5 mins before the disaster occurred. In other words, up to 5 mins of data can be lost.

The good news is that the RPO and RTO numbers are yours to decide. So commit only to what you can deliver, or else you might default on legal and regulatory terms if you cannot prove it with actual results.
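Proving the numbers comes down to comparing the timestamps recorded during a drill with the documented targets. The sketch below shows one way to do that; the function, its parameter names and the sample timestamps are assumptions for illustration, not part of any audit standard.

```python
from datetime import datetime, timedelta


def evaluate_drill(disaster_at, last_recoverable_at, service_restored_at, rto, rpo):
    """Compare measured downtime and data loss against the RTO/RPO targets."""
    downtime = service_restored_at - disaster_at    # how long recovery took
    data_loss = disaster_at - last_recoverable_at   # window of unrecoverable data
    return {
        "downtime": downtime,
        "data_loss": data_loss,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss <= rpo,
    }


# Hypothetical drill timestamps checked against targets of 1 hr / 10 mins.
result = evaluate_drill(
    disaster_at=datetime(2024, 7, 30, 10, 0),
    last_recoverable_at=datetime(2024, 7, 30, 9, 54),
    service_restored_at=datetime(2024, 7, 30, 10, 47),
    rto=timedelta(hours=1),
    rpo=timedelta(minutes=10),
)
print(result)
```

Keeping these records for every drill gives the auditors the actual results they ask for.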

Now let's say Cloud A has two regions within a country, each region with 3 availability zones. On most clouds a region stands for a city or state, and an availability zone (AZ) refers to a datacenter within that city.

So when city 1 encounters a disaster-like situation that does not seem recoverable within the RTO/RPO limits, services need to be switched to the DRC, or disaster recovery center, in a different city that is not within the disaster zone.


How to run DR Drill on Cloud for ISO 27001 audits?


  1. DC/DRC Setup: Infrastructure must be created across two regions with multiple zones each. For cost optimization you can design an active/passive setup. Every activity must be fully automated, so the passive site can be brought online as soon as possible.

  2. Backups: Online, scheduled backups and off-site backups must exist for critical systems and data. A weekly full backup, daily differentials and 2-hourly transaction backups, or better, must be in place. These backups must be restored and tested regularly to prove they work.

  3. Infrastructure Failover/Recovery: Identify services that can be turned on and off on the cloud. For example, you cannot turn an AZ on or off. So instead you will have to simulate an AZ outage, let's say with an instance/cluster in one AZ that can be turned on/off, or maybe a DNS record. If nothing else, you can at least apply a firewall/security rule that blocks incoming traffic in one AZ/region, so you can test the failover and recovery.

  4. Application Failover/Recovery: Sync deployments, sync data and test session switchover between endpoints.

  5. DR Plan/Rundown: Each and every step to be performed for a DR must be documented in a detailed plan, including the contact details and escalation matrix for the concerned parties. This may sound boring, but it is one of the most important requirements, both for the auditors and for the engineers who will be running the entire show.
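For the backup schedule in step 2, it helps to know which backups a restore would actually need for a given point in time. The sketch below picks that restore chain. It assumes that a differential contains all changes since the last full backup, so only the latest one is needed, and that transaction backups are applied in order after it; the backup catalogue and the function name are hypothetical.

```python
from datetime import datetime


def restore_chain(backups, target):
    """Select the backups needed to restore as close to `target` as possible.

    backups: iterable of (taken_at, kind), kind in {"full", "diff", "txn"}.
    Returns the chain in the order it should be applied.
    """
    usable = sorted(b for b in backups if b[0] <= target)
    fulls = [b for b in usable if b[1] == "full"]
    if not fulls:
        raise ValueError("no full backup available before the target time")
    full = fulls[-1]
    # Assumption: each differential is relative to the latest full backup.
    diffs = [b for b in usable if b[1] == "diff" and b[0] > full[0]]
    base = diffs[-1] if diffs else full
    txns = [b for b in usable if b[1] == "txn" and b[0] > base[0]]
    return [full] + ([base] if diffs else []) + txns


# Hypothetical catalogue: weekly full, daily differentials, 2-hourly transaction backups.
catalogue = [
    (datetime(2024, 7, 28, 0, 0), "full"),
    (datetime(2024, 7, 29, 0, 0), "diff"),
    (datetime(2024, 7, 29, 2, 0), "txn"),
    (datetime(2024, 7, 30, 0, 0), "diff"),
    (datetime(2024, 7, 30, 2, 0), "txn"),
    (datetime(2024, 7, 30, 4, 0), "txn"),
    (datetime(2024, 7, 30, 6, 0), "txn"),
]
disaster = datetime(2024, 7, 30, 5, 10)
chain = restore_chain(catalogue, disaster)
print(chain)
print("data loss:", disaster - chain[-1][0])
```

With backups alone, the data loss can be as large as the gap between two transaction backups, so a 2-hourly schedule cannot by itself prove an RPO of a few minutes. That gap has to be closed by replication or by more frequent backups.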


Summary:

  • High availability and disaster recovery are different. You can achieve HA easily with availability zones, but to be DR ready you will need cross-region distribution.

  • Applications must be evenly distributed across multiple regions that are not within the same disaster-prone zone or are at least 40 km apart.

  • Data must be replicated and backed up across regions. 

  • The latency between the availability zones across regions must be less than 1 ms for any network-based data transmission.

  • Always test your DR plan on the cloud. Have a well-defined and detailed NDA with cloud providers on data privacy and localisation.
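The latency point in the summary can be spot-checked from an instance in one site against an endpoint in the other. The sketch below times TCP connection setup as a rough proxy for round-trip latency. The host name is a placeholder, the function name is made up, and connection setup includes some operating system overhead, so read the numbers as an upper bound rather than a precise measurement.

```python
import socket
import statistics
import time


def tcp_connect_latency_ms(host, port, samples=5, timeout=2.0):
    """Median time in milliseconds to open a TCP connection to host:port."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        # The TCP handshake takes roughly one network round trip.
        with socket.create_connection((host, port), timeout=timeout):
            pass
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Example call, to be run from the DC against a (hypothetical) DRC endpoint:
# print(tcp_connect_latency_ms("drc-endpoint.example.internal", 443))
```

Taking the median over several samples reduces the effect of a single slow connection attempt.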



If you like this article, I am sure you will find 10-Factor Infrastructure even more useful. It compiles all these tried and tested methodologies, design patterns & best practices into a complete framework for building secure, scalable and resilient modern infrastructure. 


 





 


Thanks & Regards

Kamalika Majumder


©2024 by Staxa LLP. All Rights Reserved.
