Disaster Recovery - What You Need to Know
On July 19, 2024, security vendor CrowdStrike released a routine update to its widely used software. A bug in that update, and in how it interacted with Windows, sent thousands of computers worldwide into an endless loop of blue screens of death. Some companies implemented fixes quickly and got back up and running, while others, most famously Delta Air Lines, struggled to recover and were down for several days.
Human-caused or natural disasters can happen at any time, and organizations need to be prepared for them. Disaster recovery can take many approaches, but it is always important to take a business-based approach to planning. While the criticality of systems is an important factor, you don’t want to spend $1 million to protect a data system that is only worth $10,000 to the business. With that in mind, there are several other factors to consider when planning for disaster:
- What is your tolerance for data loss?
- What is your tolerance for downtime?
- Where is your data hosted?
Data Loss
If you ask any CIO/CISO, “How much data are you willing to lose?” the answer is always none. When you follow up with the cost of building a system that supports zero data loss, the response usually becomes, “OK, we could probably lose 5-10 minutes of data and be fine.” Protecting against data loss is cheaper than protecting against downtime: you take backups more frequently and get them into another physical location.
Going from five minutes of data loss to zero frequently costs at least three times as much in infrastructure and adds complexity to your systems. However, there are scenarios (stock trading, online retail, or gaming) where you must capture all the data, so the money is worth it. For DR purposes, this metric is usually referred to as the recovery point objective (RPO). While backups are one aspect of this, you also need to consider other threats, such as a ransomware attack or a fire in your data center, which could require a complete hardware replacement before you can even start restoring data.
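As a rough illustration of how backup frequency relates to RPO, here is a minimal Python sketch. It assumes a simple model that is not from any particular product: backups run on a fixed interval, and each one takes a fixed amount of time to land in the second location. The function names and the numbers are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         offsite_copy_time: timedelta) -> timedelta:
    """Upper bound on lost data under the simple model described above.

    The worst moment to lose the primary site is just before the newest
    backup finishes copying offsite: everything written since the previous
    backup was taken is gone.
    """
    return backup_interval + offsite_copy_time


def meets_rpo(backup_interval: timedelta,
              offsite_copy_time: timedelta,
              rpo: timedelta) -> bool:
    """True if the worst-case loss fits inside the stated RPO."""
    return worst_case_data_loss(backup_interval, offsite_copy_time) <= rpo


if __name__ == "__main__":
    rpo = timedelta(minutes=5)          # hypothetical business target
    copy_time = timedelta(minutes=2)    # hypothetical offsite copy time
    for interval_minutes in (1, 3, 5, 15):
        interval = timedelta(minutes=interval_minutes)
        loss = worst_case_data_loss(interval, copy_time)
        print(f"backup every {interval_minutes:>2} min: "
              f"worst-case loss {loss}, meets RPO: "
              f"{meets_rpo(interval, copy_time, rpo)}")
```

Under this model, a backup every five minutes does not meet a five-minute RPO once the copy time is counted, which is one reason tightening RPO tends to raise costs.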
Downtime
Asking an IT exec the downtime question usually gets a similar response: “we can sustain about five minutes of downtime a year,” which is the famous 99.999%, or “five nines,” of availability. Unlike a tight RPO, high availability (planned for with a related metric called recovery time objective, or RTO) costs significantly more to build. To achieve 99.999% uptime, layers of redundant systems and complexity are required. This redundancy isn’t just a little more expensive; it can be exponentially more costly. While this may make sense for a busy online retailer that can lose hundreds of thousands of dollars for every minute it is down, many organizations do not need this level of availability and can achieve high levels of availability with smaller budgets.
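The arithmetic behind the “nines” is simple enough to sketch. The snippet below is illustrative only; it assumes an average year of 365.25 days, and the function name is made up for this example.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumes an average-length year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability allows about {minutes:.2f} minutes/year")
```

Five nines works out to a little over five minutes a year, while three nines (99.9%) allows nearly nine hours. That gap is why it pays to ask which target the business actually needs before paying for the extra redundancy.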
To Cloud, or Not to Cloud
Where you host your data and systems is another part of the disaster recovery story. Cloud providers have multiple data centers within each geographic region, and numerous regions, making it easy for your application to span multiple locations. You also don’t necessarily need to worry about buying replacement hardware after a data center failure. The cloud can also offer an accessible secondary site for disaster recovery, even for organizations with their own data centers. However, cloud providers have their own outages, and you must plan for those too. Historically, cloud outages have been limited to a single cloud provider and a single region. If you are deploying to the cloud, you must still think about disaster recovery.
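To see why spanning multiple locations helps, here is a back-of-the-envelope sketch. It rests on two optimistic assumptions that are not claims about any real provider: sites fail independently of one another, and failover between them is instant and always works. The function name and the 99% figure are hypothetical.

```python
def combined_availability(site_availability: float, sites: int) -> float:
    """Availability of a service that is up if at least one site is up.

    Assumes independent site failures and perfect failover, so the result
    is an upper bound rather than a prediction.
    """
    if sites < 1:
        raise ValueError("need at least one site")
    return 1 - (1 - site_availability) ** sites


if __name__ == "__main__":
    single = 0.99  # hypothetical availability of one site
    for n in (1, 2, 3):
        print(f"{n} site(s): {combined_availability(single, n):.6f}")
```

Real deployments fall short of these numbers because failures are often correlated (a bad software update, for example, can hit every site at once) and failover itself can fail, which is why a multi-region design still needs a tested recovery plan.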
Learn How to Identify Weak Spots and Protect Your Organization
This post is the first in a series about disaster recovery. Future topics will include how to identify your weak points, the concept of blast radius, and steps you can take to protect your organization from various types of disasters. Hopefully, you are now familiar with the terms RPO and RTO and understand that the cloud can help with disaster recovery without being a panacea.
Joey D'Antoni is Principal Cloud Architect at DesignMind. He is a Microsoft Data Platform MVP. Joey blogs about technology of all kinds at joeydantoni.com and writes a monthly column for Redmond Magazine.