October 30, 2024 Joey D'Antoni

Disaster Recovery - Where are Your Weak Points?

Disaster recovery is a complex, multi-faceted part of IT.

It’s not enough for the database team to be able to fail over to a second site and restore from backup if the application team can’t do the same thing.

Likewise, it's not enough to protect your server or cloud environment from a natural disaster, if you don’t have a contingency plan for your employees to work in a second site or remotely in the same circumstances. This complexity is why disaster recovery is challenging.

When IT professionals speak of disasters, we think of a few different types of events—let’s highlight those, and look at each event type:

1. Natural Disasters (hurricane, earthquake, fire)
2. Ransomware attack/security event
3. Significant cloud outage (may be in conjunction with 1 or 2)
4. Major software event (e.g. CrowdStrike) affecting large numbers of computers

Natural Disasters

I grew up in New Orleans, spent the first decade of my career in North Carolina, and have supported sites in Puerto Rico, so I am all too familiar with IT disaster recovery planning for tropical storms and hurricanes. While those aren’t the only kinds of natural disasters, they are highly damaging: they can damage your data center and office and also cause lasting business disruption. Fires and earthquakes raise similar concerns. These disasters have limited planning windows and disrupt your employees' lives and property, so that disruption needs to be accounted for in your plans.

Many organizations based in areas where these disasters are prevalent choose to place their data centers in safer, remote locations to reduce risk. However, given the impacts of climate change around storms and fire, finding a single secure location is difficult. You may choose a secondary data center or cloud region for additional protection for critical systems.

Ransomware

In a recent security report, Microsoft reported that ransomware attacks tripled last year. While their success rate has decreased because of better security and protection tooling, these security threats are a never-ending concern for every industry.

Like natural disasters, ransomware attacks provide little warning—having a plan is crucial.

Ideally, that plan protects against techniques like lateral movement, which allow a threat actor to move unchecked throughout your network. The secondary defense here is always backups; you must ensure that your backups are immutable (which means they can’t be overwritten or changed) and air-gapped from the rest of your network. Restoring from backup is a slow operation, so you should test your complete restore process regularly to get a good idea of the timing, and evaluate storage solutions that can help expedite your IT disaster recovery process.
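To make the idea of timing your restores concrete, here is a minimal Python sketch of a restore drill. It assumes you already have some callable that performs a full restore into a test environment; the names `run_restore_drill`, `RestoreDrillResult`, and the recovery time target are hypothetical and only for illustration, not part of any particular backup product.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class RestoreDrillResult:
    """Outcome of one timed restore test (hypothetical structure)."""
    name: str
    elapsed_seconds: float
    target_seconds: float  # assumed recovery time target for this system

    @property
    def within_target(self) -> bool:
        # True when the restore finished inside the assumed target.
        return self.elapsed_seconds <= self.target_seconds


def run_restore_drill(
    name: str,
    restore: Callable[[], None],
    target_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> RestoreDrillResult:
    """Time a complete restore and compare it against the target.

    `restore` is whatever performs the real restore in your environment;
    this sketch only measures how long it takes.
    """
    start = clock()
    restore()
    elapsed = clock() - start
    return RestoreDrillResult(name, elapsed, target_seconds)


if __name__ == "__main__":
    # Placeholder restore step; replace with your actual restore procedure.
    result = run_restore_drill("example-db", lambda: time.sleep(0.1), 4 * 3600)
    status = "OK" if result.within_target else "TOO SLOW"
    print(f"{result.name}: {result.elapsed_seconds:.1f}s ({status})")
```

Recording these results on a schedule gives you a trend line, so you notice when growing data volumes push restore times past what the business can tolerate.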

Major Cloud Outages

One of the things I like to tell clients as they move into the cloud is that they will likely have higher uptime overall. Still, if there is an outage, it will likely be longer on average than an on-premises outage. The reasons for this are complicated, but the main one is that much redundancy is built into the major cloud providers, and the architecture is incredibly complex. That means when they do break something, it usually isn’t as easy as turning it off and back on again.

I’ve researched major cloud outages over time—the most common pattern is that a specific service gets taken down in a particular region. The one caveat is that if the service that goes down has other services dependent on it, it can take down a region—such a thing happened with a U.S. East Amazon Kinesis outage several years ago. This blast radius means your most critical apps should be multi-region, especially if the value the application adds is more than the cost of that redundancy.
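The "value versus cost of redundancy" comparison can be roughed out with simple arithmetic. The Python sketch below is one possible way to frame it, assuming you can estimate how often a regional outage hits you, how long it lasts, and what an hour of downtime costs. The function names and every number in the example are hypothetical inputs, not measured figures.

```python
def expected_annual_outage_cost(
    outages_per_year: float,
    hours_per_outage: float,
    cost_per_hour: float,
) -> float:
    """Expected yearly cost of single-region outages for one application."""
    return outages_per_year * hours_per_outage * cost_per_hour


def multi_region_is_worth_it(
    outages_per_year: float,
    hours_per_outage: float,
    cost_per_hour: float,
    annual_redundancy_cost: float,
) -> bool:
    """True when expected outage losses exceed the cost of a second region."""
    expected_loss = expected_annual_outage_cost(
        outages_per_year, hours_per_outage, cost_per_hour
    )
    return expected_loss > annual_redundancy_cost


if __name__ == "__main__":
    # Illustrative inputs only: one regional outage every two years,
    # eight hours long, for an app where downtime costs 50,000 per hour,
    # against 120,000 per year to run a second region.
    print(multi_region_is_worth_it(0.5, 8, 50_000, 120_000))
```

This deliberately ignores harder-to-quantify factors such as reputational damage or contractual penalties, so treat the result as a starting point for the conversation rather than a decision rule.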

Interestingly, there haven’t been many outages where having your data in multiple availability zones would have protected your workloads. However, there are many reasons to want that data split across multiple data centers, which is what availability zones give you. More on that in my next post.

CrowdStrike

When SQL Server came out on Linux, and Microsoft highlighted that you could run an Always On Availability Group across OS platforms, I wondered why anyone would want to add that complexity to their environment. Then, I thought about what would happen if some Windows update broke all Windows systems everywhere at once. And then I thought I was stupid because there was no way Microsoft would let that happen. Maybe those Linux people might break something.

Last July, CrowdStrike broke all the things.

I still view this as a low probability, black swan event, but if you are mapping out risks and trying to understand where you could have failures, this is a good exercise to work through, just because of how impactful the event was.

Disasters happen, and natural disasters are becoming more frequent as the climate changes. You might notice a trend throughout these topics: the importance of having a plan well in advance of any potential disaster. Your plan needs to be about more than your computers, and should cover things like contingency locations and the employees you need to sustain your operations if your home area is greatly impacted.

As I wrote at the beginning of this article, disaster recovery is a complex, multi-faceted operation. You need an up-to-date strategy. If you can’t remember the last time you tested restoring your backups, it’s probably a good time to do that. Remember the motto of the Boy Scouts: “Be Prepared.”

This post is the second in a series about disaster recovery.

Joey D'Antoni is Principal Cloud Architect at DesignMind. He is a Microsoft Data Platform MVP. Joey blogs about technology of all kinds at joeydantoni.com and writes a monthly column for Redmond Magazine.