If it’s August in Las Vegas, it’s time for the annual Black Hat cybersecurity conference. I expected AI to be a hot topic at this year’s show, and it certainly was. One theme that has had a bit of a resurgence is digital resiliency and disaster recovery. The topic was back in vogue because of the July incident when CrowdStrike released a faulty update to its Falcon Sensor cybersecurity software, causing more than 8 million Microsoft Windows systems to crash in what many are calling the most significant IT outage in history.
While in Vegas, I thought about some of the business continuity and disaster recovery (BCDR) recommendations that may have slipped our minds until the July mishap. They reminded me of a good take on the subject that Veeam published earlier this year, 10 Best Practices to Improve Recovery Objectives. Here are a few BCDR points from the white paper that I find especially relevant.
Use air-gapped backup storage
You need modern backup methods to protect your data from ransomware and other threats. Conventional storage may not do the job. Here are some options:
- Hardened repository: Use a Linux server and set the file system’s immutable attribute on backup files so they can’t be altered or deleted. Restrict access to the physical machines you use for backup and apply other safeguards. Use an on-host firewall with all unneeded ports blocked. If you use a hardware firewall, make sure it has enough throughput to avoid bottlenecks. Install no other applications on the machine, since each one can introduce new security risks to the server.
- Immutable object storage: To ensure backups can’t be deleted, your object storage must support immutable objects.
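Both options share one core property, write-once retention: once a backup is written, nothing can change or delete it until its retention period runs out. The toy Python model below illustrates only that rule. It is a hypothetical sketch (the class and method names are invented), not how a Linux hardened repository or any object storage service implements immutability; in practice the storage layer enforces it.

```python
from datetime import datetime, timedelta, timezone


class RetentionLockedStore:
    """Hypothetical in-memory model of write-once, read-many (WORM) backup storage.

    Illustration only: real immutability is enforced by the storage system
    (hardened repository or object storage), not by application code like this.
    """

    def __init__(self, retention: timedelta) -> None:
        self.retention = retention
        self._objects: dict = {}  # key -> (data, locked_until)

    def put(self, key: str, data: bytes, now: datetime) -> None:
        # Objects are write-once: an existing backup can never be overwritten.
        if key in self._objects:
            raise PermissionError(f"{key!r} already exists and is immutable")
        self._objects[key] = (data, now + self.retention)

    def get(self, key: str) -> bytes:
        return self._objects[key][0]

    def is_locked(self, key: str, now: datetime) -> bool:
        return now < self._objects[key][1]

    def delete(self, key: str, now: datetime) -> None:
        # Deletion is refused until retention expires, no matter who asks,
        # including an attacker holding stolen admin credentials.
        if self.is_locked(key, now):
            locked_until = self._objects[key][1]
            raise PermissionError(f"{key!r} is locked until {locked_until.isoformat()}")
        del self._objects[key]


if __name__ == "__main__":
    t0 = datetime(2024, 8, 1, tzinfo=timezone.utc)
    store = RetentionLockedStore(retention=timedelta(days=30))
    store.put("job-2024-08-01", b"backup data", now=t0)
    try:
        store.delete("job-2024-08-01", now=t0 + timedelta(days=1))
    except PermissionError as err:
        print("Delete blocked:", err)
    store.delete("job-2024-08-01", now=t0 + timedelta(days=31))  # retention has expired
```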
Test your backups for recoverability
Too many things can go wrong, and any one of them can cost you critical data. Backups are not a set-it-and-forget-it technology. Testing them is the IT version of the old carpenter’s adage, “Measure twice, cut once.” You can’t afford to take shortcuts when it comes to knowing how completely, and how quickly, you can recover data from your backup systems. Run regular health checks on your backups, too. If the underlying storage is degrading and silently corrupting data (“bit rot”), you’ll have a massive problem when you try to restore.
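One way to run such a health check is to record a checksum for every backup file when the job finishes, then re-verify those checksums on a schedule. The sketch below assumes backups are ordinary files in a directory and uses SHA-256 from Python’s standard library; the function names are hypothetical, and backup products typically ship their own verification features that you would use instead.

```python
from __future__ import annotations

import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large backup files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(backup_dir: Path) -> dict[str, str]:
    """Hypothetical helper: record a checksum for every file right after a backup job."""
    return {
        p.relative_to(backup_dir).as_posix(): sha256_of(p)
        for p in sorted(backup_dir.rglob("*"))
        if p.is_file()
    }


def verify_manifest(backup_dir: Path, manifest: dict[str, str]) -> list[str]:
    """List files that are missing or whose contents changed since the manifest was built."""
    problems = []
    for rel_path, expected in manifest.items():
        path = backup_dir / rel_path
        if not path.is_file():
            problems.append(f"missing: {rel_path}")
        elif sha256_of(path) != expected:
            problems.append(f"corrupted: {rel_path}")
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "job1.bak").write_bytes(b"full backup")
        manifest = build_manifest(root)
        (root / "job1.bak").write_bytes(b"full backuq")  # simulate silent corruption
        print(verify_manifest(root, manifest))  # ['corrupted: job1.bak']
```

Keep the manifest somewhere other than the backup repository; otherwise the same fault or attacker that damages the backups can also alter the checksums.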
There’s no such thing as backup and restore hardware that’s “too fast”
The essential data backup and restoration metrics are recovery point objective (RPO) and recovery time objective (RTO). RPO is the maximum amount of data, measured as time since the last usable backup, that your organization can afford to lose. RTO is the maximum time your business can be offline before systems must be restored. These two objectives define how you should build your backup strategy, how often backup jobs will run, and what type of backup you need.
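To make the two definitions concrete, here is a minimal Python sketch that checks a backup schedule against RPO and RTO targets. It assumes each backup job captures data as of the moment it starts, so the worst case is a failure just before the next job finishes: you lose one full interval plus one job’s duration. The names (RecoveryObjectives, worst_case_data_loss and so on) and the example numbers are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjectives:
    rpo: timedelta  # most data, measured in time, the business can afford to lose
    rto: timedelta  # longest the business can afford to be offline


def worst_case_data_loss(interval: timedelta, job_duration: timedelta) -> timedelta:
    # Assumption: each job captures data as of its start time. If disaster strikes
    # just before the next job finishes, the newest usable restore point is one full
    # interval plus one job duration old.
    return interval + job_duration


def max_backup_interval(objectives: RecoveryObjectives, job_duration: timedelta) -> timedelta:
    # Longest gap between job starts that keeps worst-case loss within the RPO.
    return objectives.rpo - job_duration


def meets_objectives(
    objectives: RecoveryObjectives,
    interval: timedelta,
    job_duration: timedelta,
    measured_restore_time: timedelta,
) -> dict[str, bool]:
    return {
        "rpo": worst_case_data_loss(interval, job_duration) <= objectives.rpo,
        "rto": measured_restore_time <= objectives.rto,
    }


if __name__ == "__main__":
    targets = RecoveryObjectives(rpo=timedelta(hours=4), rto=timedelta(hours=8))
    # Hypothetical schedule: backups every 6 hours, each taking 1 hour;
    # a test restore took 5 hours.
    print(meets_objectives(targets, timedelta(hours=6), timedelta(hours=1), timedelta(hours=5)))
    # {'rpo': False, 'rto': True}
    print(max_backup_interval(targets, timedelta(hours=1)))  # 3:00:00
```

In this example the restore is fast enough, but the schedule is not: a 4-hour RPO with 1-hour jobs calls for a backup at least every 3 hours.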
To achieve your organization’s unique RPO and RTO goals, you need hardware that’s fast enough to handle the job. This applies to both the backup hardware and your production environment. Here are some key details:
1) Make sure you have the required bandwidth in place for fast, accurate restoration—before you need it.
2) Pick the optimal transport method for your business needs. Options include:
- Backup from storage snapshots
3) Testing will show whether you already have what these critical capabilities require or need to beef things up before a real incident.
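A back-of-the-envelope estimate helps with step 1. Restore speed is set by the slowest link in the path: reading from backup storage, moving data across the network, or writing to production storage. The sketch below assumes a flat 70% efficiency factor and ignores compression and deduplication. Treat both as placeholder assumptions to replace with numbers from your own restore tests; the function names are hypothetical.

```python
from datetime import timedelta


def restore_time(
    data_bytes: float,
    backup_read_gbps: float,
    network_gbps: float,
    production_write_gbps: float,
    efficiency: float = 0.7,
) -> timedelta:
    """Hypothetical estimate of restore time from the slowest link in the restore path.

    `efficiency` is an assumed fraction of nominal throughput actually achieved
    (protocol overhead, contention); measure it in a real restore test.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bottleneck_gbps = min(backup_read_gbps, network_gbps, production_write_gbps)
    seconds = data_bytes * 8 / (bottleneck_gbps * 1e9 * efficiency)
    return timedelta(seconds=seconds)


def required_gbps(data_bytes: float, rto: timedelta, efficiency: float = 0.7) -> float:
    """Minimum bottleneck throughput (Gbit/s) needed to restore `data_bytes` within the RTO."""
    return data_bytes * 8 / (rto.total_seconds() * efficiency) / 1e9


if __name__ == "__main__":
    ten_tb = 10e12  # 10 TB (decimal), a hypothetical dataset size
    # Backup storage reads at 20 Gbit/s, the network carries 10 Gbit/s, and
    # production storage writes at 5 Gbit/s, so production is the bottleneck.
    print(restore_time(ten_tb, 20, 10, 5))
    print(f"{required_gbps(ten_tb, timedelta(hours=4)):.1f} Gbit/s needed for a 4-hour RTO")
```

The point of taking the minimum is the one the section makes: a fast backup appliance does not help if the production environment cannot absorb the data as quickly.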
All restore modes are not created equal
If you need to recover data or entire machines, you have many recovery modes to choose from. To pick the right one, answer these three questions:
- What do you need to restore?
- What’s the purpose of the recovery?
- How much time do you have?
The most important thing to keep in mind with a restoration mission is to restore the damaged or missing data without overwriting otherwise good and more recent data. For example, if you need to restore an operating system after a failed OS or application update, you don’t want to restore all drives, because you’d end up overwriting data that’s more current than the backup. Instead, restore only the affected drive.
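The selective-restore rule in that example can be written down as a simple check. This is a hypothetical sketch: the Volume type and plan_restore function are invented for illustration, and real backup tools expose volume-level and file-level restore through their own interfaces.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass
class Volume:  # hypothetical description of one drive on the machine
    name: str
    affected: bool        # damaged by the incident (e.g., the failed update)?
    last_write: datetime  # most recent write to this volume


def plan_restore(volumes: list[Volume], backup_time: datetime) -> tuple[list[str], list[str]]:
    """Select only affected volumes; leave healthy ones alone so newer data survives."""
    to_restore: list[str] = []
    warnings: list[str] = []
    for vol in volumes:
        if not vol.affected:
            continue  # restoring this volume would overwrite data newer than the backup
        to_restore.append(vol.name)
        if vol.last_write > backup_time:
            warnings.append(f"{vol.name} has writes after {backup_time:%Y-%m-%d %H:%M} that will be rolled back")
    return to_restore, warnings


if __name__ == "__main__":
    backup = datetime(2024, 7, 18, 0, 0)
    vols = [
        Volume("C:", affected=True, last_write=datetime(2024, 7, 19, 6, 0)),   # OS drive
        Volume("D:", affected=False, last_write=datetime(2024, 7, 19, 9, 30)), # data drive
    ]
    print(plan_restore(vols, backup))
```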
Does your organization have a malware or ransomware attack response plan? Achieving your RTO goals will be much easier if you have already tested an attack response plan before you need to use it. Get expert help before you try to recover your data. Doing so might be all that stands between an effective recovery and one that fails to restore affected machines to a point in time before the malware infiltration.
Plan. And then plan some more
I’ve covered only a few of the ten in-depth recovery scenarios and requirements that Veeam presents in its white paper. To learn more, download the white paper for all the details. I’ll leave you with one final thought on this crucial subject.
It’s vital to avoid what Veeam calls “chicken-egg” issues. For example, say you’ve done the right thing and encrypted all your backups. Great job. But if disaster strikes and the backup server is lost, you must restore from the backup storage I discussed earlier. So you go to your password application to get the key to decrypt the backups, only to experience the nightmare scenario: the password app was running on one of the failed servers.
To prevent this frustrating scenario, keep your password app and its essential data safe but accessible. You could print the information and store it away from your servers in a secure, fireproof safe. Or you could work with experts in the field to set up a cloud-based solution that gives you secure access to that priceless password information.
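A chicken-egg problem is a circular dependency in your recovery path, so you can check for one before disaster strikes. The sketch below assumes you list, for each recovery step or system, what it needs in order to work, then runs a depth-first search to find any cycle. The dependency names are hypothetical examples modeled on the scenario above.

```python
from __future__ import annotations


def find_cycle(deps: dict[str, list[str]]) -> list[str] | None:
    """Return one circular dependency as a path (first node repeated at the end), or None."""
    in_progress: set[str] = set()  # nodes on the current depth-first path
    done: set[str] = set()         # nodes fully explored and known to be cycle-free
    path: list[str] = []

    def visit(node: str) -> list[str] | None:
        in_progress.add(node)
        path.append(node)
        for needed in deps.get(node, []):
            if needed in in_progress:
                # Reached a node already on the current path: that stretch is a loop.
                return path[path.index(needed):] + [needed]
            if needed not in done:
                cycle = visit(needed)
                if cycle:
                    return cycle
        path.pop()
        in_progress.discard(node)
        done.add(node)
        return None

    for start in deps:
        if start not in done:
            cycle = visit(start)
            if cycle:
                return cycle
    return None


# Hypothetical recovery dependencies: each key needs every item in its list to work.
recovery_deps = {
    "decrypt backups": ["encryption key"],
    "encryption key": ["password manager"],
    "password manager": ["app server 2"],
    "app server 2": ["decrypt backups"],  # this server must itself be restored from backups
}

if __name__ == "__main__":
    print(find_cycle(recovery_deps))
    # ['decrypt backups', 'encryption key', 'password manager', 'app server 2', 'decrypt backups']
```

Pointing the key at an offline copy instead, for example `{**recovery_deps, "encryption key": ["printed copy in fireproof safe"]}`, breaks the loop, and `find_cycle` returns `None`.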
Don’t wait for the next CrowdStrike-level outage, or for the fire or flood you were sure would never affect your operations, to design and implement your BCDR plan. Set a goal to have your plan operational, and tested, well before next year’s Black Hat conference.
Zeus Kerravala is the founder and principal analyst with ZK Research.
(Read his other Network Computing articles here.)