How to Achieve Disaster Recovery Automation?

Automation is vital to disaster recovery because it shortens recovery time and reduces the risk of human error. Customizable platforms, careful mapping of business systems, and thorough testing are the key factors in a successful disaster recovery solution.


Updated by Dan Zeng on 2024/01/05

Table of contents
  • Automated Disaster Recovery Platform Selection

  • How to identify and resolve split-brain in disaster recovery dual-active architectures?

  • How can application-level failover be automated on a virtualization platform?

  • How to ensure RTO with numerous disaster recovery systems?

  • Automated Disaster Recovery for Virtualization Platform Applications

  • How can automated switching cover applications?

  • Achieving Complete App Switchover via Disaster Recovery Automation

  • How to consider disaster recovery solutions?

  • How should you choose the data synchronization method for two centers?

  • How frequent is the disaster recovery exercise switchover?

  • Protect VM with Vinchin Backup & Recovery

  • Conclusion


In this bank's disaster recovery build-out, a "two-site, three-center" solution was adopted, with slight differences between the two locations. Two data centers were built in the same city and connected with Layer 2 network technology. The main goal is to run applications in dual-active mode within the city, so that a system can run in either server room.

However, because the new core banking system was launched only recently, application-level dual-active technology, which requires modifying the applications, has not been adopted. In addition, database log shipping can lose data, so it cannot guarantee a recovery point objective (RPO) of zero in the event of a disaster.

Therefore, storage replication was adopted, since it requires the least change to the existing technology stack and has the least impact on it. Activating the various resources at the disaster recovery site during a switchover may lengthen the recovery time objective (RTO), but the design aims to keep the RPO at zero.

Automated Disaster Recovery Platform Selection

To minimize RTO while still guaranteeing RPO, disaster recovery operations must overcome three main problems: cross-departmental command and coordination, monitoring the progress of the switchover, and maintaining a large number of procedures. Disaster recovery automation is an important way to make switchover safe and orderly. It improves efficiency, shortens RTO, and reduces the risk of human error.

However, implementing automated disaster recovery is not easy, especially for complex business systems. Banking systems, for example, require painstaking work to map out the business systems and assess their relative importance, and the disaster recovery plan has to be refined continually through real switchover exercises. When choosing a disaster recovery automation platform, emphasize the following points:

1. The platform can be customized flexibly to meet the needs of different business systems.

2. It provides a simulated switchover function to preview the switchover process and identify potential problems.

3. It provides logging and switchover traceability to track issues during a switchover and support later improvements.

4. It has a friendly graphical user interface that allows complex systems to be customized simply by dragging and dropping icons.

Because each company has different business systems and IT technologies, a generic commercial product rarely fits every situation, so it is crucial to keep improving the product as problems surface in use. The selection should also take into account how well the product supports the company's existing IT technology and how it performs, so that it meets both current needs and future development.

How to identify and resolve split-brain in disaster recovery dual-active architectures?

Split-brain is a problem that requires special attention in dual-active or multi-active disaster recovery architectures. It typically occurs in high-availability systems when multiple nodes each consider themselves the master node. In dual-active architectures, data-center-level switching usually requires manual intervention, because automated switching is risky: monitoring and switching tools can fail or misjudge. To prevent split-brain, administrators typically designate the primary and backup nodes manually and make sure that the failed node is no longer processing transactions. Blocking traffic to the failed node or attempting to shut it down is also possible, but it does not always succeed.

Site-level disasters have to be judged from multiple levels of monitoring. Site-level decisions should focus on the impact on business transactions, because technical failure scenarios vary widely and their business impact is hard to determine immediately. To reduce misjudgments by business monitoring, independent monitoring can be introduced and cross-checked against customer feedback. Monitoring must remain the primary means of assessment, however, because business feedback is often slow and imprecise.

In a dual-active or multi-active disaster recovery architecture, handling data link problems usually requires a third-party arbitration mechanism to distinguish a link problem from a site disaster. Third-party arbitration typically uses an arbitration host that communicates with each site via iSCSI or other means. Fully automated determination of site-level disasters is not recommended because of the high potential risk and the complexity of the decision. It requires multiple levels of monitoring plus human intervention to ensure a timely and accurate response.
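
The arbitration logic described above can be illustrated with a minimal sketch. It assumes three hypothetical boolean inputs (whether this site can reach the peer, whether it can reach the arbitration host, and whether the arbitration host reports that it can reach the peer). The names `arbitrate`, `may_promote`, and `Verdict` are illustrative and do not come from any particular product. In line with the recommendation above, promotion still requires an operator's confirmation.

```python
from enum import Enum


class Verdict(Enum):
    HEALTHY = "healthy"
    LINK_FAILURE = "inter-site link failure"
    PEER_SITE_DOWN = "peer site suspected down"
    SELF_ISOLATED = "local site isolated"


def arbitrate(peer_reachable: bool, witness_reachable: bool,
              witness_sees_peer: bool) -> Verdict:
    """Classify a failure as seen from one site (hypothetical inputs).

    peer_reachable:    this site can reach the other site directly
    witness_reachable: this site can reach the arbitration host
    witness_sees_peer: the arbitration host reports it can reach the other site
    """
    if peer_reachable:
        return Verdict.HEALTHY
    if not witness_reachable:
        # Neither the peer nor the arbiter is visible: assume this site is
        # the isolated one, so it must not promote itself.
        return Verdict.SELF_ISOLATED
    if witness_sees_peer:
        # The peer is alive but unreachable from here: the link is broken.
        return Verdict.LINK_FAILURE
    return Verdict.PEER_SITE_DOWN


def may_promote(verdict: Verdict, operator_confirmed: bool) -> bool:
    """Allow promotion only for a suspected site outage that a human confirmed."""
    return verdict is Verdict.PEER_SITE_DOWN and operator_confirmed
```

A site that loses both the peer and the arbiter never promotes itself in this sketch, which is the property that keeps two sites from both acting as master.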

How can application-level failover be automated on a virtualization platform?

Consider an environment that uses virtualization and plans to synchronize VM files to the disaster recovery center with storage replication. If the primary center fails, the virtualization platform must be started manually, and global load balancing is then used for failover. For safety reasons, no automatic failover method has been settled on yet, because unnecessary disaster recovery switchovers should be avoided.

There are two suggestions for this situation:

1. Monitor both the business and the physical machines that have been virtualized. When monitoring detects a business failure, analyze it with the ELK platform and a failure analysis platform, then trigger disaster-recovery-level failover based on the monitoring and analysis results. The automation platform is responsible for the underlying switchover, including configuring the network and load-balancing equipment.

2. The VM switching scheme raises two problems. First, does the virtual machine keep its previous IP address after switching, and if so, how are IP conflicts resolved? Second, data replication is delayed, so it must be clear which data lags and whether the business can tolerate the delay. In a real failure, it may not be possible to resynchronize the data. It is advisable to approach continuity from a business perspective rather than relying solely on application or VM switching. Full VM migration is complex and difficult, has a high failure rate, and takes a long time.
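
One way to avoid unnecessary switchovers when monitoring drives the trigger is to propose failover only after several independent monitors agree for several consecutive rounds. The sketch below is an assumption about how such a guard might look. The class name `FailoverTrigger`, the monitor names, and the default thresholds are all hypothetical and should be tuned to the environment.

```python
from collections import deque


class FailoverTrigger:
    """Propose failover only after sustained, corroborated failure reports."""

    def __init__(self, required_consecutive: int = 3, required_sources: int = 2):
        self.required_consecutive = required_consecutive
        self.required_sources = required_sources
        # Keep only the most recent rounds needed for the decision.
        self.history = deque(maxlen=required_consecutive)

    def observe(self, failing_sources: set[str]) -> bool:
        """Record one monitoring round; return True when failover should be proposed.

        failing_sources: names of the independent monitors that currently
        report a business failure (hypothetical names).
        """
        self.history.append(len(failing_sources) >= self.required_sources)
        return (len(self.history) == self.required_consecutive
                and all(self.history))


trigger = FailoverTrigger()
for round_result in [{"business", "host"}, {"business", "host"}, {"business", "host"}]:
    proposed = trigger.observe(round_result)
# After three corroborated rounds, `proposed` is True. A single clean round
# afterwards resets the decision.
```

The result is a proposal, not an action: it can feed the manual decision described above rather than start the switchover itself.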

How to ensure RTO with numerous disaster recovery systems?

When there are numerous disaster recovery systems, the following measures, identified during requirements analysis, help ensure RTO and improve communication efficiency:

1. Tiered business management: Manage applications in tiers based on the importance and urgency of the business. Ensure fast recovery for high-tier applications and tolerate longer recovery times for low-tier applications.

2. Integration of disaster recovery scenarios and failure domains: To reduce cost and complexity, merge disaster recovery scenarios and failure domains. For example, designing a single whole-server recovery scenario removes the need to consider each component failure separately. Fast switching of server room modules or data-center-level switching can also be considered to reduce dependence on a single server room.

3. Modularization of business systems: To reduce RTO, modularize business systems so that they can be switched rapidly through modular process orchestration. This improves switchover efficiency and limits the impact on the system as a whole.

4. Disaster preparedness command platform: Establish a disaster preparedness command platform to improve cross-departmental communication efficiency and reduce communication costs.
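
The tiered management in item 1 can be sketched as a small planning helper. This is an illustration only: the `App` fields, the example applications, and all the numbers are hypothetical, and the estimated recovery times would in practice come from switchover exercises.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    tier: int                  # 1 = most critical (hypothetical convention)
    rto_minutes: int           # agreed recovery time objective
    est_recovery_minutes: int  # measured or estimated in exercises


def recovery_plan(apps: list[App]) -> tuple[list[str], list[str]]:
    """Return (recovery order, apps whose estimated recovery exceeds their RTO)."""
    # Recover the most critical tier first; within a tier, the tightest RTO first.
    ordered = sorted(apps, key=lambda a: (a.tier, a.rto_minutes))
    at_risk = [a.name for a in ordered if a.est_recovery_minutes > a.rto_minutes]
    return [a.name for a in ordered], at_risk


apps = [
    App("reporting", tier=3, rto_minutes=240, est_recovery_minutes=120),
    App("core banking", tier=1, rto_minutes=30, est_recovery_minutes=45),
    App("payments", tier=1, rto_minutes=15, est_recovery_minutes=10),
]
order, at_risk = recovery_plan(apps)
# order   -> ["payments", "core banking", "reporting"]
# at_risk -> ["core banking"]
```

Listing the applications whose estimate exceeds the objective shows where modularization or further automation would pay off first.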

Automated Disaster Recovery for Virtualization Platform Applications

An automated disaster recovery solution requires the following:

1. Clarify switchover conditions and plan the process: Determine the conditions that trigger automatic switchover through the monitoring system, such as a network failure or a storage failure. Plan the switchover process, including network, storage, and application switchover, so that it stays controllable.

2. Use disaster recovery automation software: Use automated switchover software that supports network and storage interfaces as well as platform interfaces (e.g., OpenStack, vCenter, PowerVC, HMC), so that all critical devices are included and can be switched.

3. Planning and mapping: Before implementation, plan and map out the business systems, including their dependencies and their network and storage structure, so that the switchover process can be defined in detail.

4. Sort out the operation steps: Lay out the steps of the switchover, covering storage, operating system, database, application, and network. Each step needs to be clearly defined and drawn into a flow chart for deployment into the automated switchover software.

5. Exercise and improve: Conduct regular exercises to verify that switchover is feasible, and adjust the process according to the results so that it stays smooth and controllable.
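
The ordered steps in item 4 and the exercises in item 5 can be sketched as a small step runner. This is a simplified illustration rather than the interface of any real switchover product. The function `run_switchover` and the placeholder actions are hypothetical, and each real action would call the relevant storage, platform, or network interface.

```python
from typing import Callable

# A step is a (name, action) pair; the action returns True on success.
Step = tuple[str, Callable[[], bool]]


def run_switchover(steps: list[Step], dry_run: bool = False) -> list[tuple[str, str]]:
    """Run steps in order, stop at the first failure, and return a log.

    With dry_run=True no action is executed, which allows the sequence to be
    previewed before an exercise.
    """
    log: list[tuple[str, str]] = []
    for name, action in steps:
        if dry_run:
            log.append((name, "skipped (dry run)"))
            continue
        ok = action()
        log.append((name, "ok" if ok else "failed"))
        if not ok:
            break  # Later steps depend on earlier ones, so do not continue.
    return log


# Placeholder actions standing in for real switchover operations.
steps: list[Step] = [
    ("storage", lambda: True),
    ("operating system", lambda: True),
    ("database", lambda: True),
    ("application", lambda: True),
    ("network", lambda: True),
]
preview = run_switchover(steps, dry_run=True)
result = run_switchover(steps)
```

The returned log gives each exercise a record of which step failed, which supports the adjustments described in item 5.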

How can automated switching cover applications?

The following methods help automated switching cover applications:

1. Application standardization: Standardize applications. Use standard directories and consistently named scripts for step-by-step or one-click start/stop. This simplifies onboarding applications and managing them uniformly.

2. Generic scripts: Create generic scripts for starting, stopping, and checking the status of applications. The generic scripts live on the server automation platform and are invoked by the disaster preparedness command platform. Application-specific scripts are stored locally on each server and are called by the generic scripts. The generic script thus manages the start/stop logic, while the application-specific scripts only need to supply specific parameters, which reduces the workload of operations and maintenance staff.

3. Standardized parameters: For standardized applications, operations and maintenance staff provide the necessary parameters, such as the application path and log path. The generic scripts generate the start/stop commands from these parameters, which simplifies operation.
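
A minimal sketch of the generic-script idea follows. It assumes a hypothetical directory convention in which every standardized application ships `start.sh`, `stop.sh`, and `status.sh` under `<app_path>/bin` and accepts a `--log-dir` option. Neither the layout nor the option comes from the article, so both should be replaced with the site's own standard. The function only builds the command and leaves execution to the automation platform.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

ALLOWED_ACTIONS = {"start", "stop", "status"}


@dataclass
class AppParams:
    """Parameters supplied by operations staff for one standardized application."""
    name: str
    app_path: str
    log_path: str


def build_command(params: AppParams, action: str) -> list[str]:
    """Build the command line for an action from the standardized parameters.

    Assumes the hypothetical layout <app_path>/bin/<action>.sh.
    """
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    script = PurePosixPath(params.app_path) / "bin" / f"{action}.sh"
    return [str(script), "--log-dir", params.log_path]


teller = AppParams(name="teller", app_path="/opt/apps/teller", log_path="/var/log/teller")
command = build_command(teller, "start")
# command -> ["/opt/apps/teller/bin/start.sh", "--log-dir", "/var/log/teller"]
```

Because the logic lives in one place, onboarding another standardized application only means supplying a new `AppParams` record.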

Achieving Complete App Switchover via Disaster Recovery Automation

When applications have different disaster recovery levels and depend on one another, the following methods can be used:

1. Classification of disaster recovery levels: Classify applications into different levels according to their importance.

2. Dependency mapping: Determine the interdependencies between applications.

3. Self-contained application switching: Ensure that module switching within an application does not depend on other applications.

4. Logical application switching: Switch applications with dependencies as a whole.

5. Fine-grained partitioning: Split complex applications into small modules and switch them one by one.
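
Once the interdependencies in item 2 are known, a switchover order can be derived with a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later). The application names and the dependency map are hypothetical examples, and `switch_batches` is an illustrative helper name.

```python
from graphlib import TopologicalSorter


def switch_batches(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group applications into batches that can be switched together.

    depends_on maps each application to the applications it needs to be
    running first. Every batch depends only on earlier batches.
    graphlib raises CycleError if the dependencies are circular.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    batches: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable, readable order
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical example: web needs app, app and report both need the database.
dependencies = {"web": {"app"}, "app": {"db"}, "report": {"db"}}
batches = switch_batches(dependencies)
# batches -> [["db"], ["app", "report"], ["web"]]
```

A circular dependency surfaces as an error at planning time, which is a useful signal that the applications involved should be switched as one logical unit, as item 4 describes.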

How to consider disaster recovery solutions?

1. When selecting a disaster recovery solution for services that the public depends on and that must run continuously, storage-level synchronization, dual-active technology, or application-level active-active (AA) mode can be used.

2. Building an automated disaster recovery platform requires an appropriate investment of staff and resources. How much depends on the scale and complexity of the project.

To build a similar disaster recovery platform, prepare as follows. First, conduct a technical assessment to settle the technology stack and the disaster recovery solution to be adopted. Second, ensure that technical staff are fully involved, to avoid being limited by one person's technical ability. Finally, conduct comprehensive testing, including functional and stress tests, to ensure the reliability and performance of the disaster recovery system.

How should you choose the data synchronization method for two centers?

When building a second data center in the same city, choosing the right data synchronization method is crucial. Storage replication and database logical replication are the common choices. Storage replication is stable and reliable and is suitable for real-time synchronization, while the logical replication built into the database offers high availability in daily operation and maintenance and has a relatively simple switching process. Choose the method that best fits the business needs and data consistency requirements.

How frequent is the disaster recovery exercise switchover?

The frequency and scope of switchover in disaster recovery drills depend on actual needs. Drills are usually held every six months or once a year, and either the entire data center or specific business systems can be switched. Related business systems need a coordinated plan covering data synchronization, network configuration, and business process adjustments to ensure consistency and a smooth transition.

Protect VM with Vinchin Backup & Recovery

While upgrading disaster recovery capabilities, you can also strengthen data protection and backup. With virtualization technology, VM backups can be created more efficiently to ensure data security and recoverability.


Vinchin Backup & Recovery is a backup solution designed for virtual machines on VMware, Hyper-V, XenServer, XCP-ng, oVirt, RHV, etc. For disaster recovery, Vinchin provides a solution that recovers data from any backup point in seconds and resumes business operations in minutes.

In addition, Vinchin provides comprehensive VM backup and recovery features, such as agentless backup, instant recovery, and V2V migration, designed to protect and manage critical data in the virtualization environment.

Using Vinchin Backup & Recovery takes just a few steps:

1. Select VMs on the host.

2. Select the backup destination.

3. Select strategies.

4. Submit the job.

Vinchin offers a free 60-day trial for users to experience the functionality in a real-world environment. For more information, please contact Vinchin directly or contact our local partners.

Conclusion

Implementing an automated disaster recovery solution is essential for minimizing recovery time objectives and ensuring business continuity. Properly selected tools and strategies enable smoother and more efficient disaster recovery operations.


Categories: Disaster Recovery