Disaster Protection Primer: Achieving Fault and Disaster Tolerance in a Microsoft Windows Environment
by Margaret M. Kelleher

Business continuity planning is moving to the top of everyone’s priority list. As an IT manager, you are faced with two opposing challenges. Your CEO wants reassurance that the company will operate smoothly through emergencies that range from rolling blackouts to biblical flood, fire, and pestilence. At the same time, every penny in your IT budget is being carefully scrutinised. The pressure is on you to find cost-effective ways to protect data and applications under a wide range of adverse conditions.

Choosing the right disaster protection for your Microsoft® Windows® environment can be a complex and potentially risky proposition. Do you save money with a low-cost alternative such as tape backup and risk losing large amounts of data? Or do you invest in more advanced technologies, such as a fault- and disaster-tolerant solution? The answers may not be obvious.

Begin the decision-making process with a clear understanding of your disaster protection needs, your current staff skill set, and your budget. Prioritise your needs as “must-have,” “nice-to-have,” and “wish list.” The following questions will help with the needs assessment:

- Do you need to protect data, applications, or both data and applications?

- How much data can you afford to lose during a failover or recovery process?

- Does data loss put you at risk of reduced customer satisfaction? Regulatory non-compliance?

- Is your most critical data “live” and continuously changing, or archived?

- How much downtime is acceptable in your organisation?

- How much does a minute of downtime cost you? (A rough way to estimate this follows the list.)

- What is the impact of your downtime on your customers?

- How well is your staff trained and equipped to implement and manage technologies requiring a special skill set, such as two-way clusters?

- Does every member of your IT staff have a clear understanding of his/her role in the event of a disaster?

- Are there “politically critical” applications that senior management needs you to keep up and running no matter what?
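
To put numbers behind the downtime-cost question above, here is a minimal back-of-the-envelope sketch in Python. The revenue, headcount, and hourly-rate figures are purely illustrative assumptions, not benchmarks:

    # Back-of-the-envelope downtime cost estimate; every figure here is hypothetical.
    hourly_revenue = 50_000.0       # revenue flowing through the affected systems per hour
    affected_staff = 120            # employees idled by the outage
    loaded_hourly_rate = 45.0       # average loaded cost per employee-hour

    revenue_per_minute = hourly_revenue / 60
    productivity_per_minute = affected_staff * loaded_hourly_rate / 60
    cost_per_minute = revenue_per_minute + productivity_per_minute

    print(f"Estimated cost of one minute of downtime: ${cost_per_minute:,.2f}")
    print(f"A 30-minute failover would cost roughly ${30 * cost_per_minute:,.2f}")

Even with conservative assumptions, this kind of arithmetic makes it easier to compare the price of a protection technology against the cost of the downtime it prevents.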

When creating a business continuity plan, be sure to consider the physical and emotional toll that disaster recovery may take on the IT staff. A low-end solution may save money in the short term, but recovery with such a solution may demand unrealistic levels of work from employees. You may be relying on employees affected by a disaster to string LAN cabling; locate, purchase, install, and configure replacement hardware; clone servers from backups; and get applications restarted. You will also face the challenge of sorting through data that may have been lost, duplicated, or corrupted.

Once you have a clear picture of your needs, evaluate the strengths and weaknesses of the four leading categories of disaster protection: tape backup, off-site data archiving, remote replication and restart, and fault- and disaster-tolerant technology.

Tape Backup

At the low end of the disaster protection spectrum is tape backup. With this technology, data on all critical servers is routinely copied to tape. The tapes are shipped to a remote location for storage. In the event of a disaster, the stored data can be read back onto replacement servers.

This option is for companies that want to protect data that is not “live” or subject to continuous change. It is often used in combination with other solutions as a means of archiving data. The benefits of tape backup are that it is relatively inexpensive, provides a basic level of protection for archived data, and does not require specialised skills to implement or manage. However, tape backup has three significant weaknesses: it does not protect applications; it loses any data written between the last backup and the disaster; and the backup process is often cumbersome and time consuming.

To restore from tape, replacement server hardware must be obtained, which may require a costly last-minute purchase or rental, if suitable hardware is available at all. The hardware must then be fully configured, loaded with applications, and restored from tape before service can resume. Physical backup of a complete system image is relatively simple and fast, but partial restoration from it is very difficult. Logical backup is more time consuming and harder to administer. Data written after the last backup and before the disaster may never be recovered.
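
As a rough illustration of both weaknesses, the sketch below assumes a hypothetical nightly backup schedule and hypothetical recovery-task durations; none of the figures are benchmarks:

    from datetime import datetime, timedelta

    # Hypothetical nightly backup at 02:00; the disaster strikes mid-afternoon.
    last_backup = datetime(2024, 5, 6, 2, 0)
    disaster = datetime(2024, 5, 6, 15, 30)
    data_loss_window = disaster - last_backup    # everything written since the last tape is gone

    # Illustrative recovery tasks, in hours, before service can resume.
    recovery_tasks = {
        "procure replacement hardware": 24,
        "configure OS and applications": 8,
        "restore data from tape": 6,
        "verify data and reconnect users": 2,
    }
    recovery_time = timedelta(hours=sum(recovery_tasks.values()))

    print(f"Data written in the last {data_loss_window} is unrecoverable")   # 13:30:00
    print(f"Estimated time before service resumes: {recovery_time}")         # 1 day, 16:00:00

The point of the exercise is that with tape alone, both the data-loss window and the recovery time are measured in hours or days rather than minutes.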

Off-Site Data Archiving

A more advanced way to protect data is through off-site data archiving. This solution replicates data on an ongoing basis to a remote location over a shared or private IP-based LAN, WAN, or SAN connection. Because replication is continuous, it offers significantly more data protection than tape backup. It also eliminates the need to acquire and configure hardware after a disaster.

This solution has drawbacks. In-flight transactions between the primary server and the backup are lost during a fault or failover. As with tape backup, there is a gap in protection between the last write that was successfully replicated and the moment of the disaster. In addition, these solutions are vulnerable to problems in the network, and the option requires investment in off-site facilities and hardware.
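
The sketch below is a toy model of why in-flight data is exposed: writes are acknowledged locally and shipped to the remote site afterwards, so anything still queued when the primary site is lost never arrives. The class and method names are illustrative, not any vendor’s API:

    import queue

    class AsyncReplicator:
        """Toy model of asynchronous off-site replication (illustrative only)."""

        def __init__(self):
            self.pending = queue.Queue()    # writes acknowledged locally but not yet shipped
            self.remote_copy = []           # what the off-site archive has actually received

        def write(self, record):
            # The primary acknowledges the write immediately...
            self.pending.put(record)

        def drain_to_remote(self):
            # ...and a background task ships queued writes when the link allows.
            while not self.pending.empty():
                self.remote_copy.append(self.pending.get())

    rep = AsyncReplicator()
    for i in range(5):
        rep.write(f"txn-{i}")
    rep.drain_to_remote()           # txn-0 .. txn-4 reach the archive
    rep.write("txn-5")              # still in flight when disaster strikes...
    print(rep.remote_copy)          # ...so the remote copy ends at txn-4

The faster and more reliable the network link, the smaller this exposure window becomes, but it never disappears entirely.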

Remote Replication With Restart (Cluster over Distance)

If you need to protect both your applications and your data, you may consider remote replication with restart: essentially, clustering over distance.

In a cluster over distance solution, a backup server is configured with the same application and directory structure as the primary server. The two servers share a common disk called a quorum disk. In the event of a disaster, instructions on the quorum disk initiate a failover to the secondary server, which stands in automatically for the primary server. The failover process can take ten to thirty minutes, during which in-flight data and transactions are lost and the end user has no access to the application.
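
The sketch below outlines, in simplified form, the sequence a cluster over distance typically scripts for failover; the function names and the heartbeat timeout are illustrative assumptions, not the Microsoft Cluster Service API:

    import time

    HEARTBEAT_TIMEOUT = 30    # seconds of silence before declaring the primary dead (illustrative)

    # Placeholder steps; in a real cluster each one is custom scripting against the cluster API.
    def acquire_quorum_disk():
        print("arbitrating ownership via the shared quorum disk")

    def mount_replicated_volumes():
        print("bringing the secondary copy of the data online")

    def start_application():
        print("restarting the cluster-aware application")

    def redirect_clients():
        print("repointing the virtual IP / DNS entry at the secondary server")

    def fail_over_to_secondary():
        # Each step adds to the ten-to-thirty-minute window during which users
        # have no access; in-flight transactions are not recovered.
        acquire_quorum_disk()
        mount_replicated_volumes()
        start_application()
        redirect_clients()

    last_heartbeat = time.time() - 60      # pretend the primary went silent a minute ago
    if time.time() - last_heartbeat > HEARTBEAT_TIMEOUT:
        fail_over_to_secondary()

Every one of these steps must be scripted, tested, and periodically re-tested, which is where much of the hidden cost of clustering lies.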

As illustrated in Figure 1, consider a fault- and disaster-tolerant solution before choosing a cluster. Fault- and disaster-tolerant solutions provide a higher level of protection for a comparable price and a much lower Total Cost of Ownership (TCO).



Clusters are the most technically challenging and time-consuming of these solutions to implement and manage. They have four key weaknesses:

- They require extensive custom scripting and configuration, as well as detailed failover and failback testing.

- Specialised skills, such as scripting and cluster API programming, are needed to administer these solutions on an ongoing basis.

- The shared quorum disk represents a dangerous potential single point of failure during a disaster.

- Application protection is provided only for special “cluster-aware” applications.

Fault and Disaster Tolerance

The highest level of disaster protection is both fault- and disaster-tolerant. Unlike disaster recovery solutions, fault- and disaster-tolerant technology continues to operate throughout a disaster, with no interruption of service to end users. Marathon Technologies’ Endurance® Long Distance SplitSite® (LDSS) is the only fault- and disaster-tolerant solution for Windows servers on the market.

Fault- and disaster-tolerant solutions operate two completely redundant systems as a single server, processing the same data at the same time in two different locations. The two systems can be located up to ten kilometres apart, with separate power sources, network connections, and disaster protection. If one system is destroyed by a disaster, the other continues to operate with no loss of data, no loss of transactions, and no perceptible loss of performance. In the event of a failure in the fibre interconnections, the Marathon software automatically keeps the most “fault-free” system up and running and brings the other system down to ensure data integrity. Service continues while the damaged system is repaired or replaced and brought back online. Because the server has been in continuous operation, no failover or restoration process is needed. The repaired system is automatically resynchronised with the operational system.

Hardware and Software Protection

The system is also resilient to software failure. The design (see Figure 2) physically and logically separates the two basic operations that all computer systems perform: computing, and storage and I/O processing. Each of Marathon’s two redundant systems is made up of two computers. One server-class computer is configured for computing functions, such as running the application, with a high-performance CPU and enough memory for the application; it has no storage or I/O capability. The second is configured for storage and I/O, with minimal CPU and memory but all of the I/O capability the system will need, including disk, network, and tape. The two computers are connected with high-speed cables and a PCI card called a Marathon Interface Card (MIC). Marathon’s software makes the two components function and appear to the network, the application, and even administrators as a single server for operational and management purposes.
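
The following is a conceptual sketch of that compute/I-O split; the class names and the request format are assumptions made for illustration and are not Marathon’s actual interfaces:

    class IOProcessor:
        """Owns the disks, network, and tape; performs I/O on the compute element's behalf."""

        def __init__(self):
            self.disk = {}          # stand-in for the system's storage

        def write_block(self, block, data):
            self.disk[block] = data
            return "ok"

    class ComputeElement:
        """Runs the application; has no storage or I/O capability of its own."""

        def __init__(self, io_processor):
            # In the real system the two computers are linked by the MIC interconnect;
            # here the link is simply an object reference.
            self.io = io_processor

        def application_write(self, block, data):
            # Every storage operation is redirected to the I/O processor, so the
            # application only ever sees the outcome of a well-behaved I/O request.
            return self.io.write_block(block, data)

    iop = IOProcessor()
    ce = ComputeElement(iop)
    print(ce.application_write(block=7, data=b"payroll record"))    # -> ok

The value of the separation is that a fault on the I/O side stays on the I/O side; the computer running the application never executes the code that failed.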



The Marathon software redirects all storage and I/O activity to the I/O portion of the system. In doing so, it shields the application from I/O faults that could otherwise take it down, presenting only well-behaved I/O operations to the application. If a driver corrupts registers or memory, or otherwise causes a server to fail, only one part (storage and I/O) of the system fails. The server running the application continues to operate, unaffected, using the I/O capabilities of the other system. End users experience no loss of service or performance.

Two identical systems are connected to form a complete fault-tolerant or Endurance LDSS server. The application computers (see Figure 3, CE 1 and CE 2) in each system run in instruction lockstep with each other, executing each program instruction at the same time, using the same data. If the Marathon software detects any discrepancy between the two systems, it immediately removes (or isolates) the source of the incorrect data. The storage and I/O (IOP) computers in each system run loosely coupled with each other, doing the same things at the same time; but because they deal with asynchronous events such as disk reads or network replies, they are not kept in strict synchronisation.
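
The following is a minimal sketch of the lockstep idea: both compute elements execute the same step on the same data and their outputs are compared, and any divergence flags a component for isolation. The comparison logic is illustrative, not Marathon’s implementation:

    def run_in_lockstep(ce1_step, ce2_step, operand):
        """Execute the same instruction on both compute elements and compare the results."""
        out1, out2 = ce1_step(operand), ce2_step(operand)
        if out1 != out2:
            # The real product determines which element misbehaved and isolates it;
            # this toy version simply reports the divergence.
            raise RuntimeError(f"lockstep divergence: {out1!r} != {out2!r}")
        return out1

    step = lambda x: x * 2                      # identical instruction stream on both CEs
    print(run_in_lockstep(step, step, 21))      # -> 42; both elements agree
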

Failure of any component in the application computers will cause the offending computer to be taken out of service, while its “lockstepped” partner continues to run the application. As soon as the problem is corrected, the repaired component is brought back into service automatically. All of the memory, register states, etc., from the remaining good computer are copied back over to the repaired one, and it is resynchronised. Failure on the storage and I/O computer is handled similarly. If it is a hardware failure of the computer or any of its components, the system is taken out of service. When it is repaired, it is manually rebooted and automatically brought back into synchronisation by the Marathon software.

Take the Pressure Off

A careful assessment of your company’s data and application needs, as well as of the impact of downtime on customer satisfaction, employee productivity, and employee morale, will help you implement a business continuity plan that protects your business and your staff without emptying your budget. Consider a fault- and disaster-tolerant solution if you need to protect your applications and data.