Loosely Coupled Redundancy

Introduction

Most server redundancy schemes involve tightly interdependent redundant servers. For instance, they might share a physical data storage device, like a Fibre Channel RAID, which introduces a single point of failure, or rely on highly specific technology, like GNBD, which can depend on a specific Linux kernel or distribution.

While many cluster environments primarily aim to solve problems of performance with only secondary attention to fault tolerance, I need a system that is first and foremost fault tolerant.

The real crux of my problem is the definition of "fault tolerant". Most of the faults that interrupt my services are not hardware failures or software crashes but intentional server changes, such as kernel upgrades that require reboots or software upgrades that may or may not be compatible with older versions.

Most high-availability cluster designs fail to take this simple fact into account. How can you make a cluster system that allows you to perform all the maintenance tasks you need to perform without interrupting service? You design a cluster where the two redundant elements are as independent as possible. You make them loosely coupled.

In a loosely coupled cluster, the cluster is resilient to large differences between the component servers. That means you can upgrade one while the other is serving requests. In practice, I've made it so one server is the main server while the other is the backup server (though this is not required). I can make changes to the backup server, test them, and then make the backup server active while I upgrade the main server.

User Data and Configuration Data

User data are things like user files and account information (/etc/passwd, etc.). This information must be the same on both servers, or they will differ in ways that make them non-redundant.

Configuration data, on the other hand, is the nitty-gritty of how stuff works, and it can be different between the two servers while still providing identical services. For instance, I can run Apache on one server and Roxen on another but still serve identical web content.

This distinction is important because I want to maintain redundancy while allowing differences between the two servers. I can mirror user data between the two servers, but configuration data must be managed manually.

My Current System

Right now, I have two web servers working as a loosely coupled cluster. They each work independently as a web server and they each contain all user data on their local hard drive.

Rsync

Running rsync every 5 minutes means the servers can differ by up to 5 minutes of changes in the event of an unplanned outage. This is acceptable for me, and it has many benefits.
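
A minimal sketch of what such a periodic mirror could look like, assuming rsync runs over ssh from the main server to the backup server. The host name backup.example.org and the path /home/ are hypothetical placeholders, and the choice of rsync options is an assumption rather than a record of the options actually in use.

```python
import subprocess

# Hypothetical values for illustration only.
BACKUP_HOST = "backup.example.org"
USER_DATA_PATHS = ["/home/"]


def build_rsync_command(path, host=BACKUP_HOST):
    """Return the rsync argument list that mirrors one directory to the backup."""
    return [
        "rsync",
        "--archive",  # keep permissions, timestamps, symlinks (ownership when run as root)
        "--delete",   # remove files on the backup that were deleted on the main server
        "--rsh=ssh",  # use ssh as the transport
        path,
        f"{host}:{path}",
    ]


def mirror_user_data(paths=USER_DATA_PATHS, run=subprocess.run):
    """Mirror each user-data path; raise if any rsync run fails."""
    for path in paths:
        run(build_rsync_command(path), check=True)
```

Calling mirror_user_data() from a script scheduled by cron with */5 in the minute field would give the 5 minute interval described above. The run parameter exists only so the function can be exercised without actually invoking rsync.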

Managing Configuration Data

I manually synchronize configuration because

There is No Automatic Failover

Problems

Because configuration data is managed manually, it is easy to let the servers slip out of a truly redundant state. A fix applied to the main server may never be applied to the backup server, for example.
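
One way to notice this kind of drift is to compare checksums of configuration files from both servers. The sketch below is an assumption about how that could be done, not something the current system does. Because configuration is allowed to differ, the result is only a list of files to review, not a list of errors. The function names fingerprint and drift are hypothetical.

```python
import hashlib
from pathlib import Path


def fingerprint(paths):
    """Map each existing file path to the SHA-256 hex digest of its contents."""
    result = {}
    for name in paths:
        p = Path(name)
        if p.is_file():
            result[name] = hashlib.sha256(p.read_bytes()).hexdigest()
    return result


def drift(main, backup):
    """Return the sorted paths whose digests differ or that exist on only one server."""
    return sorted(
        name
        for name in set(main) | set(backup)
        if main.get(name) != backup.get(name)
    )
```

Each server would run fingerprint() over its own configuration files, and the two resulting dictionaries would be brought together (for example, copied to one machine) before calling drift() on them.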

Some user data depends on configuration data. For example, accounts are created by scripts on the main server, then rsync replicates the changes to the backup server. If both machines are trying to perform account maintenance, they might step on each other's toes and get out of sync. If only the main machine performs account maintenance, accounts will go unmaintained while the main machine is offline.
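
One possible way to soften this trade-off, sketched here purely as an assumption, is to tie account maintenance to whichever server is currently active rather than to the main server specifically. The marker file /etc/cluster-active is hypothetical: an administrator would create it by hand on the active server and remove it from the other, and it would have to be excluded from mirroring.

```python
from pathlib import Path

# Hypothetical marker file; an administrator creates it on whichever
# server is currently active and removes it from the other one.
ACTIVE_MARKER = Path("/etc/cluster-active")


def is_active(marker=ACTIVE_MARKER):
    """Return True if this machine is currently marked as the active server."""
    return marker.exists()


def run_account_maintenance(task, marker=ACTIVE_MARKER):
    """Run task() only on the active server; return whether it ran."""
    if not is_active(marker):
        return False
    task()
    return True
```

This does not remove the problem entirely. If the marker is present on both machines, or on neither, the servers are back to one of the two failure modes described above.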

Loosly_Coupled_Redundancy (last edited 2008-02-27 01:17:57 by localhost)