High Availability (HA)

TFCC-HA

The TFCC-HA technical activity focuses on both the research and the development/implementation aspects of HA solutions. It will monitor the development of HA solutions in the industry, with special interest in the problems encountered in practice in such implementations and in the common services offered by the various vendors of HA solutions.

What is HA?

High Availability (HA for short) refers to the availability of resources in a computer system in the wake of component failures in the system. This can be achieved in a variety of ways, ranging from solutions that use custom and redundant hardware to ensure availability at one end of the spectrum, to software solutions built on off-the-shelf hardware components at the other. The former class of solutions provides a higher degree of availability, but is significantly more expensive, than the latter class. This has led to the popularity of the latter class, with almost all vendors of computer systems offering various HA products. Typically, these products survive single points of failure in the system.
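As a rough illustration of why redundancy raises availability, the sketch below assumes that component failures are independent (an assumption not made in the text above; real failures are often correlated). Under that assumption, n redundant components that are each available a fraction a of the time are jointly available 1 - (1 - a)^n of the time. The function names are hypothetical and chosen only for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def redundant_availability(a: float, n: int) -> float:
    """Availability of n redundant components, each with availability a.

    Assumes failures are independent and that the service is up as long
    as at least one component is up.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if n < 1:
        raise ValueError("need at least one component")
    # The service is down only when all n components are down at once.
    return 1.0 - (1.0 - a) ** n


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime per year, in hours, for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR
```

Under these assumptions, a single component that is available 99% of the time is down about 87.6 hours a year, while a redundant pair of such components is available about 99.99% of the time, or under one hour of downtime a year.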

Related Terminology

Continuous Availability: This implies non-stop service, with no lapse in service. This represents an ideal state, and is generally used to indicate a high level of availability in which only a very small quantity of downtime is allowed. High availability does not imply continuous availability.
Fault Tolerance: This is a means to achieve very high levels of availability. A fault-tolerant system has the ability to continue service despite a hardware or a software failure, and is characterized by redundancy in hardware, including CPU, memory, and I/O subsystems. High availability does not imply fault tolerance.
Single Point of Failure (SPOF): A hardware or software component whose loss results in the loss of service; such components are not backed up by redundant components.
Failover: When a component in an HA system fails, resulting in a loss of service, the service is started by the HA system on another component in the system. This transfer of a service following a failure in the system is termed failover.
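The failover and single-point-of-failure definitions above can be sketched as a toy model. This is a minimal illustration, not the behavior of any product listed on this page: the Cluster class, its methods, and the "first healthy node" placement policy are all hypothetical, and real HA systems also handle failure detection, shared storage, and service restart.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Cluster:
    """Toy model of an HA cluster (hypothetical, for illustration only)."""
    nodes: Dict[str, bool] = field(default_factory=dict)                # node -> healthy?
    placement: Dict[str, Optional[str]] = field(default_factory=dict)  # service -> node

    def add_node(self, name: str) -> None:
        self.nodes[name] = True

    def start(self, service: str, node: str) -> None:
        if not self.nodes.get(node, False):
            raise ValueError(f"node {node!r} is not available")
        self.placement[service] = node

    def fail(self, node: str) -> None:
        """Mark a node as failed and fail over the services it was running."""
        self.nodes[node] = False
        for service, host in self.placement.items():
            if host == node:
                # Failover: start the service on another healthy node.
                # If none is left, the service is lost (placement is None).
                self.placement[service] = self._pick_healthy()

    def _pick_healthy(self) -> Optional[str]:
        # Assumed policy: the first healthy node in insertion order.
        for name, healthy in self.nodes.items():
            if healthy:
                return name
        return None
```

With two nodes, failing the node that hosts a service moves the service to the other node. Once only one healthy node remains, that node is a single point of failure: failing it leaves the service with no host.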

Impact of Cluster Computing on HA

With the increasing popularity of cluster computing in the last decade, clusters are being widely used in many fields, including mission-critical applications. Such applications are typically not tolerant of system failures and the resulting loss of service, and require that the system on which they run provide little downtime, planned or unplanned. This has naturally led to a strong demand for good HA solutions for the cluster environment, and most of the important players in the industry are providing HA solutions, with varying feature sets and varying levels of service (in terms of uptime guarantees). Most commercial HA systems currently support a cluster of two or four servers, with future plans for larger clusters.

Research Projects on HA
(This is by no means an exhaustive list. If I have missed projects that belong here, please send me email with that information.)

High Availability Linux Project

Linux Virtual Server Project

Solaris-MC

The Horus Project

The ISIS Project

Commercial HA Products
(This is by no means an exhaustive list. If you are aware of a link I have missed, please send me email.)

Compaq's TruClusters

Data General's DG UX Clusters

HP's MC/ServiceGuard

IBM High Availability Products (CRS, HACMP/SP, Netfinity, and Sysplex)

Microsoft Cluster Services (formerly Wolfpack)

NCR's LifeKeeper

Novell's High Availability Solutions

RSi's Resilient Server Facility (RSF)

Sequent's ptx/CLUSTERS

SGI's IRIS FailSafe

Siemens Reliant Monitor Software (RMS)

Stratus's CA Solution

Sun Clusters

TurboLinux TurboCluster Server

Veritas Firstwatch

Publications on HA
(This is by no means an exhaustive list, and it includes white papers, reports, and vendor-published articles. I will add other important ones as I work on this page. Also, as I receive feedback about other links that I am not aware of, I will add those as well.)

The D. H. Brown Reports

D. H. Brown Associates, Inc., (DHBA) is a leading research and consulting firm, and its comparative report on various commercial UNIX- and NT-based HA products is considered to be the definitive expert report. A summary of each report is available to DHBA subscribers free of charge. You can also view the report online by following links to it from the various vendor sites listed above.

The IMEX High Availability Report

H. Milz, "Linux High Availability HOWTO"

This document describes the concepts of Linux-HA and is slated to become the operator manual for Linux-HA.

R. Nelson, "Exploring High Availability Issues with BEA Tuxedo and Third Party High Availability Software", Aurora Information Systems, January 1998.

This paper discusses the shortcomings of Tuxedo for HA and how to improve it in combination with a commercial HA package.

L. Sherman, "Choosing the Right Availability Solution", Stratus Computer

This white paper compares HA clusters and fault-tolerant systems.

W. Vogels et al., "The Design and Architecture of the Microsoft Cluster Service - A Practical Approach to High Availability and Scalability", 1998 International Fault Tolerant Computing Symposium.

The HASH (High Availability Software & Hardware/Clusters) Reports

This is also a part of the D. H. Brown HA effort. This URL has links to noteworthy news items regarding recent developments in HA in the industry.

Turbo Linux Cluster Web Server Version 1.0 White Paper

Losing Weight at COMDEX With NetWare Cluster Services for NetWare5

This article on Novell's NetWare Cluster Services is about their 300,000-user, 14-server cluster at Comdex Fall '99.

Uptime in Real Time With NetWare Cluster Services for NetWare 5

Another article on NetWare Cluster Services from Novell.

Contact details

Coordinator: Ira Pramanick
Postal mail:
Sun Microsystems
UMPK17-202
17 Network Circle
Menlo Park, CA 94025
E-mail: ira.pramanick@sun.com
Phone: 650-786-0892
Fax: 650-786-8549