
High Availability/Failover clusters



Just found out about this list.  There has been much discussion on the Linux-HA
mailing list (http://linux-ha.org/) about "cluster" issues oriented toward
supporting "High Availability":
- resource management, e.g., database engines, IP addresses, disks, file
systems
- failover, e.g., of disk control, applications, databases, etc.
- IP takeover
- monitoring, e.g., heartbeating, weak/strong membership services
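
To make the monitoring item concrete, here is a minimal sketch of heartbeat
tracking in Python.  The class name, the `dead_after` timeout, and the node
names are all hypothetical and not taken from any Linux-HA code.  It only
gives one node's local view of who is alive; a strong membership service would
also need the nodes to agree on that view, which this sketch does not attempt.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Track the last heartbeat seen from each node (hypothetical helper)."""

    dead_after: float  # seconds of silence before a node is considered failed
    last_seen: dict = field(default_factory=dict)

    def record(self, node: str, now: float) -> None:
        # Called whenever a heartbeat from `node` arrives.
        self.last_seen[node] = now

    def members(self, now: float) -> set:
        # Nodes heard from within the last `dead_after` seconds count as alive.
        return {n for n, t in self.last_seen.items() if now - t <= self.dead_after}

    def failed(self, now: float) -> set:
        # Every known node that is no longer a member.
        return set(self.last_seen) - self.members(now)


monitor = HeartbeatMonitor(dead_after=3.0)
monitor.record("node-a", now=0.0)
monitor.record("node-b", now=0.0)
monitor.record("node-a", now=2.5)  # node-b stays silent from here on
print(sorted(monitor.members(now=4.0)))  # ['node-a']
print(sorted(monitor.failed(now=4.0)))   # ['node-b']
```

Timestamps are passed in explicitly so the logic can be exercised without
real timers or a network.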

As a general rule, we've viewed this support as living mostly in user-space
daemons, with limited amounts of in-kernel code and only rare actual kernel
changes.  The kernel support we do want includes:
- scsi reserve/release, multi-tailed I/O, etc.
- IP aliasing/MAC address changes
- softdog (timer-based watchdog capability)
- soft-real-time scheduling for HA daemons, since these control everything and
when they need to run, they NEED to run!
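
As a rough illustration of the last two items, the sketch below shows how a
user-space HA daemon might request real-time scheduling and keep a watchdog
timer from firing.  It assumes a Linux system with the standard watchdog
device interface (softdog exposes it as /dev/watchdog); the function name, the
class name, and the priority value are hypothetical.

```python
import os


def try_realtime(priority: int = 1) -> bool:
    """Request round-robin real-time scheduling for this process.

    Returns False if the platform lacks the call or the request is denied.
    """
    if not hasattr(os, "sched_setscheduler"):
        return False  # not available on this platform
    try:
        os.sched_setscheduler(0, os.SCHED_RR, os.sched_param(priority))
        return True
    except OSError:
        return False  # typically requires root or CAP_SYS_NICE


class Watchdog:
    """Keep a Linux watchdog device from firing (hypothetical wrapper)."""

    def __init__(self, path: str = "/dev/watchdog"):
        # Opening the device arms the timer.
        self.fd = os.open(path, os.O_WRONLY)

    def pet(self) -> None:
        # Any write resets the countdown; stop writing and the node reboots.
        os.write(self.fd, b"\0")

    def close(self) -> None:
        # 'V' is the "magic close" character that disarms the timer,
        # unless the driver was loaded with nowayout set.
        os.write(self.fd, b"V")
        os.close(self.fd)
```

A daemon would call `try_realtime()` once at startup and `pet()` from its main
loop, so that a hung or starved daemon leads to a reset of the node rather
than a silent failure.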

I am not saying that the various HA components can't be in the kernel; they
can be (e.g., our distributed lock manager project,
http://oss.software.ibm.com/developerworks/projects/dlm), but if so they are
often loadable modules and don't usually require tight tie-ins to the kernel.
This is often the direction such work has taken in commercial HA clusters, and
it also keeps the code relatively independent of the kernel version.

One effort that has so far happened piecemeal, but will receive more focus in
the Linux-HA community in the near future, is defining 'componentry': layered,
granular services useful to all aspects of what we think of as an HA cluster,
so that you can mix and match components and use only the level of service you
require.  If you look at http://linux-ha.org/ you'll see some of this, thanks
greatly to Alan Robertson and many other contributors.  We hope to gain much
more momentum on this over the next few months.

We generally view "HA clusters" as relatively tightly integrated, usually - but
not always - with shared disks, and requiring strictly controlled access to the
resources.  For example, on a failover database server, uncontrolled disk
access means data corruption.  Bad.  We also view them as being relatively
small, with node counts ranging from single digits up to 16 or 32, not
hundreds of nodes.

Ah, a GFS+Mosix+Database cluster, requiring IP failover, start/stop/monitor of
the database and other applications, coordination of all of the above, would be
a valid, and very interesting, cluster.  There are different control aspects
than what we usually see in HA clusters.
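
To illustrate the start/stop/monitor coordination mentioned above, here is a
small sketch in Python.  The `Resource` and `ResourceGroup` classes and the
resource names are hypothetical, and the start/stop bodies are placeholders
for real actions.  It also only covers a planned move: if the old owner has
actually died, something like SCSI reserve or another takeover mechanism has
to stand in for the stop step.

```python
class Resource:
    """A managed resource with start/stop/monitor (hypothetical interface)."""

    def __init__(self, name: str):
        self.name = name
        self.running = False

    def start(self) -> None:
        self.running = True   # placeholder for mounting, ifconfig, etc.

    def stop(self) -> None:
        self.running = False  # placeholder for the matching teardown

    def monitor(self) -> bool:
        return self.running


class ResourceGroup:
    """Resources that move between nodes together, in dependency order."""

    def __init__(self, resources):
        self.resources = list(resources)

    def start(self, log: list) -> None:
        for r in self.resources:            # e.g. disk, then IP, then database
            r.start()
            log.append(f"start {r.name}")

    def stop(self, log: list) -> None:
        for r in reversed(self.resources):  # tear down in reverse order
            r.stop()
            log.append(f"stop {r.name}")

    def healthy(self) -> bool:
        return all(r.monitor() for r in self.resources)


def fail_over(from_group: ResourceGroup, to_group: ResourceGroup, log: list) -> None:
    # Release everything first so two nodes never own the disk at once.
    from_group.stop(log)
    to_group.start(log)


names = ["filesystem", "ip-address", "database"]
node_a = ResourceGroup(Resource(n) for n in names)
node_b = ResourceGroup(Resource(n) for n in names)
events = []
node_a.start(events)
fail_over(node_a, node_b, events)
print("\n".join(events))
```

Stopping in reverse order before starting elsewhere is what keeps disk access
strictly controlled during the move.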


=====
These have been the opinions of:
Peter R. Badovinatz -- (503)578-5530 (TL 775)
wombat@us.ibm.com/tabmowzo@yahoo.com
and in no way should be construed as official opinion of 
IBM, Corp.


Linux-cluster: generic cluster infrastructure for Linux
Archive:       http://mail.nl.linux.org/linux-cluster/