
High Availability versus Automatic Process Migration



There are two problems associated with automatic process migration when
considered in the context of HA:

1) The HA cluster manager wants to control what's running where *all the
	time*, so it can take proper recovery actions and guarantee the
	paranoid-by-definition customer what they want in terms of migration
	strategies.

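2) Some kinds of process migration (like Mosix) are low-availability
	solutions: a migrated process still depends on its original machine
	as well as the one it now runs on, so if every process has migrated
	away from its original machine, the loss of any node can kill all
	processes (for a 2-node system).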
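
To make point 2 concrete, here is a minimal sketch of the failure
arithmetic. It assumes a home-node style of migration, where a migrated
process needs both its original node and the node it currently runs on.
The names Process, home_node, current_node and killed_by_failure are
made up for illustration and are not taken from Mosix itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Process:
    name: str
    home_node: str     # node where the process was started (assumed dependency)
    current_node: str  # node where the process is executing now


def killed_by_failure(processes, failed_node):
    """Return the names of processes that depend on failed_node.

    Assumption: a process dies if either its home node or its
    current node is lost.
    """
    return [p.name for p in processes
            if failed_node in (p.home_node, p.current_node)]


# A 2-node system in which every process has migrated to the other node.
procs = [
    Process("db", home_node="A", current_node="B"),
    Process("web", home_node="B", current_node="A"),
]
```

Under that assumption, killed_by_failure(procs, "A") and
killed_by_failure(procs, "B") both name every process, whereas a process
that never left node A would survive the loss of node B.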

Successful HA customers often have these characteristics:
	Availability and data integrity are EVERYTHING
	control-freaks
	anal-retentive
	paranoid
	perfectionists

I don't know enough about HPC customers, but I don't associate these
characteristics with the HPC arena.

My guess is that these differences will push the solutions farther apart
and into separate market niches - even if all the technology could
otherwise be common.

Having said that, there are MANY possible common elements between HA and HPC
clusters, and many common solutions are possible.

A few examples come readily to mind:
	Cluster membership and corresponding event APIs
	single-image boot
	cluster filesystems
	system monitoring
	node reset mechanisms (e.g., STONITH)

Unless a given feature REQUIRES a kernel implementation by definition, I
would strongly recommend against using /proc-like interfaces for user
programs, but instead define an API which can be easily and sensibly
implemented by user-level programs.

Of course, if there is a /proc-thingie around, the API could just turn
around and ask /proc (through a plug-in model).  BUT, the applications
shouldn't be doing this themselves.
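
As a rough sketch of what that layering might look like: applications
call a small API, and the API delegates to whichever plug-in is
configured. Everything named here (ClusterAPI, MembershipBackend,
StaticBackend, ProcBackend, and the /proc/cluster/nodes path) is
hypothetical and only illustrates the shape; none of it is Heartbeat's
or any kernel's actual interface.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import List


class MembershipBackend(ABC):
    """Plug-in interface: one way of finding out who is in the cluster."""

    @abstractmethod
    def members(self) -> List[str]:
        ...


class StaticBackend(MembershipBackend):
    """Pure user-level plug-in: membership supplied by a user-space program."""

    def __init__(self, nodes):
        self._nodes = list(nodes)

    def members(self) -> List[str]:
        return list(self._nodes)


class ProcBackend(MembershipBackend):
    """Plug-in that asks a /proc-like file, one node name per line.

    The default path is a made-up example, not a real kernel interface.
    """

    def __init__(self, path="/proc/cluster/nodes"):
        self._path = Path(path)

    def members(self) -> List[str]:
        lines = self._path.read_text().splitlines()
        return [line.strip() for line in lines if line.strip()]


class ClusterAPI:
    """What applications call; they never touch /proc themselves."""

    def __init__(self, backend: MembershipBackend):
        self._backend = backend

    def members(self) -> List[str]:
        return sorted(self._backend.members())

    def is_member(self, node: str) -> bool:
        return node in self._backend.members()
```

An application would write ClusterAPI(StaticBackend(["n1", "n2"])) or
ClusterAPI(ProcBackend()) and call members() the same way in both cases,
so swapping a kernel implementation in or out changes only the plug-in.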

Heartbeat is a user-space cluster manager.  We originally implemented a
/proc interface for it because it was cool, and could be common with kernel
implementations.  It was also a mistake, and has been dropped.  I now
believe that doing it the other way around makes more sense.

	-- Alan Robertson
	   alanr@unix.sh

Linux-cluster: generic cluster infrastructure for Linux
Archive:       http://mail.nl.linux.org/linux-cluster/