
Re: High Availability versus Automatic Process Migration



On Wed, Feb 28, 2001 at 09:24:33PM -0700, Alan Robertson wrote:

> I don't know enough about HPC customers, but I don't associate these
> characteristics with the HPC arena.

You'd be surprised. But even the national weather service doesn't mind
if their forecast takes twice as long to run on (rare) occasion; they
just want the answer almost all of the time in the allowed window, and
they throw extra CPUs at the problem to get the run time down to a small
fraction of the window. Then a rerun due to failure doesn't violate
the deadline.
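
To make the arithmetic concrete, here is a minimal sketch. The function
name and the hour figures are made up for illustration, and it assumes
the worst case where a failure costs a whole run:

```python
import math

def tolerated_failures(window_hours: float, run_hours: float) -> int:
    """Worst-case number of failed runs that still leave room for one
    complete run inside the window. Assumes a failure loses the whole
    run (restart from the beginning, no checkpointing)."""
    if window_hours <= 0 or run_hours <= 0:
        raise ValueError("window and run time must be positive")
    # Number of full-length attempts that fit in the window.
    attempts = math.floor(window_hours / run_hours)
    # One attempt must succeed; the rest may fail.
    return max(attempts - 1, 0)

# Hypothetical numbers: a 6-hour window.
print(tolerated_failures(6, 4))    # run fills most of the window
print(tolerated_failures(6, 1.5))  # run is a small fraction of the window
```

With a run that takes most of the window, a single failure blows the
deadline; shrink the run to a quarter of the window and several reruns
fit.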

So, the timescale for getting the right answer is different. For most
commercial HA clusters, you want failover in fractions of a second
or seconds. For an HPC system, well, if I have to occasionally restart
that 100-node job from the beginning, it's not the end of the
world... I just don't want the user to get back a failure because a
node died.
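
A rough sketch of what I mean by that kind of weak HA, with no claim
that any real queue system works this way. NodeFailure,
run_with_restart, and the flaky job below are all hypothetical names
for illustration:

```python
class NodeFailure(Exception):
    """Hypothetical signal that a node died while the job was running."""

def run_with_restart(job, max_attempts=3):
    """Run job() from the beginning until it finishes or attempts run out.

    Returns (result, attempts_used). The user only sees an error if
    every attempt hit a node failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job(), attempt
        except NodeFailure:
            # Nothing is salvaged; the next attempt starts from scratch.
            continue
    raise RuntimeError(f"job failed on all {max_attempts} attempts")

# Usage: a stand-in job that loses a node on its first run only.
state = {"runs": 0}

def flaky_job():
    state["runs"] += 1
    if state["runs"] == 1:
        raise NodeFailure("node died")
    return "forecast"

result, attempts = run_with_restart(flaky_job)
print(result, attempts)
```

The only extra cost is the wasted run, which is why this flavor of HA
needs so little extra equipment.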

Not surprisingly, this affects cost. The HPC version of weak HA
requires little extra equipment. But even on my HPC system, I'd like
my admin node, with the queue system and so on, to be a 2-node HA
system. Not to mention the controller for my parallel filesystem and
mass-store... as long as it's cheap enough.

> A few examples come readily to mind:
> 	Cluster membership and corresponding event APIs
> 	single-image boot
> 	cluster filesystems
> 	system monitoring
> 	node reset mechanisms (i.e., Stonith)

These are related, yes. Gee, I never realized Stonith needed a name...
I use APC masterswitches so I can remotely power cycle nodes, just for
system admin convenience.

-- g

Linux-cluster: generic cluster infrastructure for Linux
Archive:       http://mail.nl.linux.org/linux-cluster/