
Re: ETCP Project



On Thu, Mar 01, 2001 at 03:45:59PM -0700, Alan Robertson wrote:

> The hard problem isn't at the transport layer, but at the application layer
> - in synchronizing application state.  For migration clusters, this is
> *comparatively* easy, but for failover clusters this is typically very hard.

I'd agree. But the failover guys have better tools than I thought they
had. There is one thing in my clusters that I want to fail over: the
"master" node, which runs the queue system. The queue system has a
fairly small amount of state, so drbd+heartbeat/takeover looks like
it's good enough for my purposes. Neato.

By the way, I'm writing a piece of code that HA people might find
useful. It's called ForwardFS, and it's a filesystem which forwards
all _system calls_ to another system to get executed. Since the
forwarding is done on a system call basis, there is no caching or
weirdness related to using a block device, like a 1 byte write
followed by a flush causing 8k of traffic. If the node crashes, no
fsck is needed on the remote node. The minus is that there's no
caching, and a bunch of 1 byte writes cause separate network
transactions.
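
To put rough numbers on that tradeoff, here is a small Python sketch.
It is not ForwardFS code: the wire format (opcode, fd, offset, length)
and the function names are made up for illustration, and the block
model simply assumes every write is followed by a flush that ships
whole 8k blocks.

```python
import struct

# Hypothetical wire format (an assumption, not the ForwardFS protocol):
# opcode (1 byte), file descriptor (4), offset (8), payload length (4).
HEADER = struct.Struct("!BIQI")
OP_WRITE = 1
BLOCK_SIZE = 8192  # the 8k block from the example above


def encode_write(fd, offset, data):
    """Pack one write() call into a single network message."""
    return HEADER.pack(OP_WRITE, fd, offset, len(data)) + data


def syscall_traffic(writes):
    """Bytes sent when every write() is forwarded as its own message."""
    return sum(len(encode_write(fd, off, data)) for fd, off, data in writes)


def block_traffic(writes):
    """Bytes sent when every write is flushed as whole dirty blocks."""
    total = 0
    for _fd, off, data in writes:
        if not data:
            continue
        first = off // BLOCK_SIZE
        last = (off + len(data) - 1) // BLOCK_SIZE
        total += (last - first + 1) * BLOCK_SIZE
    return total


one_byte = [(3, 0, b"x")]
print(syscall_traffic(one_byte), block_traffic(one_byte))
```

Under these assumptions a 1 byte write costs 18 bytes on the wire when
forwarded as a call and 8192 bytes when flushed as a block. The flip
side also shows up: three 1 byte writes become three separate messages,
since nothing coalesces them.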

It wouldn't be hard to have system calls which read execute only on
the local node, and system calls which write execute on both the local
and remote nodes. Voila, it's an HA component.
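
The read-local, write-both idea can be sketched in a few lines of
Python. Everything here is hypothetical: MemoryBackend stands in for a
node, Forwarder and WRITE_CALLS are illustrative names, and only
pread/pwrite are modelled. Error handling and what to do when the
remote node is down are left out on purpose.

```python
class MemoryBackend:
    """Stand-in for one node: keeps file contents in a dict."""

    def __init__(self):
        self.files = {}
        self.calls = []  # names of the calls executed on this node

    def execute(self, name, path, offset=0, data=b"", length=0):
        self.calls.append(name)
        if name == "pwrite":
            buf = bytearray(self.files.get(path, b""))
            if len(buf) < offset:
                # Writing past the end: pad the gap with zero bytes.
                buf.extend(b"\0" * (offset - len(buf)))
            buf[offset:offset + len(data)] = data
            self.files[path] = bytes(buf)
            return len(data)
        if name == "pread":
            return self.files.get(path, b"")[offset:offset + length]
        raise ValueError("unsupported call: " + name)


# Calls that change state and so must reach the remote node as well.
WRITE_CALLS = {"pwrite"}


class Forwarder:
    """Runs reads locally only; runs writes on local and remote."""

    def __init__(self, local, remote):
        self.local = local
        self.remote = remote

    def call(self, name, *args, **kwargs):
        result = self.local.execute(name, *args, **kwargs)
        if name in WRITE_CALLS:
            self.remote.execute(name, *args, **kwargs)
        return result


local, remote = MemoryBackend(), MemoryBackend()
fs = Forwarder(local, remote)
fs.call("pwrite", "/queue/state", 0, b"job1")
print(fs.call("pread", "/queue/state", offset=0, length=4))
print(remote.calls)
```

After the write, both nodes hold the same bytes, while the read shows
up only in the local node's call log. That is the property a takeover
node would rely on.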

Condor and Mosix and other migration clusters do this sort of thing
for most syscalls of individual processes, but not for just a part of
the filesystem.  I'm actually hacking up PVFS to write ForwardFS.

-- g


Linux-cluster: generic cluster infrastructure for Linux
Archive:       http://mail.nl.linux.org/linux-cluster/