Re: [Hacqs] Re: ETCP Project
> > I have one thing in my clusters I want to make failover, that's
> > the "master" node which runs the queue system. The queue system has a
> > fairly small amount of state, so drbd+heartbeat/takeover looks like
> > it's good enough for my purposes. Neato.
>
> Is your queuing system open source software?
Yes, it's OpenPBS. Both it and the commercial PBSPro are careful to keep
all of their state in little files on disk and to fsync those files
frequently. That doesn't guarantee that nothing will go wrong, but
trouble seems fairly unlikely. Maybe you could talk them into using the
new Berkeley DB stuff from Sleepycat.
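
To make the "little files plus fsync" idea concrete, here is a rough sketch of the general pattern in Python. It is not OpenPBS's actual code; the names save_state and load_state and the file name queue.state are made up for illustration. The new contents go to a temporary file, get fsynced, and are then renamed over the old file, so a crash leaves either the old state or the new state on disk, never a half-written one.

```python
import os
import tempfile


def save_state(path, data):
    """Durably replace the file at `path` with the bytes in `data`.

    Hypothetical helper, not taken from any real queue system.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename
    # stays on one filesystem and is therefore atomic.
    fd, tmp = tempfile.mkstemp(dir=dirname, prefix=".state-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the file contents to disk
        os.replace(tmp, path)     # atomically swap in the new file
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    if os.name == "posix":
        # fsync the directory too, so the rename itself is on disk.
        dfd = os.open(dirname, os.O_RDONLY)
        try:
            os.fsync(dfd)
        finally:
            os.close(dfd)


def load_state(path):
    """Read back whatever save_state last wrote."""
    with open(path, "rb") as f:
        return f.read()
```

On a drbd-mirrored partition the standby node would then see either the old file or the new one after a takeover, which is roughly why this style of state keeping fits the heartbeat setup described above.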
I suspect that some of the other open source queue systems have this
same property: DQS, GNU Queue, and the allegedly-to-be-open-sourced
Codine/GRD/whatever it's named today from Sun.
Now do keep in mind that surviving a queue server crash is completely
different from having jobs that can survive a compute node crash... or
jobs that can be correctly restarted from the beginning or a
checkpoint...
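
To show what "restartable from a checkpoint" asks of the job itself, here is a toy sketch in Python. Everything in it is hypothetical: run_job, the JSON checkpoint format, and the stop_after argument (which only simulates a crash) are assumptions for illustration, not part of any queue system. The job records its progress after every step and picks up from that record when it is started again.

```python
import json
import os


def run_job(total_steps, checkpoint_path, stop_after=None):
    """Sum the integers 0..total_steps-1, resuming from a checkpoint.

    `stop_after` simulates a node crash after that many steps in this
    run; the function then returns None without finishing.
    """
    state = {"next_step": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        # A previous run left a checkpoint: continue from it.
        with open(checkpoint_path) as f:
            state = json.load(f)

    done_this_run = 0
    while state["next_step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # simulated crash
        state["total"] += state["next_step"]
        state["next_step"] += 1
        done_this_run += 1

        # Write the checkpoint to a temp file, fsync, then rename,
        # so a crash never leaves a half-written checkpoint behind.
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, checkpoint_path)

    return state["total"]
```

The queue server surviving a crash only means the job is still known to the system; whether rerunning it does the right thing depends on the job being written along these lines, or being safe to rerun from the beginning.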
-- greg
Linux-cluster: generic cluster infrastructure for Linux
Archive: http://mail.nl.linux.org/linux-cluster/