From owner-linux-cluster@nl.linux.org Thu Mar  1 00:17:51 2001
Received: by humbolt.nl.linux.org id <S92343AbRB1XRe>;
	Thu, 1 Mar 2001 00:17:34 +0100
Received: from web9208.mail.yahoo.com ([216.136.129.41]:28932 "HELO
        web9208.mail.yahoo.com") by humbolt.nl.linux.org with SMTP
	id <S92350AbRB1XQ7>; Thu, 1 Mar 2001 00:16:59 +0100
Message-ID: <20010228231656.22332.qmail@web9208.mail.yahoo.com>
Received: from [192.148.11.96] by web9208.mail.yahoo.com; Wed, 28 Feb 2001 15:16:56 PST
Date:   Wed, 28 Feb 2001 15:16:56 -0800 (PST)
From:   Peter Badovinatz <tabmowzo@yahoo.com>
Subject: Re: inventory
To:     Linux Cluster <linux-cluster@nl.linux.org>
In-Reply-To: <20010228175742.A2077@wumpus>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list


--- Greg Lindahl <lindahl@conservativecomputer.com> wrote:
> On Wed, Feb 28, 2001 at 02:36:33PM -0800, Peter Badovinatz wrote:
> 
> > I would love to see 'one way' to set/read the node number.
> 
> What's a "node number" for? In the clusters I've built, nodes have
> unique names, which happen to be the Unix hostname. Is that not
> appropriate for your use?

A node name is a lot of data to send around and to maintain for cluster
components.  One 'complaint' we get is that our messages/memory/etc. take up
too much bandwidth as it is.  We often have to send around "maps" of all of the
nodes that are in the cluster as part of membership consensus decisions, so
sending the actual hostnames gets relatively large, especially as our code
usually does its main work during failure events, when the networks are often
flaky.
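
To make the size argument concrete, here is a minimal sketch; it is not the
code described above. It assumes a hypothetical 128-node cluster whose
hostnames follow an invented nodeNNN.cluster.example.com pattern, and it
compares a membership map sent as hostnames with the same map sent as a
bitmap indexed by node number. The function names and both encodings are
assumptions made for illustration only.

```python
# Hypothetical illustration: wire size of a membership map sent as
# hostnames versus the same map sent as a bitmap of node numbers.

def encode_by_name(members):
    """Encode the member set as NUL-separated ASCII hostnames."""
    return b"\0".join(name.encode("ascii") for name in sorted(members))


def encode_by_number(member_numbers, cluster_size):
    """Encode the member set as a bitmap, one bit per node number."""
    bitmap = bytearray((cluster_size + 7) // 8)
    for n in member_numbers:
        if not 0 <= n < cluster_size:
            raise ValueError("node number out of range: %r" % n)
        bitmap[n // 8] |= 1 << (n % 8)
    return bytes(bitmap)


def decode_by_number(bitmap):
    """Recover the set of node numbers from a bitmap."""
    members = set()
    for byte_index, byte in enumerate(bitmap):
        for bit in range(8):
            if byte & (1 << bit):
                members.add(byte_index * 8 + bit)
    return members


if __name__ == "__main__":
    # Invented naming scheme; real hostnames may be shorter or longer.
    names = {"node%03d.cluster.example.com" % n for n in range(128)}
    numbers = set(range(128))
    print(len(encode_by_name(names)), "bytes when sent by name")
    print(len(encode_by_number(numbers, 128)), "bytes when sent by number")
```

The bitmap only works because every node already agrees on which bit position
belongs to which node, which is the "positions or layout" agreement that the
names alone do not give you.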

We can generate maps of nodes using the names, but we still have to agree on
the positions or layout.  We also usually support (in fact, usually demand, to
eliminate single points of failure) that each node have multiple network
connections and multiple names, so a single hostname is not sufficient for us.
We also tend to move IP addresses around (one of the jobs for which we get
paid!).  A node number makes it easier to uniquely refer to "a node" in these
cases, but, yes, it is a matter of structure, degree and philosophy.

One other point is that most existing commercial HA clustering uses node
numbers, and if we're porting that, we use node numbers.
> 
> -- g
> 


=====
These have been the opinions of:
Peter R. Badovinatz -- (503)578-5530 (TL 775)
wombat@us.ibm.com/tabmowzo@yahoo.com
and in no way should be construed as official opinion of 
IBM, Corp.


Linux-cluster: generic cluster infrastructure for Linux
Archive:       http://mail.nl.linux.org/linux-cluster/

From owner-linux-cluster@nl.linux.org Thu Mar  1 00:27:59 2001
Received: by humbolt.nl.linux.org id <S92350AbRB1X1m>;
	Thu, 1 Mar 2001 00:27:42 +0100
Received: from saturn.cs.uml.edu ([129.63.8.2]:53006 "EHLO saturn.cs.uml.edu")
	by humbolt.nl.linux.org with ESMTP id <S92345AbRB1X1R>;
	Thu, 1 Mar 2001 00:27:17 +0100
Received: (from acahalan@localhost)
	by saturn.cs.uml.edu (8.11.0/8.11.2) id f1SNPGH182906;
	Wed, 28 Feb 2001 18:25:16 -0500 (EST)
From:   "Albert D. Cahalan" <acahalan@cs.uml.edu>
Message-Id: <200102282325.f1SNPGH182906@saturn.cs.uml.edu>
Subject: Re: inventory
To:     david@kasey.umkc.edu (David L. Nicol)
Date:   Wed, 28 Feb 2001 18:25:16 -0500 (EST)
Cc:     riel@conectiva.com.br (Rik van Riel), linux-cluster@nl.linux.org
In-Reply-To: <3A9D75CA.275E9F8B@kasey.umkc.edu> from "David L. Nicol" at Feb 28, 2001 04:03:54 PM
X-Mailer: ELM [version 2.5 PL2]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

David L. Nicol writes:

> Or at least defining what is and is not a "cluster service."
> For instance, PVM and rsh, as beautiful and useful as they
> are, do not require any blurring of the line between what is running
> in which box.

Just the opposite: they will have problems when you blur the
lines between boxes. Nodes need not have distinct names, and any
process migration can dump you right back where you came from!

Think of the above problem on a common SMP system. Can I rsh
from CPU 2 to CPU 3 in any meaningful way?

> It's stunning the level of not-made-here-itis that these projects
> can accumulate.

It's to be expected I think. Transport capabilities vary widely
and goals differ. Throwing together a Beowulf is not the same as
building fault-tolerant compact PCI systems for telecom.

> Let's focus on one thing that we know all clustering schemes have, and
> see if we can standardize it, then go from there.  My nominee of a
> suitable case for this treatment remains node numbering.
>
> Consensus appears that there is a standard file that can be read from
> to determine one's node number, or written to to change it.

To me, the node ID is a small integer, from 2 to 2000 perhaps.
The boot loader may directly set the value, using the System.map
file to determine location. It could be a command line option.

> Mosix "clutters up" /proc with all of its controls and displays; and
> it has been suggested that it is preferable to define a New File System
> for a clustering architecture's controls and mount it somewhere rather
> than doing this.  I like this approach since not only does it reduce
> the amount of patching (new clusterfs instead of altered procfs) it

If you need a big tree: new file system
If you need a file: use /proc for it

Hacking up /proc is needed anyway, for a shared PID space.

> trivially allows participation in multiple clusters by mounting multiple
> clusterfses at multiple places

I see this as featuritis. Boot into a cluster, and shutdown to leave.

> Are all in agreement with the above notes and ideas?

No. More likely, are all in mutual disagreement?


From owner-linux-cluster@nl.linux.org Thu Mar  1 00:52:05 2001
Received: by humbolt.nl.linux.org id <S92345AbRB1Xvq>;
	Thu, 1 Mar 2001 00:51:46 +0100
Received: from saturn.cs.uml.edu ([129.63.8.2]:18959 "EHLO saturn.cs.uml.edu")
	by humbolt.nl.linux.org with ESMTP id <S92343AbRB1XvW>;
	Thu, 1 Mar 2001 00:51:22 +0100
Received: (from acahalan@localhost)
	by saturn.cs.uml.edu (8.11.0/8.11.2) id f1SNlUM164458;
	Wed, 28 Feb 2001 18:47:30 -0500 (EST)
From:   "Albert D. Cahalan" <acahalan@cs.uml.edu>
Message-Id: <200102282347.f1SNlUM164458@saturn.cs.uml.edu>
Subject: Re: cluster list
To:     jamagallon@able.es (J . A . Magallon)
Date:   Wed, 28 Feb 2001 18:47:30 -0500 (EST)
Cc:     cermak@IMCS.rutgers.edu (Rob Cermak), linux-cluster@nl.linux.org
In-Reply-To: <20010228234630.A1256@werewolf.able.es> from "J . A . Magallon" at Feb 28, 2001 11:46:30 PM
X-Mailer: ELM [version 2.5 PL2]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

J . A . Magallon writes:

> First of all, I have to say that I do not know too much about kernel
> internals. [...] My main interest is in getting a cluster built with
> low end boxes (low end relative to multiprocessing boxes, some 2-way
> pc boards) linked with 100Mb ether and its own switch.

In that case, you have everything you need and no reason to even
pay attention to this list. The existing stuff works fine.

> As everybody says, all that can be done in user space should be
> done that way.

You can do TCP/IP in userspace. You should not!

> It would be fine to have something like
> /cluster/node/0/ip
>                 mem
>                 bogomips
> /cluster/node/1/ip
> ..
> /cluster/node/self -> 1
> ..

If you share the PID space, you can put it all in /proc instead.
Having /proc/cluster/$NODE would be bad, but it is great if everything
can sanely fit into the existing /proc.

> And think about nodes in cluster being even diskless. My ideal
> cluster will be a root NFS server and nodes booting over ethernet,

Mine:

Each node consists of a CPU, local RAM, optional flash ROM, L3 cache,
and a multi-purpose chip that does DMA between nodes. There isn't any
expansion opportunity on the node, or even a PCI chip.

So stop assuming clusters have network cards! If I had one, it would
be an add-on device accessed via a dumb (non-CPU) bridge node. There
are some problems here too... which node(s) control(s) the device?
Can (should) node 42 tell the device to use data in node 33?


From owner-linux-cluster@nl.linux.org Thu Mar  1 01:20:11 2001
Received: by humbolt.nl.linux.org id <S92343AbRCAATz>;
	Thu, 1 Mar 2001 01:19:55 +0100
Received: from saturn.cs.uml.edu ([129.63.8.2]:56079 "EHLO saturn.cs.uml.edu")
	by humbolt.nl.linux.org with ESMTP id <S92350AbRCAAT2>;
	Thu, 1 Mar 2001 01:19:28 +0100
Received: (from acahalan@localhost)
	by saturn.cs.uml.edu (8.11.0/8.11.2) id f210J05141359;
	Wed, 28 Feb 2001 19:19:00 -0500 (EST)
From:   "Albert D. Cahalan" <acahalan@cs.uml.edu>
Message-Id: <200103010019.f210J05141359@saturn.cs.uml.edu>
Subject: Re: cluster list
To:     riel@conectiva.com.br (Rik van Riel)
Date:   Wed, 28 Feb 2001 19:19:00 -0500 (EST)
Cc:     lindahl@conservativecomputer.com (Greg Lindahl),
        linux-cluster@nl.linux.org
In-Reply-To: <Pine.LNX.4.33.0102280044541.1961-100000@duckman.distro.conectiva> from "Rik van Riel" at Feb 28, 2001 12:54:55 AM
X-Mailer: ELM [version 2.5 PL2]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

Rik van Riel writes:
> On Wed, 28 Feb 2001, Greg Lindahl wrote:
>> [Albert Cahalan]

>>> What type of clusters do people care to discuss, if any?
>>
>> The clusters I build for HPC are totally at user-level, and don't
>> really require any kernel changes except good fast networking and
>> maybe page coloring. Clusters are a quite diverse topic.
>
> I don't intend this list to be limited to kernel level
> things, on the contrary...

Non-kernel solutions already exist. PVM works great, doesn't it?
Discussion would be pointless.

> The more things we can do cleanly in userland, the more
> the kernel will stay "small" and maintainable ;)

Mosix is interesting. NUMA is interesting. Maintaining a single
system image across multiple kernels is interesting. Running
real-time processes is interesting.




From owner-linux-cluster@nl.linux.org Thu Mar  1 01:25:15 2001
Received: by humbolt.nl.linux.org id <S92343AbRCAAYw>;
	Thu, 1 Mar 2001 01:24:52 +0100
Received: from gw.xkey.com ([206.86.100.52]:50184 "EHLO happy.xkey.com")
	by humbolt.nl.linux.org with ESMTP id <S92350AbRCAAYe>;
	Thu, 1 Mar 2001 01:24:34 +0100
Received: (from smtp@localhost) by happy.xkey.com
	id QAA13882 for <linux-cluster@nl.linux.org>; Wed, 28 Feb 2001 16:24:31 -0800
Received: from hpti8.fsl.noaa.gov(137.75.132.228) by happy.xkey.com via smtp (V1.3)
	id sma013492; Wed Feb 28 16:11:08 2001
Received: (from lindahl@localhost)
	by localhost.hpti.com (8.11.0/8.11.0) id f210BPk02255
	for linux-cluster@nl.linux.org; Wed, 28 Feb 2001 19:11:25 -0500
X-Authentication-Warning: localhost.hpti.com: lindahl set sender to lindahl@conservativecomputer.com using -f
Date:   Wed, 28 Feb 2001 19:11:25 -0500
From:   Greg Lindahl <lindahl@conservativecomputer.com>
To:     Linux Cluster <linux-cluster@nl.linux.org>
Subject: Re: inventory
Message-ID: <20010228191125.A2191@wumpus>
Mail-Followup-To: Linux Cluster <linux-cluster@nl.linux.org>
References: <20010228175742.A2077@wumpus> <20010228231656.22332.qmail@web9208.mail.yahoo.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <20010228231656.22332.qmail@web9208.mail.yahoo.com>; from tabmowzo@yahoo.com on Wed, Feb 28, 2001 at 03:16:56PM -0800
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

On Wed, Feb 28, 2001 at 03:16:56PM -0800, Peter Badovinatz wrote:

> but we still have to agree as to the positions or layout.  We also
> usually support (in fact usually demand to eliminate single points
> of failure) each node have multiple network connections, and
> multiple names, so a single hostname is not sufficient for us.

I thought that in most HA clusters, each machine has a "true name" (&
IP) that's unique and never changes, and then other names (& IP
addresses) that are associated with services that move around when
things fail? A single, unique hostname is still sufficient, and you
can make them one byte. Then you don't have any need for an extension?

-- greg



From owner-linux-cluster@nl.linux.org Thu Mar  1 01:25:31 2001
Received: by humbolt.nl.linux.org id <S92350AbRCAAZJ>;
	Thu, 1 Mar 2001 01:25:09 +0100
Received: from gw.xkey.com ([206.86.100.52]:51464 "EHLO happy.xkey.com")
	by humbolt.nl.linux.org with ESMTP id <S92351AbRCAAYk>;
	Thu, 1 Mar 2001 01:24:40 +0100
Received: (from smtp@localhost) by happy.xkey.com
	id QAA13940 for <linux-cluster@nl.linux.org>; Wed, 28 Feb 2001 16:24:37 -0800
Received: from hpti8.fsl.noaa.gov(137.75.132.228) by happy.xkey.com via smtp (V1.3)
	id sma013587; Wed Feb 28 16:17:59 2001
Received: (from lindahl@localhost)
	by localhost.hpti.com (8.11.0/8.11.0) id f210IOO02272
	for linux-cluster@nl.linux.org; Wed, 28 Feb 2001 19:18:24 -0500
X-Authentication-Warning: localhost.hpti.com: lindahl set sender to lindahl@conservativecomputer.com using -f
Date:   Wed, 28 Feb 2001 19:18:24 -0500
From:   Greg Lindahl <lindahl@conservativecomputer.com>
To:     linux-cluster@nl.linux.org
Subject: Re: cluster list
Message-ID: <20010228191824.B2191@wumpus>
Mail-Followup-To: linux-cluster@nl.linux.org
References: <20010228234630.A1256@werewolf.able.es> <200102282347.f1SNlUM164458@saturn.cs.uml.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <200102282347.f1SNlUM164458@saturn.cs.uml.edu>; from acahalan@cs.uml.edu on Wed, Feb 28, 2001 at 06:47:30PM -0500
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

On Wed, Feb 28, 2001 at 06:47:30PM -0500, Albert D. Cahalan wrote:

> Each node consists of a CPU, local RAM, optional flash ROM, L3 cache,
> and a multi-purpose chip that does DMA between nodes. There isn't any
> expansion opportunity on the node, or even a PCI chip.
> 
> So stop assuming clusters have network cards!

Your nodes are much like CPUs on the Cray T3E. Sometimes people think
of your DMA chip as a network card; I do DMA puts and gets with
Myrinet network cards. Devices on a system like this usually sit on
special CPUs and you have to DMA them a request or pass a message to
the remote OS instance in order to talk to the device, instead of
talking to the device itself.

-- g


From owner-linux-cluster@nl.linux.org Thu Mar  1 01:28:49 2001
Received: by humbolt.nl.linux.org id <S92343AbRCAA2U>;
	Thu, 1 Mar 2001 01:28:20 +0100
Received: from imladris.infradead.org ([194.205.184.45]:52748 "EHLO
        infradead.org") by humbolt.nl.linux.org with ESMTP
	id <S92345AbRCAA1t>; Thu, 1 Mar 2001 01:27:49 +0100
Received: from jalon.able.es ([212.97.163.2])
	by infradead.org with esmtp (Exim 3.20 #2)
	id 14YGwd-0004Oo-00
	for linux-cluster@nl.linux.org; Thu, 01 Mar 2001 00:27:48 +0000
Received: from correo.able.es ([212.97.169.185]) by
          jalon.able.es (Netscape Messaging Server 4.15) with SMTP id
          G9HTAT00.4CB; Thu, 1 Mar 2001 01:28:05 +0100 
Date:   Thu, 1 Mar 2001 01:27:14 +0100
From:   "J . A . Magallon" <jamagallon@able.es>
To:     "Albert D . Cahalan" <acahalan@cs.uml.edu>
Cc:     "J . A . Magallon" <jamagallon@able.es>,
        Rob Cermak <cermak@IMCS.rutgers.edu>,
        linux-cluster@nl.linux.org
Subject: Re: cluster list
Message-ID: <20010301012714.C1256@werewolf.able.es>
References: <20010228234630.A1256@werewolf.able.es> <200102282347.f1SNlUM164458@saturn.cs.uml.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
In-Reply-To: <200102282347.f1SNlUM164458@saturn.cs.uml.edu>; from acahalan@cs.uml.edu on Thu, Mar 01, 2001 at 00:47:30 +0100
X-Mailer: Balsa 1.1.1
Content-Length: 1700
Lines:  43
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list


On 03.01 Albert D. Cahalan wrote:
> 
> In that case, you have everything you need and no reason to even
> pay attention to this list. The existing stuff works fine.
>

But each one works in its own way. I thought this list would try to unify all
approaches in the basic things they need.
BTW, I have been looking for a DSM implementation, and found none. And
bproc only works in 2.2.

> 
> Mine:
> 
> Each node consists of a CPU, local RAM, optional flash ROM, L3 cache,
> and a multi-purpose chip that does DMA between nodes. There isn't any
> expansion opportunity on the node, or even a PCI chip.
> 
> So stop assuming clusters have network cards! If I had one, it would
> be an add-on device accessed via a dumb (non-CPU) bridge node. There
> are some problems here too... which node(s) control(s) the device?
> Can (should) node 42 tell the device to use data in node 33?
>

Well, focus on that kind of special-purpose node and how many people will
test it, apart from you and your special-DMA-chip supplier?

I think about nodes connected by some link; the easiest is Ethernet (if the
net is dedicated, it does not need to use TCP), but the communication device
should be something configurable, say 'node, start cluster membership
on dev=xxx'.

If you boot each node diskless, clusterfs can read /etc/something
where the device is marked, and a map between 'address' (IP or whatever)
and node number is given. In your case that map would be in flash; mine would
be on NFS.

-- 
J.A. Magallon                                                      $> cd pub
mailto:jamagallon@able.es                                          $> more beer

Linux werewolf 2.4.2-ac6 #1 SMP Wed Feb 28 01:53:51 CET 2001 i686



From owner-linux-cluster@nl.linux.org Thu Mar  1 01:32:33 2001
Received: by humbolt.nl.linux.org id <S92345AbRCAAcP>;
	Thu, 1 Mar 2001 01:32:15 +0100
Received: from imladris.infradead.org ([194.205.184.45]:53516 "EHLO
        infradead.org") by humbolt.nl.linux.org with ESMTP
	id <S92343AbRCAAbv>; Thu, 1 Mar 2001 01:31:51 +0100
Received: from jalon.able.es ([212.97.163.2])
	by infradead.org with esmtp (Exim 3.20 #2)
	id 14YH0Y-0004QH-00
	for linux-cluster@nl.linux.org; Thu, 01 Mar 2001 00:31:50 +0000
Received: from correo.able.es ([212.97.169.185]) by
          jalon.able.es (Netscape Messaging Server 4.15) with SMTP id
          G9HTHL00.K6L; Thu, 1 Mar 2001 01:32:09 +0100 
Date:   Thu, 1 Mar 2001 01:31:19 +0100
From:   "J . A . Magallon" <jamagallon@able.es>
To:     "Albert D . Cahalan" <acahalan@cs.uml.edu>
Cc:     linux-cluster@nl.linux.org
Subject: Re: cluster list
Message-ID: <20010301013119.E1256@werewolf.able.es>
References: <Pine.LNX.4.33.0102280044541.1961-100000@duckman.distro.conectiva> <200103010019.f210J05141359@saturn.cs.uml.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
In-Reply-To: <200103010019.f210J05141359@saturn.cs.uml.edu>; from acahalan@cs.uml.edu on Thu, Mar 01, 2001 at 01:19:00 +0100
X-Mailer: Balsa 1.1.1
Content-Length: 816
Lines:  24
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list


On 03.01 Albert D. Cahalan wrote:
> 
> Non-kernel solutions already exist. PVM works great, doesn't it?
> Discussion would be pointless.
>

I don't really consider PVM as 'clustering'. You have to do everything
on your own.

> Mosix is interesting. NUMA is interesting. Maintaining a single
> system image across multiple kernels is interesting. Running
> real-time processes is interesting.
> 

NUMA is my preferred choice. Just set up a common address space, launch a
thread that can go to another node, and let the thread access memory
without knowledge of its locality or distance.

-- 
J.A. Magallon                                                      $> cd pub
mailto:jamagallon@able.es                                          $> more beer

Linux werewolf 2.4.2-ac6 #1 SMP Wed Feb 28 01:53:51 CET 2001 i686



From owner-linux-cluster@nl.linux.org Thu Mar  1 02:08:04 2001
Received: by humbolt.nl.linux.org id <S92356AbRCABHZ>;
	Thu, 1 Mar 2001 02:07:25 +0100
Received: from gw.xkey.com ([206.86.100.52]:36874 "EHLO happy.xkey.com")
	by humbolt.nl.linux.org with ESMTP id <S92351AbRCABHF>;
	Thu, 1 Mar 2001 02:07:05 +0100
Received: (from smtp@localhost) by happy.xkey.com
	id RAA16610 for <linux-cluster@nl.linux.org>; Wed, 28 Feb 2001 17:07:03 -0800
Received: from hpti8.fsl.noaa.gov(137.75.132.228) by happy.xkey.com via smtp (V1.3)
	id sma016244; Wed Feb 28 16:59:23 2001
Received: (from lindahl@localhost)
	by localhost.hpti.com (8.11.0/8.11.0) id f210xmf02380
	for linux-cluster@nl.linux.org; Wed, 28 Feb 2001 19:59:48 -0500
X-Authentication-Warning: localhost.hpti.com: lindahl set sender to lindahl@conservativecomputer.com using -f
Date:   Wed, 28 Feb 2001 19:59:48 -0500
From:   Greg Lindahl <lindahl@conservativecomputer.com>
To:     linux-cluster@nl.linux.org
Subject: Re: cluster list
Message-ID: <20010228195948.C2191@wumpus>
Mail-Followup-To: linux-cluster@nl.linux.org
References: <20010228234630.A1256@werewolf.able.es> <200102282347.f1SNlUM164458@saturn.cs.uml.edu> <20010301012714.C1256@werewolf.able.es>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <20010301012714.C1256@werewolf.able.es>; from jamagallon@able.es on Thu, Mar 01, 2001 at 01:27:14AM +0100
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

On Thu, Mar 01, 2001 at 01:27:14AM +0100, J . A . Magallon wrote:

> BTW, i have been looking for a DSM implementation, and found none.

See: http://www.cs.umd.edu/users/keleher/dsm.html. Presumably some of
those work on Linux, I don't know which.

> And bproc only works in 2.2.

Scyld expects to release it for 2.4 in a couple of weeks.

-- g


From owner-linux-cluster@nl.linux.org Thu Mar  1 02:08:24 2001
Received: by humbolt.nl.linux.org id <S92343AbRCABHh>;
	Thu, 1 Mar 2001 02:07:37 +0100
Received: from gw.xkey.com ([206.86.100.52]:36362 "EHLO happy.xkey.com")
	by humbolt.nl.linux.org with ESMTP id <S92345AbRCABHF>;
	Thu, 1 Mar 2001 02:07:05 +0100
Received: (from smtp@localhost) by happy.xkey.com
	id RAA16608 for <linux-cluster@nl.linux.org>; Wed, 28 Feb 2001 17:07:03 -0800
Received: from hpti8.fsl.noaa.gov(137.75.132.228) by happy.xkey.com via smtp (V1.3)
	id sma016477; Wed Feb 28 17:01:35 2001
Received: (from lindahl@localhost)
	by localhost.hpti.com (8.11.0/8.11.0) id f21121M02400
	for linux-cluster@nl.linux.org; Wed, 28 Feb 2001 20:02:01 -0500
X-Authentication-Warning: localhost.hpti.com: lindahl set sender to lindahl@conservativecomputer.com using -f
Date:   Wed, 28 Feb 2001 20:02:01 -0500
From:   Greg Lindahl <lindahl@conservativecomputer.com>
To:     linux-cluster@nl.linux.org
Subject: Re: cluster list
Message-ID: <20010228200201.D2191@wumpus>
Mail-Followup-To: linux-cluster@nl.linux.org
References: <Pine.LNX.4.33.0102280044541.1961-100000@duckman.distro.conectiva> <200103010019.f210J05141359@saturn.cs.uml.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <200103010019.f210J05141359@saturn.cs.uml.edu>; from acahalan@cs.uml.edu on Wed, Feb 28, 2001 at 07:19:00PM -0500
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

On Wed, Feb 28, 2001 at 07:19:00PM -0500, Albert D. Cahalan wrote:

> Non-kernel solutions already exist. PVM works great, doesn't it?
> Discussion would be pointless.

If someone wants highly-available PVM on a cluster, discussion would
be needed. Standard PVM doesn't do that. I know Condor has the ability
to migrate PVM processes intelligently, but not failover.

It's when you want to extend something that does one thing well, the way
PVM does parallel computations well, that you run into more general cluster
issues.

-- g



From owner-linux-cluster@nl.linux.org Thu Mar  1 02:30:39 2001
Received: by humbolt.nl.linux.org id <S92355AbRCABa2>;
	Thu, 1 Mar 2001 02:30:28 +0100
Received: from gw.xkey.com ([206.86.100.52]:35339 "EHLO happy.xkey.com")
	by humbolt.nl.linux.org with ESMTP id <S92345AbRCABaE>;
	Thu, 1 Mar 2001 02:30:04 +0100
Received: (from smtp@localhost) by happy.xkey.com
	id RAA18222 for <linux-cluster@nl.linux.org>; Wed, 28 Feb 2001 17:30:02 -0800
Received: from hpti8.fsl.noaa.gov(137.75.132.228) by happy.xkey.com via smtp (V1.3)
	id sma018083; Wed Feb 28 17:25:47 2001
Received: (from lindahl@localhost)
	by localhost.hpti.com (8.11.0/8.11.0) id f211Q8E02432
	for linux-cluster@nl.linux.org; Wed, 28 Feb 2001 20:26:08 -0500
X-Authentication-Warning: localhost.hpti.com: lindahl set sender to lindahl@conservativecomputer.com using -f
Date:   Wed, 28 Feb 2001 20:26:08 -0500
From:   Greg Lindahl <lindahl@conservativecomputer.com>
To:     linux-cluster@nl.linux.org
Subject: What is a Cluster?
Message-ID: <20010228202608.E2191@wumpus>
Mail-Followup-To: linux-cluster@nl.linux.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

By the way, one thing that needs to be defined is the term
"Cluster". I recommend that you all run (not walk) to buy Greg
Pfister's book "In Search Of Clusters" and read it. He defines 3
classes:

1) High Availability Clusters
2) High Performance Clusters (run a single program fast)
3) High Throughput Clusters (run a zillion non-parallel programs)

An example of (1) is a 2-node failover mail server.
An example of (2) is a Beowulf running a single big mpi job.
An example of (3) is a Beowulf running a bunch of single-cpu gene
  comparison jobs.

Some clusters are combinations of these. Most webserver front-ends are
a combination of (3) and (1): you distribute the load of hits over
many nodes, and if one goes down, you don't send it new hits.
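
The webserver front-end case can be sketched in a few lines. This is a
minimal illustration, not how any particular load balancer works: the
FrontEnd class, its method names and the round-robin policy are all
assumptions. Hits are spread over the nodes (the high-throughput part), and
a node marked down simply stops receiving new hits (the high-availability
part).

```python
# Hypothetical round-robin front end that skips nodes marked as down.

class FrontEnd:
    def __init__(self, nodes):
        self.nodes = list(nodes)     # fixed order used for round-robin
        self.up = set(self.nodes)    # nodes currently believed healthy
        self._next = 0               # index of the next node to try

    def mark_down(self, node):
        """Stop sending new hits to this node."""
        self.up.discard(node)

    def mark_up(self, node):
        """Resume sending hits to a known node."""
        if node in self.nodes:
            self.up.add(node)

    def pick(self):
        """Return the next healthy node, trying each node at most once."""
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next]
            self._next = (self._next + 1) % len(self.nodes)
            if node in self.up:
                return node
        raise RuntimeError("no nodes available")


if __name__ == "__main__":
    fe = FrontEnd(["web1", "web2", "web3"])
    print([fe.pick() for _ in range(4)])
    fe.mark_down("web2")
    print([fe.pick() for _ in range(4)])
```

How the front end learns that a node is down (heartbeats, failed requests,
an operator) is exactly the kind of membership question the HA community
deals with, and it is left out of this sketch.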

I think these definitions help give a good idea of the scope that
cluster services can serve. It's often the case that the community
that works on (1) never talks to the community that does (2) or
(3). And people doing (2) and (3) for enterprise apps often don't talk
to the people doing (2) or (3) for technical computing. It would be
nice if this list turned into a forum where various groups could meet
and figure out how we can help each other.

-- greg



From owner-linux-cluster@nl.linux.org Thu Mar  1 03:10:53 2001
Received: by humbolt.nl.linux.org id <S92350AbRCACKd>;
	Thu, 1 Mar 2001 03:10:33 +0100
Received: from saturn.cs.uml.edu ([129.63.8.2]:5138 "EHLO saturn.cs.uml.edu")
	by humbolt.nl.linux.org with ESMTP id <S92343AbRCACJv>;
	Thu, 1 Mar 2001 03:09:51 +0100
Received: (from acahalan@localhost)
	by saturn.cs.uml.edu (8.11.0/8.11.2) id f2128vn124745;
	Wed, 28 Feb 2001 21:08:58 -0500 (EST)
From:   "Albert D. Cahalan" <acahalan@cs.uml.edu>
Message-Id: <200103010208.f2128vn124745@saturn.cs.uml.edu>
Subject: Re: cluster list
To:     jamagallon@able.es (J . A . Magallon)
Date:   Wed, 28 Feb 2001 21:08:57 -0500 (EST)
Cc:     acahalan@cs.uml.edu (Albert D . Cahalan),
        jamagallon@able.es (J . A . Magallon),
        cermak@IMCS.rutgers.edu (Rob Cermak), linux-cluster@nl.linux.org
In-Reply-To: <20010301012714.C1256@werewolf.able.es> from "J . A . Magallon" at Mar 01, 2001 01:27:14 AM
X-Mailer: ELM [version 2.5 PL2]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

>> Mine:
>>
>> Each node consists of a CPU, local RAM, optional flash ROM, L3 cache,
>> and a multi-purpose chip that does DMA between nodes. There isn't any
>> expansion opportunity on the node, or even a PCI chip.
>>
>> So stop assuming clusters have network cards! If I had one, it would
>> be an add-on device accessed via a dumb (non-CPU) bridge node. There
>> are some problems here too... which node(s) control(s) the device?
>> Can (should) node 42 tell the device to use data in node 33?
>
> Well, focus on that kind of special-purpose nodes and how many people
> will test-use it, apart from you and your special-dma-chip supplier ?

I'd guess much of the code could be used on large Sun and SGI boxes.
Treating a 64-way system as SMP is going to hurt, even if it works.

Wasn't Larry McVoy proposing something along these lines? As I recall
it, the example was:

You have a NUMA box with 16 processors.
Processors are in groups of 4.
You run 4 kernels, each doing 4-way SMP.
You have kernel code to make the system appear unified.

The numbers could be from a big Intel box. (the 4-way Xeon limit)
The idea would work for Sun or SGI boxes. Depending on the
details, it ought to work on my hardware too.

(there is a summit meeting on this soon... wish I was invited)

> I think about nodes connected by some link; the easiest is ethernet
> (if the net is dedicated it does not need to use tcp), but the
> communication device should be something configurable, say 'node,
> start cluster membership on dev=xxx'.

Hardware really defines what you can and should be doing. For all
of the following, assume we want to run more than one kernel.
So "SMP" means a true SMP box that we split up for performance.
I use "mail" to mean data that gets queued on the receiver and
causes an interrupt, or data that is passed into the CPU to avoid
the need for queueing.

CPU-mapped access to memory on other nodes?
  No. (your Ethernet)
  Yes, for free. (large SMP system -- your memory is my memory)
  Yes, almost free. (huge-memory SMP system with a 32-bit CPU)
  Yes, slowly. (NUMA system)
  Yes, slowly, and you can't ignore it. (my system w/ huge memory)

DMA transfer access to memory on other nodes?
  No. (your Ethernet, typical SMP, and maybe some NUMA boxes)
  Yes. (my system, and maybe SCI networks or NUMA boxes)

Can "mail" a small (few bits or bytes) message to other nodes?
  Yes, but unreliable and with horrible latency. (your Ethernet)
  Yes, synchronously, w/o broadcast. (SMP IPI, maybe NUMA too)
  Yes, about 1000 may be queued before overflow. (my system)

Can "mail" a large (several bytes or kilobytes) message to other nodes?
  Yes, unreliably. (your Ethernet)
  No. (SMP, NUMA, my system... no network card to queue it!)

Can broadcast large "mail"?
  Yes, no problem. (your Ethernet)
  No. (SMP, NUMA, my system)

Can broadcast small "mail"?
  Yes, no problem. (your Ethernet)
  Yes. (SMP, NUMA)
  Yes, s-l-o-w-l-y. (my system)

Can broadcast DMA writes?
  No; no DMA. (your Ethernet, SMP, and maybe some NUMA boxes)
  Yes, slowly. (my system)

Can broadcast CPU-mapped access?
  No; no mapped access. (your Ethernet)
  No. (SMP and NUMA)
  Yes, s-l-o-w-l-y. (my system -- heh, don't try to broadcast read)

Take the above hardware differences, then multiply by 2 to the
power of the number of features (failover, process migration)
that people might want. Then consider that people work for
competing organizations with secrets and incompatible human
languages. It should be no surprise that little work is shared.

> If you boot each node diskless, clusterfs can read /etc/something
> where the device is marked, and a map between 'address' (ip or
> whatever) and node number is given. In your case that map would
> be in flash, mine would be on nfs.

In your case, get rid of it.

Buy all your Ethernet cards from the same vendor, so only the last
three bytes of the MAC addresses differ. Those bytes are a 24-bit
node number. Assign IP addresses from the 10.x.y.z private-use area
according to node ID.

If your MAC is 0c:ad:13:00:02:05 then your node ID is 517 and
your IP address is 10.0.2.5.

Oh, such a hack... but for a wee bit of code, you get rid of all
the lookup tables. You can even kill ARP.
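
A rough sketch of the mapping described above, written in Python for
illustration only (it is not code from this thread). It assumes every card
shares one vendor prefix, so the low three bytes of the MAC are unique per
node. The function names mac_to_node_id and node_id_to_ip are made up for
this example.

```python
def mac_to_node_id(mac: str) -> int:
    """Return the low 24 bits of a colon-separated MAC address as a node ID."""
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    # Bytes 0-2 are the vendor prefix; bytes 3-5 form the 24-bit node ID.
    return (octets[3] << 16) | (octets[4] << 8) | octets[5]


def node_id_to_ip(node_id: int) -> str:
    """Map a 24-bit node ID onto the 10.x.y.z private-use range."""
    if not 0 <= node_id < (1 << 24):
        raise ValueError("node ID must fit in 24 bits")
    x = (node_id >> 16) & 0xFF
    y = (node_id >> 8) & 0xFF
    z = node_id & 0xFF
    return f"10.{x}.{y}.{z}"


if __name__ == "__main__":
    node = mac_to_node_id("0c:ad:13:00:02:05")
    print(node, node_id_to_ip(node))
```

With the MAC from the example above this gives node ID 517 and address
10.0.2.5. A real deployment would probably want to skip IDs that land on
the all-zeros or all-ones host address.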

Linux-cluster: generic cluster infrastructure for Linux
Archive:       http://mail.nl.linux.org/linux-cluster/

From owner-linux-cluster@nl.linux.org Thu Mar  1 03:26:17 2001
Received: by humbolt.nl.linux.org id <S92350AbRCACZ7>;
	Thu, 1 Mar 2001 03:25:59 +0100
Received: from gw.xkey.com ([206.86.100.52]:1295 "EHLO happy.xkey.com")
	by humbolt.nl.linux.org with ESMTP id <S92345AbRCACZh>;
	Thu, 1 Mar 2001 03:25:37 +0100
Received: (from smtp@localhost) by happy.xkey.com
	id SAA22132 for <linux-cluster@nl.linux.org>; Wed, 28 Feb 2001 18:25:35 -0800
Received: from hpti8.fsl.noaa.gov(137.75.132.228) by happy.xkey.com via smtp (V1.3)
	id sma022124; Wed Feb 28 18:25:27 2001
Received: (from lindahl@localhost)
	by localhost.hpti.com (8.11.0/8.11.0) id f212Prf02571
	for linux-cluster@nl.linux.org; Wed, 28 Feb 2001 21:25:53 -0500
X-Authentication-Warning: localhost.hpti.com: lindahl set sender to lindahl@conservativecomputer.com using -f
Date:   Wed, 28 Feb 2001 21:25:53 -0500
From:   Greg Lindahl <lindahl@conservativecomputer.com>
To:     linux-cluster@nl.linux.org
Subject: Re: cluster list
Message-ID: <20010228212553.A2564@wumpus>
Mail-Followup-To: linux-cluster@nl.linux.org
References: <20010301012714.C1256@werewolf.able.es> <200103010208.f2128vn124745@saturn.cs.uml.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <200103010208.f2128vn124745@saturn.cs.uml.edu>; from acahalan@cs.uml.edu on Wed, Feb 28, 2001 at 09:08:57PM -0500
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

On Wed, Feb 28, 2001 at 09:08:57PM -0500, Albert D. Cahalan wrote:

> Wasn't Larry McVoy proposing something along these lines? As I recall
> it, the example was:
> 
> You have a NUMA box with 16 processors.
> Processors are in groups of 4.
> You run 4 kernels, each doing 4-way SMP.
> You have kernel code to make the system appear unified.
> 
> The numbers could be from a big Intel box. (the 4-way Xeon limit)
> The idea would work for Sun or SGI boxes. Depending on the
> details, it ought to work on my hardware too.
> 
> (there is a summit meeting on this soon... wish I was invited)

Yes, Larry has been seeking to trade the problem of too many kernel
locks for the problem of too much concurrency. Those of us who have
worked on distributed systems aren't so sure he's headed in the right
direction. But it's an interesting idea.

But, like every other approach, it's going to have limits even if it
works the way Larry thinks it will. For example, Larry still needs a
global filesystem.

-- g



From owner-linux-cluster@nl.linux.org Thu Mar  1 03:39:09 2001
Received: by humbolt.nl.linux.org id <S92352AbRCACiq>;
	Thu, 1 Mar 2001 03:38:46 +0100
Received: from saturn.cs.uml.edu ([129.63.8.2]:39954 "EHLO saturn.cs.uml.edu")
	by humbolt.nl.linux.org with ESMTP id <S92351AbRCACiK>;
	Thu, 1 Mar 2001 03:38:10 +0100
Received: (from acahalan@localhost)
	by saturn.cs.uml.edu (8.11.0/8.11.2) id f212c0i129405;
	Wed, 28 Feb 2001 21:38:00 -0500 (EST)
From:   "Albert D. Cahalan" <acahalan@cs.uml.edu>
Message-Id: <200103010238.f212c0i129405@saturn.cs.uml.edu>
Subject: Re: cluster list
To:     lindahl@conservativecomputer.com (Greg Lindahl)
Date:   Wed, 28 Feb 2001 21:38:00 -0500 (EST)
Cc:     linux-cluster@nl.linux.org
In-Reply-To: <20010228191824.B2191@wumpus> from "Greg Lindahl" at Feb 28, 2001 07:18:24 PM
X-Mailer: ELM [version 2.5 PL2]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

Greg Lindahl writes:
> On Wed, Feb 28, 2001 at 06:47:30PM -0500, Albert D. Cahalan wrote:

>> Each node consists of a CPU, local RAM, optional flash ROM, L3 cache,
>> and a multi-purpose chip that does DMA between nodes. There isn't any
>> expansion opportunity on the node, or even a PCI chip.
>>
>> So stop assuming clusters have network cards!
>
> Your nodes are much like CPUs on the Cray T3E. Sometimes people think
> of your DMA chip as a network card; I do DMA puts and gets with
> Myrinet network cards.

To make things clear, what behaviors can you get?

1 initiator==sender, data queued on destination
2 initiator==sender, initiator chooses memory location on destination
3 initiator==destination, data queued on destination
4 initiator==destination, choosing where to put the data in advance
5 initiator==3rd-party, data queued on destination
6 initiator==3rd-party, initiator chooses memory location on destination

I get #2 and #4 between CPUs. Maybe #6 works too, if I want to risk
remote control of another node's DMA engine. All of #2, #4, and #6
would work fine with devices, including a variant of #6 where I call
myself the 3rd party.

> Devices on a system like this usually sit on
> special CPUs and you have to DMA them a request or pass a message to
> the remote OS instance in order to talk to the device, instead of
> talking to the device itself.

Not for me. I get a bridge to PCI. I guess you could say it has
an IO MMU, since the initiator (any node in the system) must set up
the bridge to direct PCI DMA to/from the right nodes. PCI interrupts
are sent to nodes as mail; they might not go to the node getting data.
(so node 18, getting the interrupt mail and controlling a SCSI device,
could cause SCSI transfers directly to/from node 101)


From owner-linux-cluster@nl.linux.org Thu Mar  1 05:04:57 2001
Received: by humbolt.nl.linux.org id <S92171AbRCAEEb>;
	Thu, 1 Mar 2001 05:04:31 +0100
Received: from usw-dsl-225.102.denco.rmi.net ([166.93.225.102]:7669 "EHLO
        laptop.linux-ha.org") by humbolt.nl.linux.org with ESMTP
	id <S92170AbRCAEEN>; Thu, 1 Mar 2001 05:04:13 +0100
Received: from unix.sh (localhost [127.0.0.1])
	by laptop.linux-ha.org (Postfix on SuSE Linux 7.0 (i386)) with ESMTP
	id E46AF17779; Wed, 28 Feb 2001 21:03:49 -0700 (MST)
Message-ID: <3A9DCA25.B8848C20@unix.sh>
Date:   Wed, 28 Feb 2001 21:03:49 -0700
From:   Alan Robertson <alanr@unix.sh>
Organization: Linux-HA
X-Mailer: Mozilla 4.73 [en] (X11; I; Linux 2.2.16 i686)
X-Accept-Language: en
MIME-Version: 1.0
To:     Greg Lindahl <lindahl@conservativecomputer.com>
Cc:     Linux Cluster <linux-cluster@nl.linux.org>
Subject: Re: inventory
References: <20010228175742.A2077@wumpus> <20010228231656.22332.qmail@web9208.mail.yahoo.com> <20010228191125.A2191@wumpus>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

Greg Lindahl wrote:
> 
> On Wed, Feb 28, 2001 at 03:16:56PM -0800, Peter Badovinatz wrote:
> 
> > but we still have to agree as to the positions or layout.  We also
> > usually support (in fact usually demand to eliminate single points
> > of failure) each node have multiple network connections, and
> > multiple names, so a single hostname is not sufficient for us.
> 
> I thought that in most HA clusters, each machine has a "true name" (&
> IP) that's unique and never changes, and then other names (& IP
> addresses) that are associated with services that move around when
> things fail?

Generally that's the model.

> A single, unique hostname is still sufficient, and you
> can make them one byte. Then you don't have any need for an extension?

Making things one byte long is not useful for large clusters and unaesthetic
to say the least.

One can give one's nodes names.  One can also give one's nodes a dense set
of node numbers.  There is a purpose to each.  Node numbering typically needs
to be dense.  Node names typically need to be permanent (for tracking
failures, etc).

Here's why one might want a dense set of integral node numbers:
Sending around bitmaps for determining cluster membership.

Here's why one might want node names:
A permanent name for a node persists as you add and delete nodes in
the cluster and allows you to track what's on the given node over time.  So,
if you see that you've had 3 crashes on node foo, you know that it's always
been the same computational unit, and not 3 different ones (which could
happen if one adds or deletes nodes in the system).

So, it seems pretty clear to me that one wants both node names and node
numbers, and a canonical way to refer to a node in either domain.

Then the question comes up:
Which way do you refer to the node in which context, and how do you get from
one domain to the other?
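
One possible shape for the two domains, as a minimal Python sketch and not
any existing cluster manager's API: a registry hands out dense numbers to
permanent names, and membership travels as a bitmap with one bit per node
number. NodeRegistry, encode_membership and decode_membership are
hypothetical names. The sketch never reuses or compacts numbers, so keeping
the numbering dense after a node is deleted is left open.

```python
class NodeRegistry:
    """Maps permanent node names to dense node numbers and back."""

    def __init__(self):
        self._number_of = {}  # name -> number
        self._name_of = []    # number -> name

    def __len__(self):
        return len(self._name_of)

    def add(self, name):
        """Register a name and return its number (idempotent)."""
        if name not in self._number_of:
            self._number_of[name] = len(self._name_of)
            self._name_of.append(name)
        return self._number_of[name]

    def number(self, name):
        return self._number_of[name]

    def name(self, number):
        return self._name_of[number]


def encode_membership(registry, live_names):
    """Pack the set of live nodes into a bitmap, one bit per node number."""
    bits = 0
    for name in live_names:
        bits |= 1 << registry.number(name)
    size = max(1, (len(registry) + 7) // 8)
    return bits.to_bytes(size, "little")


def decode_membership(registry, bitmap):
    """Turn a bitmap back into the list of live node names."""
    bits = int.from_bytes(bitmap, "little")
    return [registry.name(n) for n in range(len(registry)) if (bits >> n) & 1]
```

Messages between nodes carry only the bitmap; names appear in logs and in
whatever persistent store tracks a node's history.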

	-- Alan Robertson
	   alanr@unix.sh


From owner-linux-cluster@nl.linux.org Thu Mar  1 05:25:40 2001
Received: by humbolt.nl.linux.org id <S92170AbRCAEZY>;
	Thu, 1 Mar 2001 05:25:24 +0100
Received: from usw-dsl-225.102.denco.rmi.net ([166.93.225.102]:21493 "EHLO
        laptop.linux-ha.org") by humbolt.nl.linux.org with ESMTP
	id <S92166AbRCAEY6>; Thu, 1 Mar 2001 05:24:58 +0100
Received: from unix.sh (localhost [127.0.0.1])
	by laptop.linux-ha.org (Postfix on SuSE Linux 7.0 (i386)) with ESMTP id 94F6717779
	for <linux-cluster@nl.linux.org>; Wed, 28 Feb 2001 21:24:34 -0700 (MST)
Message-ID: <3A9DCF01.66AE8403@unix.sh>
Date:   Wed, 28 Feb 2001 21:24:33 -0700
From:   Alan Robertson <alanr@unix.sh>
Organization: Linux-HA
X-Mailer: Mozilla 4.73 [en] (X11; I; Linux 2.2.16 i686)
X-Accept-Language: en
MIME-Version: 1.0
To:     linux-cluster <linux-cluster@nl.linux.org>
Subject: High Availability versus Automatic Process Migration
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

There are two problems associated with automatic process migration when
considered in the context of HA:

1) The HA cluster manager wants to control what's running where *all the
   time*, so it can take proper recovery actions and guarantee the
   paranoid-by-definition customer what they want in terms of migration
   strategies.

2) Some kinds of process migration (like Mosix) are low-availability
   solutions: if every process has migrated from its original machine, loss
   of any node can kill all processes (for a 2-node system).

Successful HA customers often have these characteristics:
	Availability and data integrity are EVERYTHING
	control-freaks
	anal-retentive
	paranoid
	perfectionists

I don't know enough about HPC customers, but I don't associate these
characteristics with the HPC arena.

My guess is that it will be these differences that push the solutions
farther apart and split them into separate market niches - even if all the
technology could otherwise be common.

Having said that, there are MANY possible common elements between HA and HPC
clusters, and many common solutions are possible.

A few examples come readily to mind:
	Cluster membership and corresponding event APIs
	single-image boot
	cluster filesystems
	system monitoring
	node reset mechanisms (i.e., Stonith)

Unless a given feature REQUIRES a kernel implementation by definition, I
would strongly recommend against using /proc-like interfaces for user
programs, but instead define an API which can be easily and sensibly
implemented by user-level programs.

Of course, if there is a /proc-thinggie around, the API could just turn
around and ask /proc (through a plug-in model).  BUT, the applications
shouldn't be doing this themselves.

Heartbeat is a user-space cluster manager.  We originally implemented a
/proc interface for it because it was cool, and could be common with kernel
implementations.  It was also a mistake, and has been dropped.  I now
believe that doing it the other way around makes more sense.
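
To illustrate the plug-in idea, here is a small Python sketch. It is not
Heartbeat's API; MembershipBackend, StaticBackend, FileBackend and
cluster_members are invented for this example. Applications call one
user-level function, and a backend decides where the answer comes from. One
backend could read a /proc-style file if such a file exists, but the
application never opens it directly.

```python
from abc import ABC, abstractmethod
from typing import List


class MembershipBackend(ABC):
    """Plug-in interface: anything that can report the current members."""

    @abstractmethod
    def members(self) -> List[str]:
        ...


class StaticBackend(MembershipBackend):
    """Backend fed by a user-level program holding the list in memory."""

    def __init__(self, names):
        self._names = list(names)

    def members(self) -> List[str]:
        return list(self._names)


class FileBackend(MembershipBackend):
    """Backend that reads one node name per line from a file.

    The path is an assumption; it could point at a kernel-exported file
    or at an ordinary file written by a daemon.
    """

    def __init__(self, path):
        self._path = path

    def members(self) -> List[str]:
        with open(self._path) as handle:
            return [line.strip() for line in handle if line.strip()]


def cluster_members(backend: MembershipBackend) -> List[str]:
    """The only call an application makes; the source stays hidden."""
    return sorted(backend.members())
```

Swapping a kernel implementation for a user-space one then means swapping
the backend, with no change to the applications.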

	-- Alan Robertson
	   alanr@unix.sh


From owner-linux-cluster@nl.linux.org Thu Mar  1 06:11:55 2001
Received: by humbolt.nl.linux.org id <S92173AbRCAFL2>;
	Thu, 1 Mar 2001 06:11:28 +0100
Received: from gw.xkey.com ([206.86.100.52]:48389 "EHLO happy.xkey.com")
	by humbolt.nl.linux.org with ESMTP id <S92171AbRCAFLK>;
	Thu, 1 Mar 2001 06:11:10 +0100
Received: (from smtp@localhost) by happy.xkey.com
	id VAA31591 for <linux-cluster@nl.linux.org>; Wed, 28 Feb 2001 21:11:08 -0800
Received: from unknown(64.134.22.31) by happy.xkey.com via smtp (V1.3)
	id sma031586; Wed Feb 28 21:10:59 2001
Received: (from lindahl@localhost)
	by localhost.hpti.com (8.11.0/8.11.0) id f215BPZ01389
	for linux-cluster@nl.linux.org; Thu, 1 Mar 2001 00:11:25 -0500
X-Authentication-Warning: localhost.hpti.com: lindahl set sender to lindahl@conservativecomputer.com using -f
Date:   Thu, 1 Mar 2001 00:11:25 -0500
From:   Greg Lindahl <lindahl@conservativecomputer.com>
To:     linux-cluster <linux-cluster@nl.linux.org>
Subject: Re: High Availability versus Automatic Process Migration
Message-ID: <20010301001125.A1361@wumpus.int.den.wayport.net>
Mail-Followup-To: linux-cluster <linux-cluster@nl.linux.org>
References: <3A9DCF01.66AE8403@unix.sh>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <3A9DCF01.66AE8403@unix.sh>; from alanr@unix.sh on Wed, Feb 28, 2001 at 09:24:33PM -0700
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

On Wed, Feb 28, 2001 at 09:24:33PM -0700, Alan Robertson wrote:

> I don't know enough about HPC customers, but I don't associate these
> characteristics with the HPC arena.

You'd be surprised. But even the national weather service doesn't mind
if their forecast takes twice as long to run on (rare) occasion; they
just want the answer almost all of the time in the allowed window, and
they throw extra cpus at the problem to get the time down to a small
fraction of the window. Then a rerun due to failure doesn't violate
the deadline.

So, the timescale for getting the right answer is different. For most
commercial HA clusters, you want to transfer in fractions of seconds
or seconds. For an HPC system, well, if I have to occasionally restart
that 100 node job from the beginning, it's not the end of the
world... I just don't want the user to get back a failure because a
node died.

Not surprisingly, this affects cost. The HPC version of weak HA
requires little extra equipment. But even on my HPC system, I'd like
to have my admin node with the queue system, etc., be a 2-node HA
system. Not to mention the controller for my parallel filesystem and
mass-store... as long as it's cheap enough.

> A few examples come readily to mind:
> 	Cluster membership and corresponding event APIs
> 	single-image boot
> 	cluster filesystems
> 	system monitoring
> 	node reset mechanisms (i.e., Stonith)

These are related, yes. Gee, I never realized Stonith needed a name...
I use APC masterswitches so I can remotely power cycle nodes, just for
system admin convenience.

-- g


From owner-linux-cluster@nl.linux.org Thu Mar  1 06:17:36 2001
Received: by humbolt.nl.linux.org id <S92170AbRCAFRR>;
	Thu, 1 Mar 2001 06:17:17 +0100
Received: from web9205.mail.yahoo.com ([216.136.129.38]:54026 "HELO
        web9205.mail.yahoo.com") by humbolt.nl.linux.org with SMTP
	id <S92166AbRCAFQ5>; Thu, 1 Mar 2001 06:16:57 +0100
Message-ID: <20010301051648.60582.qmail@web9205.mail.yahoo.com>
Received: from [32.101.83.124] by web9205.mail.yahoo.com; Wed, 28 Feb 2001 21:16:48 PST
Date:   Wed, 28 Feb 2001 21:16:48 -0800 (PST)
From:   Peter Badovinatz <tabmowzo@yahoo.com>
Subject: Re: inventory
To:     Linux Cluster <linux-cluster@nl.linux.org>
In-Reply-To: <20010228191125.A2191@wumpus>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list


--- Greg Lindahl <lindahl@conservativecomputer.com> wrote:
> On Wed, Feb 28, 2001 at 03:16:56PM -0800, Peter Badovinatz wrote:
> 
> > but we still have to agree as to the positions or layout.  We also
> > usually support (in fact usually demand to eliminate single points
> > of failure) each node have multiple network connections, and
> > multiple names, so a single hostname is not sufficient for us.
> 
> I thought that in most HA clusters, each machine has a "true name" (&
> IP) that's unique and never changes, and then other names (& IP
> addresses) that are associated with services that move around when
> things fail? A single, unique hostname is still sufficient, and you
> can make them one byte. Then you don't have any need for an extension?
> 
That is most common, yes.  However, there is one example (with which I'm
familiar) where this is NOT the case:  HACMP on AIX.  All IP addresses are
migratable; there is no persistent node name for any machine.  It takes the
node number, held in a local config file, and follows some agreement protocols
to "advertise" the IP addresses which it owns at the time it boots.  It is
common for the node to have a persistent hostname/IP address, but, HACMP (which
is managing all of the HA) never uses that name and doesn't control it, so if
the adapter to which it's connected dies, that address dies.

This does not mean this is the right or the wrong model; it is a model.

> -- greg


=====
These have been the opinions of:
Peter R. Badovinatz -- (503)578-5530 (TL 775)
wombat@us.ibm.com/tabmowzo@yahoo.com
and in no way should be construed as official opinion of 
IBM, Corp.


From owner-linux-cluster@nl.linux.org Thu Mar  1 06:54:43 2001
Received: by humbolt.nl.linux.org id <S92175AbRCAFye>;
	Thu, 1 Mar 2001 06:54:34 +0100
Received: from hilbert.umkc.edu ([134.193.4.60]:8976 "HELO tesla.umkc.edu")
	by humbolt.nl.linux.org with SMTP id <S92170AbRCAFyO>;
	Thu, 1 Mar 2001 06:54:14 +0100
Received: (qmail 456921 invoked from network); 1 Mar 2001 05:53:21 -0000
Received: from nicol1.umkc.edu (HELO kasey.umkc.edu) (david@134.193.4.62)
  by hilbert.umkc.edu with SMTP; 1 Mar 2001 05:53:21 -0000
Message-ID: <3A9DE3CF.A81A7EBE@kasey.umkc.edu>
Date:   Wed, 28 Feb 2001 23:53:19 -0600
From:   "David L. Nicol" <david@kasey.umkc.edu>
Organization: University of Missouri - Kansas City   supercomputing infrastructure
X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.0 i586)
X-Accept-Language: en
MIME-Version: 1.0
To:     linux-cluster@nl.linux.org
Subject: NonUniformMemoryAccess and swapping
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list


do we have hot-tunable swap device priority yet?

Idea: if each machine offers a "network device" to each
other, for use as a swap space in addition to their own disk,
that might be a faster Big VM than swapping to node-local disk.

shared memory still needs to rendezvous through a single control
point though.

-- 
                      David Nicol 816.235.1187 dnicol@cstp.umkc.edu
                           Damn! Someone stole my book on security!



From owner-linux-cluster@nl.linux.org Thu Mar  1 07:26:29 2001
Received: by humbolt.nl.linux.org id <S92173AbRCAG0N>;
	Thu, 1 Mar 2001 07:26:13 +0100
Received: from usw-dsl-225.102.denco.rmi.net ([166.93.225.102]:33270 "EHLO
        laptop.linux-ha.org") by humbolt.nl.linux.org with ESMTP
	id <S92175AbRCAGZp>; Thu, 1 Mar 2001 07:25:45 +0100
Received: from unix.sh (localhost [127.0.0.1])
	by laptop.linux-ha.org (Postfix on SuSE Linux 7.0 (i386)) with ESMTP
	id 854741778C; Wed, 28 Feb 2001 23:25:21 -0700 (MST)
Message-ID: <3A9DEB50.29AAABDE@unix.sh>
Date:   Wed, 28 Feb 2001 23:25:20 -0700
From:   Alan Robertson <alanr@unix.sh>
Organization: Linux-HA
X-Mailer: Mozilla 4.73 [en] (X11; I; Linux 2.2.16 i686)
X-Accept-Language: en
MIME-Version: 1.0
To:     Greg Lindahl <lindahl@conservativecomputer.com>
Cc:     linux-cluster <linux-cluster@nl.linux.org>
Subject: Re: High Availability versus Automatic Process Migration
References: <3A9DCF01.66AE8403@unix.sh> <20010301001125.A1361@wumpus.int.den.wayport.net>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

Greg Lindahl wrote:
> 
> On Wed, Feb 28, 2001 at 09:24:33PM -0700, Alan Robertson wrote:
> 
> > I don't know enough about HPC customers, but I don't associate these
> > characteristics with the HPC arena.
> 
> You'd be surprised. But even the national weather service doesn't mind
> if their forecast takes twice as long to run on (rare) occasion; they
> just want the answer almost all of the time in the allowed window, and
> they throw extra cpus at the problem to get the time down to a small
> fraction of the window. Then a rerun due to failure doesn't violate
> the deadline.
> 
> So, the timescale for getting the right answer is different. For most
> commercial HA clusters, you want to transfer in fractions of seconds
> or seconds. For an HPC system, well, if I have to occasionally restart
> that 100 node job from the beginning, it's not the end of the
> world... I just don't want the user to get back a failure because a
> node died.

I do understand.  It's just that HA by its very nature systematically
attracts customers who are unusually paranoid ;-)   (Although several
examples of HA/HPC systems come to mind).

> Not surprisingly, this affects cost. The HPC version of weak HA
> requires little extra equipment. But even on my HPC system, I'd like
> to have my admin node with the queue system, etc., be a 2-node HA
> system. Not to mention the controller for my parallel filesystem and
> mass-store... as long as it's cheap enough.

But of course, having a single node as a parallel filesystem controller or
mass store is probably not a very scalable design (when compared to GFS for
example) ;-)

> > A few examples come readily to mind:
> >       Cluster membership and corresponding event APIs
> >       single-image boot
> >       cluster filesystems
> >       system monitoring
> >       node reset mechanisms (i.e., Stonith)
> 
> These are related, yes. Gee, I never realized Stonith needed a name...
> I use APC masterswitches so I can remotely power cycle nodes, just for
> system admin convenience.

STONITH == Shoot The Other Node In The Head - a memorable acronym.  In the
HA case, you probably want something better than the APC switches, which
take only one power input, so that your power doesn't become an SPOF.
[i.e., the APC switches aren't sufficiently paranoid ;-)]

Stonith means basically that one node can reset another node under program
control.  This is a little different than the desire to do it manually.  We
actually have a library and an API for doing this for several kinds of
mechanisms.

	-- Alan Robertson
	   alanr@unix.sh


From owner-linux-cluster@nl.linux.org Thu Mar  1 09:00:32 2001
Received: by humbolt.nl.linux.org id <S92171AbRCAIAM>;
	Thu, 1 Mar 2001 09:00:12 +0100
Received: from [195.22.75.68] ([195.22.75.68]:25723 "EHLO beregond")
	by humbolt.nl.linux.org with ESMTP id <S92170AbRCAH7k>;
	Thu, 1 Mar 2001 08:59:40 +0100
Received: from arrowhead.se (IDENT:joh@beregond [127.0.0.1])
	by beregond (8.11.0/8.11.0) with ESMTP id f217wfQ08733
	for <linux-cluster@nl.linux.org>; Thu, 1 Mar 2001 08:58:41 +0100
Message-ID: <3A9E0130.A5A7E93E@arrowhead.se>
Date:   Thu, 01 Mar 2001 08:58:40 +0100
From:   Josef =?iso-8859-1?Q?H=F6=F6k?= <josef.hook@arrowhead.se>
Organization: Arrowhead
X-Mailer: Mozilla 4.75 [en] (X11; U; Linux 2.4.0-SGI_XFS_PRsmp i686)
X-Accept-Language: en
MIME-Version: 1.0
To:     linux-cluster@nl.linux.org
Subject: Re: cluster list
References: <200102281717.f1SHHYe54669@saturn.cs.uml.edu> <Pine.SOL.4.21.0102281332180.28777-100000@imcs.rutgers.edu> <20010228234630.A1256@werewolf.able.es>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

"J . A . Magallon" wrote:

> On 02.28 Rob Cermak wrote:
> >
> > The question posed by the kernel group:
> >   Is there a way to add to the existing monolithic kernel to
> >   satisfy the needs of these groups?  Common API to handle
> >   process, memory, network sharing in cluster arrangements.
> >
> > It would be nice if there was a combination of kernel modules
> > and user-space tools not requiring a whole hip replacement.
> >
>
> First of all, I have to say that I do not know too much about kernel
> internals. I work in realistic image synthesis, and I have written threaded
> programs in SMP shared mem boxes, worked with message passing packages and
> worked slightly with things like POE in SP2. My main interest is in
> getting a cluster built with low end boxes (low end relative to multiprocessing
> boxes, some 2-way pc boards) linked with 100Mb ether and its own switch.
> University budgets do not give too much space to dream with 64-way SGI or
> Sun nodes.
>
> As everybody says, all that can be done in user space should be done that
> way.
>
> But there are many things that all packages do that will be faster if
> done in kernel space. And some that have to be done in kernel if you
> want certain type of clustering.
>
> For example, PVM or MPI configure clusters at user level, but if you want
> to use DSM or NUMA (with one level being other node), the kernel has to move
> processes or data, so kernel needs to know about the cluster.
>
> I think the first thing that should be analyzed (as someone posted previously)
> is how each package defines node groups to build a cluster and give a common
> interface available for all of them. Each package has its own /etc/nodes.cfg
> or similar.
>
> It would be fine to have something like
> /cluster/node/0/ip
>                 mem
>                 bogomips
> /cluster/node/1/ip
> ..
> /cluster/node/self -> 1
> ..
>

That surely reminds me of the Plan 9 structure.
What about having it like this instead?

    /cluster/node/0
                    ctl
                    data
                    listen
                    local
                    remote
                    status

and in

    /cluster/node
                  0
                  1
                  clone  (adding a new node to the system)

As in Plan 9, it would be possible to do

merari% cat local remote status
192.168.11.10
192.168.11.11
jadi jadi jadi.....
some info...
merari%
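
A program could walk such a tree with a few lines of code. The following
Python sketch assumes the layout proposed above exists under some root
directory; no such filesystem is implemented here, and list_nodes and
read_node are hypothetical helpers.

```python
from pathlib import Path


def list_nodes(root):
    """Return the node numbers found under <root>/node, in ascending order."""
    node_root = Path(root) / "node"
    return sorted(
        int(entry.name)
        for entry in node_root.iterdir()
        if entry.is_dir() and entry.name.isdigit()  # skips 'clone'
    )


def read_node(root, number, fields=("local", "remote", "status")):
    """Read the requested per-node files, skipping any that are absent."""
    node_dir = Path(root) / "node" / str(number)
    info = {}
    for field in fields:
        path = node_dir / field
        if path.is_file():
            info[field] = path.read_text().strip()
    return info
```

Calling read_node("/cluster", 0) would then return the same data as the
cat command above, keyed by file name.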

/joh



From owner-linux-cluster@nl.linux.org Thu Mar  1 14:16:29 2001
Received: by humbolt.nl.linux.org id <S92171AbRCANPs>;
	Thu, 1 Mar 2001 14:15:48 +0100
Received: from brutus.conectiva.com.br ([200.250.58.146]:27379 "EHLO
        thor.distro.conectiva") by humbolt.nl.linux.org with ESMTP
	id <S92170AbRCANPR>; Thu, 1 Mar 2001 14:15:17 +0100
Received: by thor.distro.conectiva (Postfix, from userid 573)
	id 1180E403C; Thu,  1 Mar 2001 10:15:05 -0300 (EST)
Date:   Thu, 1 Mar 2001 10:15:05 -0300
From:   Fabio Olive Leite <olive@conectiva.com.br>
To:     linux-cluster@nl.linux.org
Subject: Re: What is a Cluster?
Message-ID: <20010301101505.D3129@conectiva.com.br>
References: <20010228202608.E2191@wumpus>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
User-Agent: Mutt/1.3.14i
In-Reply-To: <20010228202608.E2191@wumpus>; from lindahl@conservativecomputer.com on Wed, Feb 28, 2001 at 08:26:08PM -0500
X-URL:  http://www.advogato.org/person/olive
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

Hi there,

Ultimately, and down to the bones, the definition of Cluster is: a group of
interconnected machines that can communicate, and thus coordinate the
execution of whatever task they are supposed to do, be it serving some
network service, computing fast, serving the queues of a batch job system,
whatever.

The stress is on "communicate and thus coordinate". If the machines are
somehow aware of the connection between them, they can use this fact to
manage their actions and achieve something better than their normal
capabilities.

Hope this helps!

Fábio
-- 
( Fábio Olivé Leite -*- http://www.conectiva.com.br/~olive )


From owner-linux-cluster@nl.linux.org Thu Mar  1 16:22:11 2001
Received: by humbolt.nl.linux.org id <S92177AbRCAPVw>;
	Thu, 1 Mar 2001 16:21:52 +0100
Received: from c004-h005.c004.sfo.cp.net ([209.228.14.76]:49148 "HELO
        c004.sfo.cp.net") by humbolt.nl.linux.org with SMTP
	id <S92170AbRCAPVe>; Thu, 1 Mar 2001 16:21:34 +0100
Received: (cpmta 25445 invoked from network); 1 Mar 2001 07:21:20 -0800
Received: from lca3245.lss.emc.com (HELO jdarcy6986nk) (168.159.123.245)
  by smtp.namezero.com (209.228.14.76) with SMTP; 1 Mar 2001 07:21:20 -0800
X-Sent: 1 Mar 2001 15:21:20 GMT
Message-ID: <00fc01c0a262$d1e25660$f57b9fa8@lss.emc.com>
From:   "Jeff Darcy" <linuxguy@tambreet.com>
To:     "Linux Cluster" <linux-cluster@nl.linux.org>
References: <1111835.983423891926.JavaMail.root@bronze>
Subject: Re: inventory
Date:   Thu, 1 Mar 2001 10:18:06 -0500
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 5.50.4133.2400
X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4133.2400
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

> > I thought that in most HA clusters, each machine has a "true name" (&
> > IP) that's unique and never changes, and then other names (& IP
> > addresses) that are associated with services that move around when
> > things fail? A single, unique hostname is still sufficient, and you
> > can make them one byte. Then you don't have any need for an extension?
> >
> That is most common, yes.  However, there is one example (with which I'm
> familiar) that this is NOT the case:  HACMP on AIX.  All IP addresses are
> migratable, there is no persistent node name for any machine.

This was not true when I was working on HACMP.  Each interface had a "boot
address" that only it could use, in addition to one or more service
addresses.  When a node came up, it would use its boot address(es) to join
the cluster, and then switch to its properly-assigned service address(es).
Similarly, each node had a node name separate from all of its interface
names, precisely to avoid the sorts of confusion we're talking about.
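
A minimal sketch of the scheme described above, as an illustration only and
not HACMP code. The class names, node name, and addresses are all
hypothetical. Each interface owns a boot address that never migrates, the
node comes up on those, and it then switches to its service addresses.

```python
from dataclasses import dataclass, field


@dataclass
class Interface:
    boot_address: str                                       # only this interface may use it
    service_addresses: list = field(default_factory=list)   # these can migrate
    active: set = field(default_factory=set)                # addresses currently configured


@dataclass
class Node:
    name: str            # node name, kept separate from every interface name
    interfaces: list

    def join(self):
        """Come up on the boot addresses, which are never shared."""
        for iface in self.interfaces:
            iface.active = {iface.boot_address}

    def take_service(self):
        """Switch each interface from its boot address to its service addresses."""
        for iface in self.interfaces:
            if iface.service_addresses:
                iface.active = set(iface.service_addresses)


node = Node("nodeA", [Interface("10.0.0.1", ["192.168.1.10"])])
node.join()
node.take_service()
```

Because the node name and boot address stay fixed, the node remains
identifiable no matter which service addresses it currently holds.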

Did these things change sometime after '95, did Phoenix undo a lot of our
careful design, or is one of us misremembering?




From owner-linux-cluster@nl.linux.org Thu Mar  1 16:29:16 2001
Received: by humbolt.nl.linux.org id <S92171AbRCAP2o>;
	Thu, 1 Mar 2001 16:28:44 +0100
Received: from brutus.conectiva.com.br ([200.250.58.146]:33007 "EHLO
        brutus.conectiva.com.br") by humbolt.nl.linux.org with ESMTP
	id <S92173AbRCAP2X>; Thu, 1 Mar 2001 16:28:23 +0100
Received: from localhost (riel@localhost)
	by brutus.conectiva.com.br (8.11.2/8.11.2) with ESMTP id f21FMj825837;
	Thu, 1 Mar 2001 12:22:45 -0300
X-Authentication-Warning: duckman.distro.conectiva: riel owned process doing -bs
Date:   Thu, 1 Mar 2001 12:22:45 -0300 (BRST)
From:   Rik van Riel <riel@conectiva.com.br>
X-X-Sender:  <riel@duckman.distro.conectiva>
To:     "Albert D. Cahalan" <acahalan@cs.uml.edu>
cc:     "J . A . Magallon" <jamagallon@able.es>,
        Rob Cermak <cermak@IMCS.rutgers.edu>,
        <linux-cluster@nl.linux.org>
Subject: Re: cluster list
In-Reply-To: <200103010208.f2128vn124745@saturn.cs.uml.edu>
Message-ID: <Pine.LNX.4.33.0103011221510.1961-100000@duckman.distro.conectiva>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

On Wed, 28 Feb 2001, Albert D. Cahalan wrote:

> Take the above hardware differences, then multiply by 2 to the
> power of the number of features (failover, process migration)
> that people might want. Then considering that people work for
> competing organizations with secrets and incompatible human
> languages, it should be no surprise that little work is shared.

How would these hardware differences affect eg. a lock
manager or a global filesystem ?

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/



From owner-linux-cluster@nl.linux.org Thu Mar  1 17:51:03 2001
Received: by humbolt.nl.linux.org id <S92219AbRCAQup>;
	Thu, 1 Mar 2001 17:50:45 +0100
Received: from web9207.mail.yahoo.com ([216.136.129.40]:33288 "HELO
        web9207.mail.yahoo.com") by humbolt.nl.linux.org with SMTP
	id <S92211AbRCAQuV>; Thu, 1 Mar 2001 17:50:21 +0100
Message-ID: <20010301165018.28179.qmail@web9207.mail.yahoo.com>
Received: from [192.148.13.218] by web9207.mail.yahoo.com; Thu, 01 Mar 2001 08:50:18 PST
Date:   Thu, 1 Mar 2001 08:50:18 -0800 (PST)
From:   Peter Badovinatz <tabmowzo@yahoo.com>
Subject: Re: inventory
To:     Linux Cluster <linux-cluster@nl.linux.org>
In-Reply-To: <00fc01c0a262$d1e25660$f57b9fa8@lss.emc.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list


--- Jeff Darcy <linuxguy@tambreet.com> wrote:
> > > I thought that in most HA clusters, each machine has a "true name" (&
> > > IP) that's unique and never changes, and then other names (& IP
> > > addresses) that are associated with services that move around when
> > > things fail? A single, unique hostname is still sufficient, and you
> > > can make them one byte. Then you don't have any need for an extension?
> > >
> > That is most common, yes.  However, there is one example (with which I'm
> > familiar) that this is NOT the case:  HACMP on AIX.  All IP addresses are
> > migratable, there is no persistent node name for any machine.
> 
> This was not true when I was working on HACMP.  Each interface had a "boot
> address" that only it could use, in addition to one or more service
> addresses.  When a node came up, it would use its boot address(es) to join
> the cluster, and then switch to its properly-assigned service address(es).
> Similarly, each node had a node name separate from all of its interface
> names, precisely to avoid the sorts of confusion we're talking about.
> 
The issue, as I remember it, was that the boot addresses may not be present. 
The specific case was the 'force down' then reintegrate.  In that case, if the
adapters had their service addresses, those needed to be left active when HACMP
restarted and wanted to reintegrate.

As to the node name, you're right.  I was sloppily trying to combine too
many points.  The specific point didn't deserve the generalization I gave it.

> Did these things change sometime after '95, did Phoenix undo a lot of our
> careful design, or is one of us misremembering?

Phoenix changed nothing about existing HACMP semantics.  We had to adjust
Phoenix to account for this case.  The above behavior was present when we first
needed to support HACMP semantics.  This was around 1997/8, so I have no
knowledge of what changes, if any, occurred in HACMP between 1995 and then.


=====
These have been the opinions of:
Peter R. Badovinatz -- (503)578-5530 (TL 775)
wombat@us.ibm.com/tabmowzo@yahoo.com
and in no way should be construed as official opinion of 
IBM, Corp.


From owner-linux-cluster@nl.linux.org Thu Mar  1 21:41:13 2001
Received: by humbolt.nl.linux.org id <S92307AbRCAUkm>;
	Thu, 1 Mar 2001 21:40:42 +0100
Received: from gateway.penguincomputing.com ([64.240.166.186]:31734 "EHLO joe")
	by humbolt.nl.linux.org with ESMTP id <S92224AbRCAUjn>;
	Thu, 1 Mar 2001 21:39:43 +0100
Received: from bmartin by joe with local (Exim 3.12 #1 (Debian))
	id 14YZyj-0006hg-00
	for <linux-cluster@nl.linux.org>; Thu, 01 Mar 2001 12:47:13 -0800
Date:   Thu, 1 Mar 2001 12:47:13 -0800
From:   bmartin@penguincomputing.com
To:     linux-cluster@nl.linux.org
Subject: [bmartin@penguincomputing.com: Re: [Linux-ha-dev] Re: [riel@conectiva.com.br: [ANNOUNCE] linux-cluster list]]
Message-ID: <20010301124713.A25737@joe.penguincomputing.com>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="LZvS9be/3tNcYl/X"
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list


--LZvS9be/3tNcYl/X
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Hi everybody,

I just posted this to linux-ha-dev and Alan Robertson reminded
me that I probably should have posted it here too :)


--LZvS9be/3tNcYl/X
Content-Type: message/rfc822
Content-Disposition: inline

Return-path: <linux-ha-dev-admin@lists.community.tummy.com>
Envelope-to: bmartin@localhost
Delivery-date: Thu, 01 Mar 2001 12:33:54 -0800
Received: from localhost ([127.0.0.1])
	by joe with esmtp (Exim 3.12 #1 (Debian))
	id 14YZlp-0006dr-00
	for <bmartin@localhost>; Thu, 01 Mar 2001 12:33:53 -0800
Received: from pasta.penguincomputing.com
	by localhost with POP3 (fetchmail-5.3.3)
	for bmartin@localhost (single-drop); Thu, 01 Mar 2001 12:33:53 -0800 (PST)
Received: from community.tummy.com (IDENT:qmailr@community.tummy.com [216.17.175.194])
	by ns1.penguincomputing.com (8.9.3/8.9.3) with SMTP id MAA22762
	for <bmartin@penguincomputing.com>; Thu, 1 Mar 2001 12:25:10 -0800
Received: (qmail 12460 invoked from network); 1 Mar 2001 20:25:04 -0000
Received: from localhost (HELO community.tummy.com) (mailman@127.0.0.1)
  by localhost with SMTP; 1 Mar 2001 20:25:04 -0000
Delivered-To: mailman-lists.community.tummy.com-linux-ha-dev@lists.community.tummy.com
Received: (qmail 12419 invoked from network); 1 Mar 2001 20:24:29 -0000
Received: from gateway.penguincomputing.com (HELO joe) (64.240.166.186)
  by community.tummy.com with SMTP; 1 Mar 2001 20:24:29 -0000
Received: from bmartin by joe with local (Exim 3.12 #1 (Debian))
	id 14YZjw-0006dQ-00
	for <linux-ha-dev@lists.community.tummy.com>; Thu, 01 Mar 2001 12:31:56 -0800
Date: Thu, 1 Mar 2001 12:31:56 -0800
From: bmartin@penguincomputing.com
To: linux-ha-dev@lists.community.tummy.com
Subject: Re: [Linux-ha-dev] Re: [riel@conectiva.com.br: [ANNOUNCE] linux-cluster list]
Message-ID: <20010301123156.A25336@joe.penguincomputing.com>
References: <20010228160513.R31173@figure1.int.wirex.com> <Pine.LNX.4.31.0103010218200.21555-100000@netcore.fi>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <Pine.LNX.4.31.0103010218200.21555-100000@netcore.fi>; from pekkas@netcore.fi on Thu, Mar 01, 2001 at 02:20:20AM +0200
Sender: linux-ha-dev-admin@lists.community.tummy.com
Errors-To: linux-ha-dev-admin@lists.community.tummy.com
X-BeenThere: linux-ha-dev@lists.community.tummy.com
X-Mailman-Version: 2.0beta5
Precedence: bulk
Reply-To: linux-ha-dev@lists.community.tummy.com
List-Id: High-Availability Linux Development List <linux-ha-dev.lists.community.tummy.com>

On Thu, Mar 01, 2001 at 02:20:20AM +0200, Pekka Savola wrote:
> On Wed, 28 Feb 2001, Chris Wright wrote:
> 
> > this may be interesting.
> > -chris
> 
> I believe this is about Clustering in supercomputing sense (mosix,
> beowulf, etc.) and has very little to do with High Availability.
> 
> People should be really careful when they talk about clustering..
> 
Yes, they should.

Very careful.

To me, people who think heartbeat is just about high availability and
failover seem to be missing the point.

Yea, so mosix clusters typically don't use heartbeat.  And beowulf 
clusters don't either.  For that matter most people don't use heartbeat
right now.  

Except for the high availability failover case, one that many people
don't even think fits the definition of a cluster.  

But that is thinking in the present, the world of what's available now
for production.  

Currently the state of clustering under linux is quite varied and 
fragmented, with a myriad of approaches to a myriad of projects.  

What Rik is trying to do is to unify those efforts, specifically as they
relate to the linux kernel.

What Alan is trying to do is unify those same efforts in userspace.  This
means that he won't be producing any kind of distributed shared memory
or process migration, but that is ok as most problems are split somewhere
between userspace and kernel space.

So I think that this is all very relevant and that it would be best if
we all could share as much of a common base as possible.    

It is this very attitude that 'oh what they are doing is different.  It 
doesn't apply here' that fragments the different clustering efforts now
underway.

I agree that many of us have different agendas and are doing vastly 
different things.  

But that doesn't mean that we all don't need many of the very same
components.  

For example, the basic services (interfaces?) that heartbeat provides
could very well be useful in a mosix context.  

Suppose you have your mosix cluster all running off GFS, and one of
your machines stops responding?  What do you do to make sure it's not 
just experiencing some _really_ bad scheduling problems?  

STONITH!
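
A rough sketch of that idea, assuming a simple timeout policy. This is not
heartbeat's actual interface: the class name, the `deadtime` parameter, and
the `fence` callback are hypothetical stand-ins for whatever power switch or
fencing device a real cluster would use.

```python
import time


class HeartbeatMonitor:
    """Track the last heartbeat per node and fence nodes that go silent."""

    def __init__(self, deadtime, fence, clock=time.monotonic):
        self.deadtime = deadtime   # seconds of silence before a node is declared dead
        self.fence = fence         # callback that powers the node off (STONITH)
        self.clock = clock         # injectable so the logic can be tested
        self.last_seen = {}
        self.fenced = set()

    def beat(self, node):
        """Record a heartbeat received from `node`."""
        self.last_seen[node] = self.clock()

    def check(self):
        """Fence every node whose last heartbeat is older than `deadtime`."""
        now = self.clock()
        for node, seen in self.last_seen.items():
            if node not in self.fenced and now - seen > self.deadtime:
                self.fence(node)
                self.fenced.add(node)
```

Fencing the silent node before taking over its resources means a machine
that was only badly delayed cannot wake up later and write to shared storage.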

So anyways, I sent Rik some email outlining the basics of what heartbeat
does and is intended to do, just to make sure he knows of the project.

Regards,

Brian

> >
> > ----- Forwarded message from Rik van Riel <riel@conectiva.com.br> -----
> >
> > Date:	Wed, 28 Feb 2001 12:40:44 -0300 (BRST)
> > From: Rik van Riel <riel@conectiva.com.br>
> > To: <linux-cluster@nl.linux.org>
> > Cc: <linux-kernel@vger.kernel.org>, <lwn@lwn.net>
> > Subject: [ANNOUNCE] linux-cluster list
> >
> > On special request, this message is re-sent with [ANNOUNCE] in
> > the subject and the non-announce parts removed.  ;)
> >
> > Feel free to pass this on to whomever you think might be interested.
> > ----
> > 	[on general clustering stuff]
> > On Tue, 27 Feb 2001, David L. Nicol wrote:
> > > Is there a good list to discuss this on?  Is this the list?
> > > Which pieces of clustering-scheme patches would be good to have?
> >
> > I know each of the cluster projects have mailing lists, but
> > I've never heard of a list where the different projects come
> > together to eg. find out which parts of the infrastructure
> > they could share, or ...
> >
> > Since I agree with you that we need such a place, I've just
> > created a mailing list:
> >
> > 	linux-cluster@nl.linux.org
> >
> > To subscribe to the list, send an email with the text
> > "subscribe linux-cluster" to:
> >
> > 	majordomo@nl.linux.org
> >
> >
> > I hope that we'll be able to split out some infrastructure
> > stuff from the different cluster projects and we'll be able
> > to put cluster support into the kernel in such a way that
> > we won't have to make the choice which of the N+1 cluster
> > projects should make it into the kernel...
> >
> > regards,
> >
> > Rik
> > --
> > Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml
> >
> > Virtual memory is like a game you can't win;
> > However, without VM there's truly nothing to lose...
> >
> > 		http://www.surriel.com/
> > http://www.conectiva.com/	http://distro.conectiva.com/
> >
> >
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> >
> > ----- End forwarded message -----
> >
> > ------------------------------------------------------------------------------
> > Linux HA Web Site:
> >   http://linux-ha.org/
> > Linux HA HOWTO:
> >   http://metalab.unc.edu/pub/Linux/ALPHA/linux-ha/High-Availability-HOWTO.html
> > ------------------------------------------------------------------------------
> >
> 
> -- 
> Pekka Savola                  "Tell me of difficulties surmounted,
> Netcore Oy                    not those you stumble over and fall"
> Systems. Networks. Security.   -- Robert Jordan: A Crown of Swords
> 
> _______________________________________________________
> Linux-HA-Dev: Linux-HA-Dev@lists.community.tummy.com
> http://lists.community.tummy.com/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.community.tummy.com
http://lists.community.tummy.com/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/

--LZvS9be/3tNcYl/X--


From owner-linux-cluster@nl.linux.org Thu Mar  1 23:30:04 2001
Received: by humbolt.nl.linux.org id <S92317AbRCAW3j>;
	Thu, 1 Mar 2001 23:29:39 +0100
Received: from hilbert.umkc.edu ([134.193.4.60]:59918 "HELO tesla.umkc.edu")
	by humbolt.nl.linux.org with SMTP id <S92312AbRCAW3M>;
	Thu, 1 Mar 2001 23:29:12 +0100
Received: (qmail 465228 invoked from network); 1 Mar 2001 22:28:09 -0000
Received: from nicol1.umkc.edu (HELO kasey.umkc.edu) (david@134.193.4.62)
  by hilbert.umkc.edu with SMTP; 1 Mar 2001 22:28:09 -0000
Message-ID: <3A9ECCF8.14414C29@kasey.umkc.edu>
Date:   Thu, 01 Mar 2001 16:28:08 -0600
From:   "David L. Nicol" <david@kasey.umkc.edu>
Organization: University of Missouri - Kansas City   supercomputing infrastructure
X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.0 i586)
X-Accept-Language: en
MIME-Version: 1.0
To:     "linux-cluster@nl.linux.org" <linux-cluster@nl.linux.org>
Subject: ETCP Project
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list



Another nominee for the first Clustering Tool to try to
get added into the main linux distribution might be a
set of patches to implement ETCP, based on
the internet draft written by Christian Huitema.

ETCP helps within migration clusters because by using it,
it becomes possible to have a network connection follow
a process, instead of tcp IO being relayed through the
original node.


http://www.chem.ucla.edu/~beichuan/etcp/



From owner-linux-cluster@nl.linux.org Thu Mar  1 23:47:24 2001
Received: by humbolt.nl.linux.org id <S92230AbRCAWrD>;
	Thu, 1 Mar 2001 23:47:03 +0100
Received: from usw-dsl-225.102.denco.rmi.net ([166.93.225.102]:24828 "EHLO
        laptop.linux-ha.org") by humbolt.nl.linux.org with ESMTP
	id <S92270AbRCAWqj>; Thu, 1 Mar 2001 23:46:39 +0100
Received: from unix.sh (localhost [127.0.0.1])
	by laptop.linux-ha.org (Postfix on SuSE Linux 7.0 (i386)) with ESMTP
	id 65786173B9; Thu,  1 Mar 2001 15:46:01 -0700 (MST)
Message-ID: <3A9ED127.26B129F1@unix.sh>
Date:   Thu, 01 Mar 2001 15:45:59 -0700
From:   Alan Robertson <alanr@unix.sh>
Organization: Linux-HA
X-Mailer: Mozilla 4.73 [en] (X11; I; Linux 2.2.16 i686)
X-Accept-Language: en
MIME-Version: 1.0
To:     "David L. Nicol" <david@kasey.umkc.edu>
Cc:     "linux-cluster@nl.linux.org" <linux-cluster@nl.linux.org>
Subject: Re: ETCP Project
References: <3A9ECCF8.14414C29@kasey.umkc.edu>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

"David L. Nicol" wrote:
> 
> Another nominee for the first Clustering Tool to try to
> get added into the main linux distribution might be a
> set of patches to implement ETCP, based on
> the internet draft written by Christian Huitema.
> 
> ETCP helps within migration clusters because by using it,
> it becomes possible to have a network connection follow
> a process, instead of tcp IO being relayed through the
> original node.

The hard problem isn't at the transport layer, but at the application layer
- in synchronizing application state.  For migration clusters, this is
*comparatively* easy, but for failover clusters this is typically very hard.

I've added a link to this page on the High-Availability Linux web site: 
http://linux-ha.org/

	-- Alan Robertson
	   alanr@unix.sh


From owner-linux-cluster@nl.linux.org Fri Mar  2 00:11:26 2001
Received: by humbolt.nl.linux.org id <S92224AbRCAXLA>;
	Fri, 2 Mar 2001 00:11:00 +0100
Received: from inet-smtp3.oracle.com ([205.227.43.23]:63876 "EHLO
        inet-smtp3.oracle.com") by humbolt.nl.linux.org with ESMTP
	id <S92230AbRCAXK2>; Fri, 2 Mar 2001 00:10:28 +0100
Received: from gmgw01.oraclecorp.com (gmgw01.us.oracle.com [130.35.61.190])
	by inet-smtp3.oracle.com (8.9.3/8.9.3) with ESMTP id PAA17792;
	Thu, 1 Mar 2001 15:10:24 -0800 (PST)
Received: from us.oracle.com (dbrower-sun.us.oracle.com [130.35.180.64])
	by gmgw01.oraclecorp.com (8.8.8+Sun/8.8.8) with ESMTP id PAA29526;
	Thu, 1 Mar 2001 15:10:23 -0800 (PST)
Message-ID: <3A9ED6DF.C359628@us.oracle.com>
Date:   Thu, 01 Mar 2001 15:10:23 -0800
From:   David Brower <dbrower@us.oracle.com>
Organization: Oracle Corporation
X-Mailer: Mozilla 4.7 [en] (X11; U; SunOS 5.6 sun4u)
X-Accept-Language: en
MIME-Version: 1.0
To:     Alan Robertson <alanr@unix.sh>
CC:     "David L. Nicol" <david@kasey.umkc.edu>,
        "linux-cluster@nl.linux.org" <linux-cluster@nl.linux.org>,
        linux-ha@muc.de
Subject: Re: ETCP Project & ha/hp overlap
References: <3A9ECCF8.14414C29@kasey.umkc.edu> <3A9ED127.26B129F1@unix.sh>
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

Alan Robertson wrote:

> > ETCP helps within migration clusters because by using it,
> > it becomes possible to have a network connection follow
> > a process, instead of tcp IO being relayed through the
> > original node.
> 
> The hard problem isn't at the transport layer, but at the application layer
> - in synchronizing application state.  For migration clusters, this is
> *comparatively* easy, but for failover clusters this is typically very hard.
> 
> I've added a link to this page on the High-Availability Linux web site:
> http://linux-ha.org/

I went and looked at etcp, and it is an example of the sort of
painful things that OS people need to do when apps don't have 
their own checkpoint restart story straight.  While it appears 
to be interesting, and may hold hope in the long term, I don't
think ETCP is interesting on the failover server side -- it is 
interesting for its original domain (mobile clients) and for
migration servers, which are the same thing in reverse.  In both those
cases, the application state remains constant on both sides
of the connection that got moved.  In failover-aware h/a,
that app state is usually lost, and this makes
the connection transparency moot for the most part.

This is a good example of the sort of thing I sent to some
people privately earlier today, below.

-dB

To: Lars Marowsky-Bree <lmb@suse.de>
CC: Chris Wright <chris@wirex.com>
Subject: Re: [riel@conectiva.com.br: [ANNOUNCE] linux-cluster list]

I think there are piles of overlap, but I also suspect there are
enough differences that there will be significantly different flavors.
AlanR pointed out some of the complexities of process migration,
as done in Mosix.  It seems to me that there are different levels
of checkpointing and restartability that will distinguish HP from
what I'll prefer to call "commercial" workload instead of HA.  Both
workloads need HP, and both need HA, but the tradeoffs on
migratability differ greatly.   In the non-scientific space, it is
easy to imagine application platforms (eg: apache mods, database
engines, java/ejb environments) that manage application state in
a way that supports failover w/o particular OS support.  This won't
keep the OS guys from trying to checkpoint processes and migrate
them on failure, but it won't be as efficient.  An app can know that
the state needing recovery is only 48k of the 80M virtual space, and
that there are things around to recover the communication state.  The
OS won't know this, and will need to stash the full 80M, and figure
out how to recover all the tcp connections.  That is what makes
true transparency at the OS level hard.
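
To illustrate the application-side approach, here is a minimal sketch. The
service, its fields, and the JSON encoding are all hypothetical. The
application checkpoints only the state it knows it must recover and leaves
everything it can rebuild out of the snapshot.

```python
import json


class OrderService:
    def __init__(self):
        self.pending_orders = {}   # state that must survive a failover
        self.cache = {}            # derived data, can be rebuilt after restart

    def checkpoint(self) -> bytes:
        """Serialize only the recoverable state, not the whole process image."""
        return json.dumps(self.pending_orders).encode("utf-8")

    @classmethod
    def restore(cls, blob: bytes):
        """Rebuild a service on another node from a checkpoint blob."""
        svc = cls()
        svc.pending_orders = json.loads(blob.decode("utf-8"))
        return svc


svc = OrderService()
svc.pending_orders["42"] = {"item": "disk", "qty": 2}
svc.cache = {str(i): "x" * 100 for i in range(1000)}   # large, but disposable
blob = svc.checkpoint()                                # small: the cache is left out
standby = OrderService.restore(blob)
```

An OS-level checkpoint cannot tell `pending_orders` from `cache`, so it
would have to save both.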

The OS people are right, though, for naive applications that don't
want to be written to be effectively restartable.  It may "only
be transparent to people who don't have a watch", in the words
of a colleague, if the OS has to snapshot whole processes frequently.

OTOH, if we ever find ourselves with far too much CPU and i/o
capacity, then OS checkpoint may be a good idea :)

cheers,
-dB

Lars Marowsky-Bree wrote:

> On 2001-02-28T16:45:13,
>    Chris Wright <chris@wirex.com> said:
>
> > Computational and high availability cluster's problem domains are not
> > 100% divergent.  Recall the roots of GFS for example.
>
> My personal prediction is that in the future, High Availability and High
> Performance clustering will merge completely, because anything else doesn't
> make sense at all.
>
> Sincerely,
>     Lars Marowsky-Brée <lmb@suse.de>


From owner-linux-cluster@nl.linux.org Fri Mar  2 00:40:17 2001
Received: by humbolt.nl.linux.org id <S92317AbRCAXj6>;
	Fri, 2 Mar 2001 00:39:58 +0100
Received: from gw.xkey.com ([206.86.100.52]:260 "EHLO happy.xkey.com")
	by humbolt.nl.linux.org with ESMTP id <S92224AbRCAXjc>;
	Fri, 2 Mar 2001 00:39:32 +0100
Received: (from smtp@localhost) by happy.xkey.com
	id PAA09120 for <linux-cluster@nl.linux.org>; Thu, 1 Mar 2001 15:39:26 -0800
Received: from hpti8.fsl.noaa.gov(137.75.132.228) by happy.xkey.com via smtp (V1.3)
	id sma009116; Thu Mar  1 15:39:21 2001
Received: (from lindahl@localhost)
	by localhost.hpti.com (8.11.0/8.11.0) id f21NdgP01729
	for linux-cluster@nl.linux.org; Thu, 1 Mar 2001 18:39:42 -0500
X-Authentication-Warning: localhost.hpti.com: lindahl set sender to lindahl@conservativecomputer.com using -f
Date:   Thu, 1 Mar 2001 18:39:42 -0500
From:   Greg Lindahl <lindahl@conservativecomputer.com>
To:     "linux-cluster@nl.linux.org" <linux-cluster@nl.linux.org>
Subject: Re: ETCP Project
Message-ID: <20010301183942.A1690@wumpus>
Mail-Followup-To: "linux-cluster@nl.linux.org" <linux-cluster@nl.linux.org>
References: <3A9ECCF8.14414C29@kasey.umkc.edu> <3A9ED127.26B129F1@unix.sh>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <3A9ED127.26B129F1@unix.sh>; from alanr@unix.sh on Thu, Mar 01, 2001 at 03:45:59PM -0700
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

On Thu, Mar 01, 2001 at 03:45:59PM -0700, Alan Robertson wrote:

> The hard problem isn't at the transport layer, but at the application layer
> - in synchronizing application state.  For migration clusters, this is
> *comparatively* easy, but for failover clusters this is typically very hard.

I'd agree. But the failover guys have better tools than I thought they
had. I have one thing in my clusters that I want to fail over: the
"master" node, which runs the queue system. The queue system has a
fairly small amount of state, so drbd+heartbeat/takeover looks like
it's good enough for my purposes. Neato.

By the way, I'm writing a piece of code that HA people might find
useful. It's called ForwardFS, and it's a filesystem which forwards
all _system calls_ to another system to get executed. Since the
forwarding is done on a system call basis, there is no caching or
weirdness related to using a block device, like a 1 byte write
followed by a flush causing 8k of traffic. If the node crashes, no
fsck is needed on the remote node. The minus is that there's no
caching, and a bunch of 1 byte writes cause separate network
transactions.

It wouldn't be hard to have system calls which read execute only on
the local node, and system calls which write get executed on the local
and remote nodes. Voila, it's an HA component.
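
A toy sketch of that read-local, write-both policy, assuming two directories
stand in for the local and remote nodes. This is not ForwardFS or PVFS code.
The class and method names are hypothetical, and a real implementation would
forward the write over the network instead of touching a second directory.

```python
import os


class MirroredStore:
    """Reads run locally; writes run on both the local and the remote copy."""

    def __init__(self, local_root, remote_root):
        self.local_root = local_root
        self.remote_root = remote_root   # stand-in for the remote node

    def write(self, path, data: bytes, offset=0):
        # Apply the same write call to both copies, one call at a time, no caching.
        for root in (self.local_root, self.remote_root):
            fd = os.open(os.path.join(root, path), os.O_WRONLY | os.O_CREAT, 0o644)
            try:
                os.lseek(fd, offset, os.SEEK_SET)
                os.write(fd, data)
            finally:
                os.close(fd)

    def read(self, path, size, offset=0):
        # Reads never leave the local node.
        fd = os.open(os.path.join(self.local_root, path), os.O_RDONLY)
        try:
            os.lseek(fd, offset, os.SEEK_SET)
            return os.read(fd, size)
        finally:
            os.close(fd)
```

The sketch ignores partial failure: if the second write fails, the two
copies diverge, and something has to notice and resynchronize them.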

Condor and Mosix and other migration clusters do this sort of thing
for most syscalls of individual processes, but not for just a part of
the filesystem.  I'm actually hacking up PVFS to write ForwardFS.

-- g



From owner-linux-cluster@nl.linux.org Fri Mar  2 00:55:06 2001
Received: by humbolt.nl.linux.org id <S92312AbRCAXyr>;
	Fri, 2 Mar 2001 00:54:47 +0100
Received: from saturn.cs.uml.edu ([129.63.8.2]:57610 "EHLO saturn.cs.uml.edu")
	by humbolt.nl.linux.org with ESMTP id <S92317AbRCAXyT>;
	Fri, 2 Mar 2001 00:54:19 +0100
Received: (from acahalan@localhost)
	by saturn.cs.uml.edu (8.11.0/8.11.2) id f21Ns5v279297;
	Thu, 1 Mar 2001 18:54:05 -0500 (EST)
From:   "Albert D. Cahalan" <acahalan@cs.uml.edu>
Message-Id: <200103012354.f21Ns5v279297@saturn.cs.uml.edu>
Subject: Re: cluster list
To:     riel@conectiva.com.br (Rik van Riel)
Date:   Thu, 1 Mar 2001 18:54:05 -0500 (EST)
Cc:     acahalan@cs.uml.edu (Albert D. Cahalan),
        jamagallon@able.es (J . A . Magallon),
        cermak@IMCS.rutgers.edu (Rob Cermak), linux-cluster@nl.linux.org
In-Reply-To: <Pine.LNX.4.33.0103011221510.1961-100000@duckman.distro.conectiva> from "Rik van Riel" at Mar 01, 2001 12:22:45 PM
X-Mailer: ELM [version 2.5 PL2]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

> How would these hardware differences affect eg. a lock
> manager or a global filesystem ?

I'd have to see the design of course, but...

On systems with reliable broadcast mail, you can do some sort of
a barrier operation across the whole cluster. It is trivial.
(meaning a system where one node can grab the whole interconnect)

On an Ethernet system, two nodes could initiate a broadcast at
the same time. With packet losses and clock uncertainty, one cannot
be sure what happened first.

On an Ethernet system, it is very important to group things together.
Packets have high overhead. On a large SMP or NUMA system, grouping
things together would increase latency for no good reason.
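
To put rough numbers on the packet overhead, here is a small worked example.
It assumes UDP over IPv4 with no IP options on standard Ethernet; the
16-byte message size and the batch of 64 are made up for illustration.

```python
# Fixed per-frame cost on the wire, in bytes:
# preamble + start delimiter (8), Ethernet header (14), FCS (4), inter-frame gap (12)
ETHERNET_OVERHEAD = 8 + 14 + 4 + 12
IP_UDP_HEADERS = 20 + 8       # IPv4 header without options, plus UDP header
MIN_FRAME_PAYLOAD = 46        # shorter Ethernet payloads are padded up to this


def wire_bytes(payload: int) -> int:
    """Bytes of wire time used by one UDP datagram carrying `payload` bytes."""
    return max(payload + IP_UDP_HEADERS, MIN_FRAME_PAYLOAD) + ETHERNET_OVERHEAD


separate = 64 * wire_bytes(16)    # 64 messages of 16 bytes, one frame each
batched = wire_bytes(64 * 16)     # the same 1024 bytes grouped into one frame
print(separate, batched)          # prints "5376 1090"
```

Under these assumptions, grouping cuts the wire cost roughly fivefold, at
the price of holding early messages until the batch is full.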

I think a global filesystem changes when any node or device can
write directly to any other node or device, without any queuing
on a network card. The node that wants to read data can reserve
space for it, and tell the sender what physical memory address
should be used.



From owner-linux-cluster@nl.linux.org Fri Mar  2 01:23:43 2001
Received: by humbolt.nl.linux.org id <S92224AbRCBAXQ>;
	Fri, 2 Mar 2001 01:23:16 +0100
Received: from jalon.able.es ([212.97.163.2]:44714 "EHLO jalon.able.es")
	by humbolt.nl.linux.org with ESMTP id <S92307AbRCBAWn>;
	Fri, 2 Mar 2001 01:22:43 +0100
Received: from correo.able.es ([212.97.169.185]) by
          jalon.able.es (Netscape Messaging Server 4.15) with SMTP id
          G9JNQG00.3MM; Fri, 2 Mar 2001 01:23:04 +0100 
Date:   Fri, 2 Mar 2001 01:22:13 +0100
From:   "J . A . Magallon" <jamagallon@able.es>
To:     =?ISO-8859-1?Q?Josef_H=F6=F6k?= <josef.hook@arrowhead.se>
Cc:     linux-cluster@nl.linux.org
Subject: Re: cluster list
Message-ID: <20010302012213.A1033@werewolf.able.es>
References: <200102281717.f1SHHYe54669@saturn.cs.uml.edu> <Pine.SOL.4.21.0102281332180.28777-100000@imcs.rutgers.edu> <20010228234630.A1256@werewolf.able.es> <3A9E0130.A5A7E93E@arrowhead.se>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
In-Reply-To: <3A9E0130.A5A7E93E@arrowhead.se>; from josef.hook@arrowhead.se on Thu, Mar 01, 2001 at 08:58:40 +0100
X-Mailer: Balsa 1.1.1
Content-Length: 848
Lines:  28
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list


On 03.01 Josef Höök wrote:
> 
> That surely reminds me of Plan9 structure.
> What about having it like this instead.
>     /cluster/node/0
>                            ctl
>                            data
>                            listen
>                            local
>                            remote
>                            status
> and in
> 
> /cluster/node
>                       0
>                       1
>                       clone  (adding a new  node to the system)
> 

I swear that the only things I know about Plan9 are its name and 9wm.
Perhaps I am missing out on a lot of fun with Plan9...

-- 
J.A. Magallon                                                      $> cd pub
mailto:jamagallon@able.es                                          $> more beer

Linux werewolf 2.4.2-ac6 #1 SMP Wed Feb 28 01:53:51 CET 2001 i686



From owner-linux-cluster@nl.linux.org Fri Mar  2 02:14:00 2001
Received: by humbolt.nl.linux.org id <S92317AbRCBBN3>;
	Fri, 2 Mar 2001 02:13:29 +0100
Received: from usw-dsl-225.102.denco.rmi.net ([166.93.225.102]:22269 "EHLO
        laptop.linux-ha.org") by humbolt.nl.linux.org with ESMTP
	id <S92307AbRCBBMl>; Fri, 2 Mar 2001 02:12:41 +0100
Received: from unix.sh (localhost [127.0.0.1])
	by laptop.linux-ha.org (Postfix on SuSE Linux 7.0 (i386)) with ESMTP
	id 9D0EC173E7; Thu,  1 Mar 2001 18:12:14 -0700 (MST)
Message-ID: <3A9EF36D.26867601@unix.sh>
Date:   Thu, 01 Mar 2001 18:12:13 -0700
From:   Alan Robertson <alanr@unix.sh>
Organization: Linux-HA
X-Mailer: Mozilla 4.73 [en] (X11; I; Linux 2.2.16 i686)
X-Accept-Language: en
MIME-Version: 1.0
To:     Greg Lindahl <lindahl@conservativecomputer.com>
Cc:     linux-cluster <linux-cluster@nl.linux.org>,
        hacqs@hacqs.community.tummy.com
Subject: Re: ETCP Project
References: <3A9ECCF8.14414C29@kasey.umkc.edu> <3A9ED127.26B129F1@unix.sh> <20010301183942.A1690@wumpus>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

Greg Lindahl wrote:
> 
> I'd agree. But the failover guys have better tools than I thought they
> had.

Us failover guys are glad you think so ;-)

> I have one thing in my clusters I want to make failover, that's
> the "master" node which runs the queue system. The queue system has a
> fairly small amount of state, so drbd+heartbeat/takeover looks like
> it's good enough for my purposes. Neato.

Is your queuing system open source software?

I've been wanting VERY MUCH to start an open source project to put together
an HA/HPC highly available job scheduling/queuing system.  But I didn't know
what tools were out there for cluster scheduling.

If your queuing software stores its queue on disk, in a robust fashion, then
it's easy to fail over using a shared disk or DRBD-type mirroring
arrangement.

For what it's worth, there is a (largely inactive) mailing list for this
project.  You can find it here:
	http://hacqs.community.tummy.com/mailman/listinfo/hacqs

HACQS stands for High-Availability Cluster Queueing System.

Greg:  I just subscribed you.  Hope that's OK... ;-)

One interesting thing that heartbeat does that a queueing system might like
to take advantage of is that each heartbeat packet has a certain amount of
data added to it automatically.  This is done in a fairly modular way, but
one of the things it currently adds automatically is the load average
information from /proc/loadavg.
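
A rough sketch of what attaching load average to a heartbeat message
could look like. The JSON packet layout and the function names are
assumptions for illustration only and are not claimed to match
heartbeat's wire format; only the /proc/loadavg field order is standard.

```python
import json


def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from a /proc/loadavg line."""
    fields = text.split()
    return tuple(float(x) for x in fields[:3])


def make_heartbeat(node, seq, loadavg_text):
    # Hypothetical packet layout, chosen only to keep the example readable.
    one, five, fifteen = parse_loadavg(loadavg_text)
    return json.dumps({"node": node, "seq": seq, "load": [one, five, fifteen]})


# On a live Linux node the text would come from open("/proc/loadavg").read().
sample = "0.20 0.18 0.12 1/80 11206"
packet = make_heartbeat("node1", 42, sample)
print(packet)
```

A queueing system receiving these packets could use the "load" field to
pick the least busy node without polling it separately.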

The API doesn't have a way for your application to retrieve that information
(today), but it could be easily added.  The API is really just now getting
usable, so changing it is no problem.

	-- Alan Robertson
	   alanr@unix.sh


From owner-linux-cluster@nl.linux.org Fri Mar  2 02:26:34 2001
Received: by humbolt.nl.linux.org id <S92321AbRCBB0C>;
	Fri, 2 Mar 2001 02:26:02 +0100
Received: from inet-smtp3.oracle.com ([205.227.43.23]:56028 "EHLO
        inet-smtp3.oracle.com") by humbolt.nl.linux.org with ESMTP
	id <S92312AbRCBBZa>; Fri, 2 Mar 2001 02:25:30 +0100
Received: from gmgw01.oraclecorp.com (gmgw01.us.oracle.com [130.35.61.190])
	by inet-smtp3.oracle.com (8.9.3/8.9.3) with ESMTP id RAA19815;
	Thu, 1 Mar 2001 17:25:24 -0800 (PST)
Received: from oracle.com ([152.68.53.78])
	by gmgw01.oraclecorp.com (8.8.8+Sun/8.8.8) with ESMTP id RAA06660;
	Thu, 1 Mar 2001 17:25:22 -0800 (PST)
Message-ID: <3A9EF590.E4AED8BE@oracle.com>
Date:   Thu, 01 Mar 2001 17:21:20 -0800
From:   David Brower <David.Brower@oracle.com>
Organization: Oracle
X-Mailer: Mozilla 4.7 [en] (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To:     Greg Lindahl <lindahl@conservativecomputer.com>
CC:     "linux-cluster@nl.linux.org" <linux-cluster@nl.linux.org>
Subject: Re: ETCP Project
References: <3A9ECCF8.14414C29@kasey.umkc.edu> <3A9ED127.26B129F1@unix.sh> <20010301183942.A1690@wumpus>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list



Greg Lindahl wrote:

>
> It wouldn't be hard to have system calls which read execute only on
> the local node, and system calls which write get executed on the local
> and remote node. Voila, it's a HA component.
>

I don't understand.  Read what execute only on the local system?
If it's by system call, then there's no local data to read locally.  I'm
confused.

thanks,
-dB



From owner-linux-cluster@nl.linux.org Fri Mar  2 02:34:01 2001
Received: by humbolt.nl.linux.org id <S92307AbRCBBda>;
	Fri, 2 Mar 2001 02:33:30 +0100
Received: from gw.xkey.com ([206.86.100.52]:25864 "EHLO happy.xkey.com")
	by humbolt.nl.linux.org with ESMTP id <S92230AbRCBBdN>;
	Fri, 2 Mar 2001 02:33:13 +0100
Received: (from smtp@localhost) by happy.xkey.com
	id RAA15025 for <linux-cluster@nl.linux.org>; Thu, 1 Mar 2001 17:33:10 -0800
Received: from hpti8.fsl.noaa.gov(137.75.132.228) by happy.xkey.com via smtp (V1.3)
	id sma015020; Thu Mar  1 17:33:04 2001
Received: (from lindahl@localhost)
	by localhost.hpti.com (8.11.0/8.11.0) id f221XTI02018
	for linux-cluster@nl.linux.org; Thu, 1 Mar 2001 20:33:29 -0500
X-Authentication-Warning: localhost.hpti.com: lindahl set sender to lindahl@conservativecomputer.com using -f
Date:   Thu, 1 Mar 2001 20:33:29 -0500
From:   Greg Lindahl <lindahl@conservativecomputer.com>
To:     "linux-cluster@nl.linux.org" <linux-cluster@nl.linux.org>
Subject: Re: ETCP Project
Message-ID: <20010301203329.A1984@wumpus>
Mail-Followup-To: "linux-cluster@nl.linux.org" <linux-cluster@nl.linux.org>
References: <3A9ECCF8.14414C29@kasey.umkc.edu> <3A9ED127.26B129F1@unix.sh> <20010301183942.A1690@wumpus> <3A9EF590.E4AED8BE@oracle.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <3A9EF590.E4AED8BE@oracle.com>; from David.Brower@oracle.com on Thu, Mar 01, 2001 at 05:21:20PM -0800
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

> > It wouldn't be hard to have system calls which read execute only on
> > the local node, and system calls which write get executed on the local
> > and remote node. Voila, it's a HA component.
> 
> I don't understand.  Read what execute only on the local system?
> If it's by system call, then there's no local data to read locally.  I'm
> confused.

I didn't explain it fully.

The basic ForwardFS is forwarding system calls to another system, so
it can have any filesystem on the remote system, and there is no local
data on disk.

The HigherAvailabilityMirrorFS I described is a filesystem that can
execute a given system call in 2 places: (1) against a local
underlying filesystem (of any kind), and (2) against a remote
underlying filesystem (of any kind). PVFS is actually forwarding the
call up to a user-level process. In order to keep the 2 underlying
filesystems synchronized, you need to do all writes against both, but
reads only need to go against 1. (Screw atime.)
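
A toy user-level sketch of the write-to-both, read-from-one rule. It
assumes two local directories stand in for the local and remote
underlying filesystems; the class and file names are hypothetical, and a
real version would sit in the VFS layer and forward the remote half over
the network.

```python
import os
import tempfile


class MirrorStore:
    """Toy mirror: writes go to both replicas, reads go to the first one."""

    def __init__(self, local_dir, remote_dir):
        # remote_dir stands in for a filesystem on another node.
        self.replicas = [local_dir, remote_dir]

    def write(self, name, data):
        # Every write is applied to both underlying filesystems.
        for root in self.replicas:
            with open(os.path.join(root, name), "wb") as f:
                f.write(data)

    def read(self, name):
        # Reads only touch the local replica.
        with open(os.path.join(self.replicas[0], name), "rb") as f:
            return f.read()


local = tempfile.mkdtemp()
remote = tempfile.mkdtemp()
store = MirrorStore(local, remote)
store.write("queue.dat", b"job 17\n")
print(store.read("queue.dat"))
```

The sketch ignores what happens when one of the two writes fails, which
is where the interesting HA questions start.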

-- g



From owner-linux-cluster@nl.linux.org Fri Mar  2 02:39:04 2001
Received: by humbolt.nl.linux.org id <S92317AbRCBBid>;
	Fri, 2 Mar 2001 02:38:33 +0100
Received: from usw-dsl-225.102.denco.rmi.net ([166.93.225.102]:42493 "EHLO
        laptop.linux-ha.org") by humbolt.nl.linux.org with ESMTP
	id <S92224AbRCBBiP>; Fri, 2 Mar 2001 02:38:15 +0100
Received: from unix.sh (localhost [127.0.0.1])
	by laptop.linux-ha.org (Postfix on SuSE Linux 7.0 (i386)) with ESMTP
	id 9D0D4175BA; Thu,  1 Mar 2001 18:37:49 -0700 (MST)
Message-ID: <3A9EF96C.ABEC0EA@unix.sh>
Date:   Thu, 01 Mar 2001 18:37:48 -0700
From:   Alan Robertson <alanr@unix.sh>
Organization: Linux-HA
X-Mailer: Mozilla 4.73 [en] (X11; I; Linux 2.2.16 i686)
X-Accept-Language: en
MIME-Version: 1.0
To:     Greg Lindahl <lindahl@conservativecomputer.com>
Cc:     "linux-cluster@nl.linux.org" <linux-cluster@nl.linux.org>
Subject: Re: ETCP Project
References: <3A9ECCF8.14414C29@kasey.umkc.edu> <3A9ED127.26B129F1@unix.sh> <20010301183942.A1690@wumpus> <3A9EF590.E4AED8BE@oracle.com> <20010301203329.A1984@wumpus>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

Greg Lindahl wrote:

> The HigherAvailabilityMirrorFS I described is a filesystem that can
> execute a given system call in 2 places: (1) against a local
> underlying filesystem (of any kind), and (2) against a remote
> underlying filesystem (of any kind). PVFS is actually forwarding the
> call up to a user-level process. In order to keep the 2 underlying
> filesystems synchronized, you need to do all writes against both, but
> reads only need to go against 1. (Screw atime.)

You might look into Intermezzo.  It does something very similar.  It is
referenced by the linux-ha home page http://linux-ha.org/

	-- Alan Robertson
	   alanr@unix.sh


From owner-linux-cluster@nl.linux.org Fri Mar  2 02:46:49 2001
Received: by humbolt.nl.linux.org id <S92317AbRCBBq3>;
	Fri, 2 Mar 2001 02:46:29 +0100
Received: from gw.xkey.com ([206.86.100.52]:57096 "EHLO happy.xkey.com")
	by humbolt.nl.linux.org with ESMTP id <S92312AbRCBBqL>;
	Fri, 2 Mar 2001 02:46:11 +0100
Received: (from smtp@localhost) by happy.xkey.com
	id RAA15750; Thu, 1 Mar 2001 17:46:02 -0800
Received: from hpti8.fsl.noaa.gov(137.75.132.228) by happy.xkey.com via smtp (V1.3)
	id sma015745; Thu Mar  1 17:45:58 2001
Received: (from lindahl@localhost)
	by localhost.hpti.com (8.11.0/8.11.0) id f221kOQ02066;
	Thu, 1 Mar 2001 20:46:24 -0500
X-Authentication-Warning: localhost.hpti.com: lindahl set sender to lindahl@conservativecomputer.com using -f
Date:   Thu, 1 Mar 2001 20:46:24 -0500
From:   Greg Lindahl <lindahl@conservativecomputer.com>
To:     linux-cluster <linux-cluster@nl.linux.org>
Cc:     Hacqs@hacqs.community.tummy.com
Subject: Re: [Hacqs] Re: ETCP Project
Message-ID: <20010301204624.B1984@wumpus>
Mail-Followup-To: linux-cluster <linux-cluster@nl.linux.org>,
	Hacqs@hacqs.community.tummy.com
References: <3A9ECCF8.14414C29@kasey.umkc.edu> <3A9ED127.26B129F1@unix.sh> <20010301183942.A1690@wumpus> <3A9EF36D.26867601@unix.sh>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <3A9EF36D.26867601@unix.sh>; from alanr@unix.sh on Thu, Mar 01, 2001 at 06:12:13PM -0700
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

> > I have one thing in my clusters I want to make failover, that's
> > the "master" node which runs the queue system. The queue system has a
> > fairly small amount of state, so drbd+heartbeat/takeover looks like
> > it's good enough for my purposes. Neato.
> 
> Is your queuing system open source software?

Yes, it's OpenPBS. It (and the commercial PBSPro) is careful to keep
all of its state in little files on disk and to fsync files
frequently. It's not guaranteed that things won't go wrong, but it
seems to be fairly unlikely. Maybe you could talk them into using the
new Berkeley DB stuff from Sleepycat.
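
For illustration, one common way to keep a small state file crash-safe
is write-to-temp, fsync, then rename. This is a generic sketch, not
OpenPBS code; the function name and the file name are made up.

```python
import os
import tempfile


def save_state(path, data):
    """Replace path with data so a crash leaves either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # force the file contents to disk
        os.replace(tmp, path)     # atomic rename over the old copy
    except BaseException:
        os.unlink(tmp)
        raise
    # fsync the directory so the rename itself reaches the disk
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


state_dir = tempfile.mkdtemp()
job_file = os.path.join(state_dir, "job.17")
save_state(job_file, b"queued\n")
save_state(job_file, b"running\n")
```

State written this way sits well on top of a shared disk or a mirrored
block device, since the takeover node only ever sees complete files.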

I would suspect that some of the other open source queue systems have
this same property: DQS, GNU Queue, and the allegedly-to-be-open-sourced
Codine/GRD/whatever it's named today from Sun.

Now do keep in mind that surviving a queue server crash is completely
different from having jobs that can survive a compute node crash... or
jobs that can be correctly restarted from the beginning or a
checkpoint...

-- greg


From owner-linux-cluster@nl.linux.org Fri Mar  2 02:48:20 2001
Received: by humbolt.nl.linux.org id <S92312AbRCBBr7>;
	Fri, 2 Mar 2001 02:47:59 +0100
Received: from gw.xkey.com ([206.86.100.52]:60936 "EHLO happy.xkey.com")
	by humbolt.nl.linux.org with ESMTP id <S92307AbRCBBre>;
	Fri, 2 Mar 2001 02:47:34 +0100
Received: (from smtp@localhost) by happy.xkey.com
	id RAA15769 for <linux-cluster@nl.linux.org>; Thu, 1 Mar 2001 17:47:32 -0800
Received: from hpti8.fsl.noaa.gov(137.75.132.228) by happy.xkey.com via smtp (V1.3)
	id sma015765; Thu Mar  1 17:47:31 2001
Received: (from lindahl@localhost)
	by localhost.hpti.com (8.11.0/8.11.0) id f221luO02071
	for linux-cluster@nl.linux.org; Thu, 1 Mar 2001 20:47:56 -0500
X-Authentication-Warning: localhost.hpti.com: lindahl set sender to lindahl@conservativecomputer.com using -f
Date:   Thu, 1 Mar 2001 20:47:56 -0500
From:   Greg Lindahl <lindahl@conservativecomputer.com>
To:     "linux-cluster@nl.linux.org" <linux-cluster@nl.linux.org>
Subject: Re: ETCP Project
Message-ID: <20010301204756.C1984@wumpus>
Mail-Followup-To: "linux-cluster@nl.linux.org" <linux-cluster@nl.linux.org>
References: <3A9ECCF8.14414C29@kasey.umkc.edu> <3A9ED127.26B129F1@unix.sh> <20010301183942.A1690@wumpus> <3A9EF590.E4AED8BE@oracle.com> <20010301203329.A1984@wumpus> <3A9EF96C.ABEC0EA@unix.sh>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <3A9EF96C.ABEC0EA@unix.sh>; from alanr@unix.sh on Thu, Mar 01, 2001 at 06:37:48PM -0700
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

> You might look into Intermezzo.  It does something very similar.  It is
> referenced by the linux-ha home page http://linux-ha.org/

Yes, I know Peter Braam. The HPC world is pretty small, you know.
Intermezzo has the same infrastructure that PVFS does, and in fact I
think the PVFS guys looked at it to build their kernel interface.

-- g



From owner-linux-cluster@nl.linux.org Fri Mar  2 03:03:34 2001
Received: by humbolt.nl.linux.org id <S92317AbRCBCDN>;
	Fri, 2 Mar 2001 03:03:13 +0100
Received: from inet-smtp4.oracle.com ([209.246.15.58]:26529 "EHLO
        inet-smtp4.oracle.com") by humbolt.nl.linux.org with ESMTP
	id <S92312AbRCBCCn>; Fri, 2 Mar 2001 03:02:43 +0100
Received: from gmgw01.oraclecorp.com (gmgw01.us.oracle.com [130.35.61.190])
	by inet-smtp4.oracle.com (8.9.3/8.9.3) with ESMTP id SAA08178;
	Thu, 1 Mar 2001 18:02:31 -0800 (PST)
Received: from oracle.com ([152.68.53.78])
	by gmgw01.oraclecorp.com (8.8.8+Sun/8.8.8) with ESMTP id SAA12421;
	Thu, 1 Mar 2001 18:02:30 -0800 (PST)
Message-ID: <3A9EFE42.BD838D90@oracle.com>
Date:   Thu, 01 Mar 2001 17:58:27 -0800
From:   David Brower <David.Brower@oracle.com>
Organization: Oracle
X-Mailer: Mozilla 4.7 [en] (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To:     Greg Lindahl <lindahl@conservativecomputer.com>
CC:     "linux-cluster@nl.linux.org" <linux-cluster@nl.linux.org>
Subject: Re: ETCP Project
References: <3A9ECCF8.14414C29@kasey.umkc.edu> <3A9ED127.26B129F1@unix.sh> <20010301183942.A1690@wumpus> <3A9EF590.E4AED8BE@oracle.com> <20010301203329.A1984@wumpus>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

There are huge cache consistency problems with this unless
you lock access for concurrent writers of the same data on
each node.    Who has -the- authoritative data if/when two
copies are different?  How do you resolve them?  You've
stumbled into a classic problem space-- synchronization
of replicas and resolution when they fall out of sync.

"The only thing interesting in h/a is what happens during and
after the failures."

For your consideration, the TruCluster CFS is a layer that resolves
the coherency issues, running on top of whatever filesystems
are available underneath.  If I understand correctly, it will
turn a mounted FAT partition into a correct CFS (but not
a very high performance one).

The way this sort of thing is usually done by CFSs that don't offer
symmetric access (GFS is symmetric) is that one node really
mounts the FS; a layer handles the concurrent access, and other
nodes do remote ops to the mounting node.  The mounting node
is made HA so that there is always one available.  Then,
the FS is built on logical volumes that mirror the storage in
different places -- such as LVM, using a local disk and maybe
a drbd-like device.  These layers then have problems at mirror
divergence time.  (The process usually involves picking a winning
side and "resilvering" the mirror with its contents -- bad if both
have changes that should be kept.)
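
A toy model of the resilvering step, assuming each mirror side is just a
dict mapping block number to contents. The names are hypothetical; the
point is only that the losing side's divergent blocks are overwritten.

```python
def resilver(winner, loser):
    """Overwrite loser with winner; return the block numbers that differed."""
    overwritten = sorted(b for b in loser if loser[b] != winner.get(b))
    loser.clear()
    loser.update(winner)
    return overwritten


side_a = {0: "superblock", 1: "A's change", 2: "old"}
side_b = {0: "superblock", 1: "old", 2: "B's change"}

# Pick side A as the winner: block 2 on side B held a change that is now gone.
overwritten = resilver(side_a, side_b)
print(overwritten)
```

Without a common base copy, the layer cannot even tell which of the
differing blocks held a real change and which were merely stale.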

What you are proposing is to create a similar layer that maps two
truly different file systems, and coordinates their access.  Without
one point of truth defining "the" buffer in question somehow, I don't
know how you are going to make it work consistently and correctly.

The big problem case to work out is:  processes on each node writing to
the end of a log file.  You must get all lines from all processes in
correct time order, without losing anything.  This is a highly contended
block with concurrent writes, followed by file extension.

cheers,
-dB

Greg Lindahl wrote:

> > > It wouldn't be hard to have system calls which read execute only on
> > > the local node, and system calls which write get executed on the local
> > > and remote node. Voila, it's a HA component.
> >
> > I don't understand.  Read what execute only on the local system?
> > If it's by system call, then there's no local data to read locally.  I'm
> > confused.
>
> I didn't explain it fully.
>
> The basic ForwardFS is forwarding system calls to another system, so
> it can have any filesystem on the remote system, and there is no local
> data on disk.
>
> The HigherAvailabilityMirrorFS I described is a filesystem that can
> execute a given system call in 2 places: (1) against a local
> underlying filesystem (of any kind), and (2) against a remote
> underlying filesystem (of any kind). PVFS is actually forwarding the
> call up to a user-level process. In order to keep the 2 underlying
> filesystems synchronized, you need to do all writes against both, but
> reads only need to go against 1. (Screw atime.)
>
> -- g
>



From owner-linux-cluster@nl.linux.org Fri Mar  2 03:06:56 2001
Received: by humbolt.nl.linux.org id <S92312AbRCBCGd>;
	Fri, 2 Mar 2001 03:06:33 +0100
Received: from lmg.ahnet.net ([207.150.192.13]:28707 "EHLO lmg02.affinity.com")
	by humbolt.nl.linux.org with ESMTP id <S92307AbRCBCGO>;
	Fri, 2 Mar 2001 03:06:14 +0100
Received: from notbilly.affinity.com ([207.150.192.49]) by lmg.ahnet.net with ESMTP id <398539-11294>; Thu, 1 Mar 2001 18:05:58 -0800
Date:   Thu, 1 Mar 2001 18:05:48 -0800 (PST)
From:	Andy Poling <andy@realbig.com>
X-Sender: andy@notbilly.affinity.com
To:	linux-cluster <linux-cluster@nl.linux.org>
Subject: Re: ETCP Project
In-Reply-To: <3A9EF36D.26867601@unix.sh>
Message-ID: <Pine.LNX.4.21.0103011801550.5002-100000@notbilly.affinity.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

Greg Lindahl sez:
> I'd agree. But the failover guys have better tools than I thought they
> had.

Alan Robertson sez:
> I've been wanting VERY MUCH to start an open source project to put together
> an HA/HPC highly available job scheduling/queuing system.  But I didn't know
> what tools were out there for cluster scheduling.

Well, I'd say these two statements are proof that this list was a good 
idea.  :-)

I subscribed because (as others have said) my needs (immediate and future)
also encompass HA- and performance- clustering.  It certainly is good to see
the two converging so conveniently...

-Andy



From owner-linux-cluster@nl.linux.org Fri Mar  2 03:15:20 2001
Received: by humbolt.nl.linux.org id <S92307AbRCBCOv>;
	Fri, 2 Mar 2001 03:14:51 +0100
Received: from gw.xkey.com ([206.86.100.52]:52489 "EHLO happy.xkey.com")
	by humbolt.nl.linux.org with ESMTP id <S92230AbRCBCOd>;
	Fri, 2 Mar 2001 03:14:33 +0100
Received: (from smtp@localhost) by happy.xkey.com
	id SAA16821 for <linux-cluster@nl.linux.org>; Thu, 1 Mar 2001 18:14:30 -0800
Received: from hpti8.fsl.noaa.gov(137.75.132.228) by happy.xkey.com via smtp (V1.3)
	id sma016789; Thu Mar  1 18:12:22 2001
Received: (from lindahl@localhost)
	by localhost.hpti.com (8.11.0/8.11.0) id f222ClQ02160
	for linux-cluster@nl.linux.org; Thu, 1 Mar 2001 21:12:47 -0500
X-Authentication-Warning: localhost.hpti.com: lindahl set sender to lindahl@conservativecomputer.com using -f
Date:   Thu, 1 Mar 2001 21:12:47 -0500
From:   Greg Lindahl <lindahl@conservativecomputer.com>
To:     "linux-cluster@nl.linux.org" <linux-cluster@nl.linux.org>
Subject: Re: ETCP Project
Message-ID: <20010301211247.B2115@wumpus>
Mail-Followup-To: "linux-cluster@nl.linux.org" <linux-cluster@nl.linux.org>
References: <3A9ECCF8.14414C29@kasey.umkc.edu> <3A9ED127.26B129F1@unix.sh> <20010301183942.A1690@wumpus> <3A9EF590.E4AED8BE@oracle.com> <20010301203329.A1984@wumpus> <3A9EFE42.BD838D90@oracle.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <3A9EFE42.BD838D90@oracle.com>; from David.Brower@oracle.com on Thu, Mar 01, 2001 at 05:58:27PM -0800
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

On Thu, Mar 01, 2001 at 05:58:27PM -0800, David Brower wrote:

> The big problem case to work out is:  processes on each node writing to
> the end of a log file.  You must get all lines from all processes in
> correct time order, without losing anything.  This is a highly contended
> block with concurrent writes, followed by file extension.

I'm afraid you've mistaken what I suggested for something which allows
concurrent access. It doesn't. One server is active, the other is
passive. It's exactly the same as drbd, only one level higher in the
OS: VFS layer instead of block device layer.

-- g



From owner-linux-cluster@nl.linux.org Fri Mar  2 03:23:39 2001
Received: by humbolt.nl.linux.org id <S92312AbRCBCXS>;
	Fri, 2 Mar 2001 03:23:18 +0100
Received: from inet-smtp4.oracle.com ([209.246.15.58]:20659 "EHLO
        inet-smtp4.oracle.com") by humbolt.nl.linux.org with ESMTP
	id <S92307AbRCBCWv>; Fri, 2 Mar 2001 03:22:51 +0100
Received: from gmgw01.oraclecorp.com (gmgw01.us.oracle.com [130.35.61.190])
	by inet-smtp4.oracle.com (8.9.3/8.9.3) with ESMTP id SAA15972;
	Thu, 1 Mar 2001 18:22:50 -0800 (PST)
Received: from oracle.com ([152.68.53.78])
	by gmgw01.oraclecorp.com (8.8.8+Sun/8.8.8) with ESMTP id SAA12572;
	Thu, 1 Mar 2001 18:22:48 -0800 (PST)
Message-ID: <3A9F0306.414B6A7B@oracle.com>
Date:   Thu, 01 Mar 2001 18:18:46 -0800
From:   David Brower <David.Brower@oracle.com>
Organization: Oracle
X-Mailer: Mozilla 4.7 [en] (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To:     Greg Lindahl <lindahl@conservativecomputer.com>
CC:     "linux-cluster@nl.linux.org" <linux-cluster@nl.linux.org>
Subject: Re: ETCP Project
References: <3A9ECCF8.14414C29@kasey.umkc.edu> <3A9ED127.26B129F1@unix.sh> <20010301183942.A1690@wumpus> <3A9EF590.E4AED8BE@oracle.com> <20010301203329.A1984@wumpus> <3A9EFE42.BD838D90@oracle.com> <20010301211247.B2115@wumpus>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

It's still dangerous, because you don't have anything forcing one
node to be passive, just convention and wishful thinking.  That is,
the 'mirror' FS is mounted and active on the second machine.  There's
nothing to keep it from writing.

If this is all you're doing, you might just as well use a
logical volume with a remote block device as the mirror without
the hassle of writing a whole new FS layer.  If you care about your
data integrity, the active first node is fsyncing data that it cares about,
and this flushes through both the local disk and the remote mirror.  If
it isn't fsyncing, remoting the write(2) call could have the curious semantic
of the data getting flushed to the disk on the remote node, but still
lying around the cache of the local node.  That could be confusing.

-dB

Greg Lindahl wrote:

> On Thu, Mar 01, 2001 at 05:58:27PM -0800, David Brower wrote:
>
> > The big problem case to work out is:  processes on each node writing to
> > the end of a log file.  You must get all lines from all processes in
> > correct time order, without losing anything.  This is a highly contended
> > block with concurrent writes, followed by file extension.
>
> I'm afraid you've mistaken what I suggested for something which allows
> concurrent access. It doesn't. One server is active, the other is
> passive. It's exactly the same as drbd, only one level higher in the
> OS: VFS layer instead of block device layer.
>
> -- g
>



From owner-linux-cluster@nl.linux.org Fri Mar  2 03:38:58 2001
Received: by humbolt.nl.linux.org id <S92317AbRCBCib>;
	Fri, 2 Mar 2001 03:38:31 +0100
Received: from gw.xkey.com ([206.86.100.52]:40970 "EHLO happy.xkey.com")
	by humbolt.nl.linux.org with ESMTP id <S92230AbRCBCiJ>;
	Fri, 2 Mar 2001 03:38:09 +0100
Received: (from smtp@localhost) by happy.xkey.com
	id SAA17655 for <linux-cluster@nl.linux.org>; Thu, 1 Mar 2001 18:38:06 -0800
Received: from hpti8.fsl.noaa.gov(137.75.132.228) by happy.xkey.com via smtp (V1.3)
	id sma017651; Thu Mar  1 18:38:04 2001
Received: (from lindahl@localhost)
	by localhost.hpti.com (8.11.0/8.11.0) id f222cTT02224
	for linux-cluster@nl.linux.org; Thu, 1 Mar 2001 21:38:29 -0500
X-Authentication-Warning: localhost.hpti.com: lindahl set sender to lindahl@conservativecomputer.com using -f
Date:   Thu, 1 Mar 2001 21:38:29 -0500
From:   Greg Lindahl <lindahl@conservativecomputer.com>
To:     "linux-cluster@nl.linux.org" <linux-cluster@nl.linux.org>
Subject: Re: ETCP Project
Message-ID: <20010301213829.A2220@wumpus>
Mail-Followup-To: "linux-cluster@nl.linux.org" <linux-cluster@nl.linux.org>
References: <3A9ECCF8.14414C29@kasey.umkc.edu> <3A9ED127.26B129F1@unix.sh> <20010301183942.A1690@wumpus> <3A9EF590.E4AED8BE@oracle.com> <20010301203329.A1984@wumpus> <3A9EFE42.BD838D90@oracle.com> <20010301211247.B2115@wumpus> <3A9F0306.414B6A7B@oracle.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <3A9F0306.414B6A7B@oracle.com>; from David.Brower@oracle.com on Thu, Mar 01, 2001 at 06:18:46PM -0800
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

On Thu, Mar 01, 2001 at 06:18:46PM -0800, David Brower wrote:

> It's still dangerous, because you don't have anything forcing one
> node to be passive, just convention and wishful thinking.  That is,
> the 'mirror' FS is mounted and active on the second machine.  There's
> nothing to keep it from writing.

What makes you think that? Yes, this is a concern, but there are
technical ways to enforce passivity. The underlying FS can sit on a
path that the user-level daemon opens; the daemon then chmods one of
the path elements so that no one else can go there.
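
A sketch of that open-then-chmod trick, assuming a Linux system where
file operations relative to an open directory descriptor are available.
The directory names are invented for the example, and root can of course
still walk the path.

```python
import os
import tempfile

base = tempfile.mkdtemp()
gate = os.path.join(base, "gate")
private = os.path.join(gate, "underlying")
os.makedirs(private)

# The daemon opens the underlying directory first...
dir_fd = os.open(private, os.O_RDONLY)
# ...then strips all permissions from a parent element of the path.
os.chmod(gate, 0)

# Path lookups through "gate" now fail for unprivileged processes, but the
# daemon can still create and list files relative to its open descriptor.
fd = os.open("state", os.O_WRONLY | os.O_CREAT, 0o600, dir_fd=dir_fd)
os.write(fd, b"hello\n")
os.close(fd)
names = os.listdir(dir_fd)
print(names)
```

This only keeps out processes that have to resolve the path; anything
that already holds a descriptor inside the tree is unaffected.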

> If this is all you're doing, you might just as well use a
> logical volume with a remote block device as the mirror without
> the hassle of writing a whole new FS layer.

My first posting listed some advantages and disadvantages of that
approach, namely that with a block device such as drbd you have to
fsck after a failure or use a journaled filesystem. A second
advantage/problem is that block devices may very well transfer more
data for certain access patterns (1 block minimum transfer); on the
other hand, they can use caching and the VFS layer goober can't.

> If you care about your data integrity, the active first node is
> fsyncing data that it cares about, and this flushes through both the
> local disk and the remote mirror.  If it isn't fsyncing, remoting
> the write(2) call could have the curious semantic of the data
> getting flushed to the disk on the remote node, but still lying
> around the cache of the local node.  That could be confusing.

My VFS layer can fsync on behalf of the program after any write. Of
course the semantics of filesystem metadata getting written to disk
are interesting, but I suspect I'd be doing journaling in my VFS layer
anyway. Other manifestations of PVFS (the single-metadata-controller
parallel filesystem manifestation) need journaling anyway.
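
As a user-level analogy for fsyncing on behalf of the program, here is a
small wrapper that forces every write to stable storage before it
returns. The class name is hypothetical and this is not PVFS code.

```python
import os
import tempfile


class SyncedFile:
    """Append-only file wrapper that fsyncs after every write."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

    def write(self, data):
        written = os.write(self.fd, data)
        os.fsync(self.fd)  # data is pushed to the disk before we return
        return written

    def close(self):
        os.close(self.fd)


log_path = os.path.join(tempfile.mkdtemp(), "log")
f = SyncedFile(log_path)
f.write(b"line 1\n")
f.write(b"line 2\n")
f.close()
```

The cost is one disk flush per write, which is tolerable for a queue
server with little state and painful for anything write-heavy.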

-- g


From owner-linux-cluster@nl.linux.org Fri Mar  2 03:52:46 2001
Received: by humbolt.nl.linux.org id <S92317AbRCBCw1>;
	Fri, 2 Mar 2001 03:52:27 +0100
Received: from c004-h006.c004.sfo.cp.net ([209.228.14.77]:52696 "HELO
        c004.sfo.cp.net") by humbolt.nl.linux.org with SMTP
	id <S92307AbRCBCwA>; Fri, 2 Mar 2001 03:52:00 +0100
Received: (cpmta 25755 invoked from network); 1 Mar 2001 18:51:48 -0800
Received: from adsl-151-203-49-173.bostma.adsl.bellatlantic.net (HELO jdarcy6986nk) (151.203.49.173)
  by smtp.namezero.com (209.228.14.77) with SMTP; 1 Mar 2001 18:51:48 -0800
X-Sent: 2 Mar 2001 02:51:48 GMT
Message-ID: <03ac01c0a2c3$4562f710$f57b9fa8@lss.emc.com>
From:   "Jeff Darcy" <jeff@tambreet.com>
To:     "Linux Cluster" <linux-cluster@nl.linux.org>
References: <3A9ECCF8.14414C29@kasey.umkc.edu> <3A9ED127.26B129F1@unix.sh> <20010301183942.A1690@wumpus> <3A9EF590.E4AED8BE@oracle.com> <20010301203329.A1984@wumpus> <3A9EFE42.BD838D90@oracle.com> <20010301211247.B2115@wumpus> <3A9F0306.414B6A7B@oracle.com> <425194.983500784731.JavaMail.root@lavender>
Subject: Re: ETCP Project
Date:   Thu, 1 Mar 2001 21:48:32 -0500
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 5.50.4133.2400
X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4133.2400
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

> > It's still dangerous, because you don't have anything forcing one
> > node to be passive, just convention and wishful thinking.  That is,
> > the 'mirror' FS is mounted and active on the second machine.  There's
> > nothing to keep it from writing.
>
> What makes you think that? Yes, this is a concern, but there are
> technical ways to ensure that. The underlying FS can sit on a path
> that the user-level daemon opens and then chmods one of the elements
> to something such that no one else can go there.

I don't think that will work.  A lot of metadata blocks are shared across
files with no other relationship.  If you're working on file A, that may
involve metadata blocks X, Y, and Z.  The only way you can be *sure* nobody
else is using X, Y, or Z would be to block out access to the entire
filesystem...unless, that is, you know an awful lot about the metadata
structure, and that's exactly what I thought you were trying to avoid.


Linux-cluster: generic cluster infrastructure for Linux
Archive:       http://mail.nl.linux.org/linux-cluster/

From owner-linux-cluster@nl.linux.org Fri Mar  2 10:49:46 2001
Received: by humbolt.nl.linux.org id <S92191AbRCBJtS>;
	Fri, 2 Mar 2001 10:49:18 +0100
Received: from anakin.xinit.se ([194.14.168.3]:60170 "EHLO anakin.xinit.se")
	by humbolt.nl.linux.org with ESMTP id <S92167AbRCBJs6>;
	Fri, 2 Mar 2001 10:48:58 +0100
Received: from arrowhead.se (unknown [195.22.75.66])
	by anakin.xinit.se (Postfix) with ESMTP
	id 824B52C04A; Fri,  2 Mar 2001 10:48:44 +0100 (CET)
Message-ID: <3A9F6C57.469B1375@arrowhead.se>
Date:   Fri, 02 Mar 2001 10:48:07 +0100
From:   Josef =?iso-8859-1?Q?H=F6=F6k?= <josef.hook@arrowhead.se>
X-Mailer: Mozilla 4.51 [sv] (WinNT; U)
X-Accept-Language: sv
MIME-Version: 1.0
To:     Greg Lindahl <lindahl@conservativecomputer.com>,
        linux-cluster@nl.linux.org
Subject: Re: ETCP Project
References: <3A9ECCF8.14414C29@kasey.umkc.edu> <3A9ED127.26B129F1@unix.sh> <20010301183942.A1690@wumpus> <3A9EF590.E4AED8BE@oracle.com> <20010301203329.A1984@wumpus>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list



Greg Lindahl wrote:

> > > It wouldn't be hard to have system calls which read execute only on
> > > the local node, and system calls which write get executed on the local
> > > and remote node. Voila, it's a HA component.
> >
> > I don't understand.  Read what execute only on the local system?
> > If it's by system call, then there's no local data to read locally.  I'm
> > confused.
>
> I didn't explain it fully.
>
> The basic ForwardFS is forwarding system calls to another system, so
> it can have any filesystem on the remote system, and there is no local
> data on disk.
>
> The HigherAvailabilityMirrorFS I described is a filesystem that can
> execute a given system call in 2 places: (1) against a local
> underlying filesystem (of any kind), and (2) against a remote
> underlying filesystem (of any kind). PVFS is actually forwarding the
> call up to a user-level process. In order to keep the 2 underlying
> filesystems synchronized, you need to do all writes against both, but
> reads only need to go against 1. (Screw atime.)
>
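
The write-both/read-one rule quoted above can be sketched in a few lines of
Python. This is only an illustration under loud assumptions: the class name
MirrorStore is hypothetical, both "underlying filesystems" are plain local
directories, and nothing here forwards calls to another node or handles a
failure of one replica.

```python
import os


class MirrorStore:
    """Toy mirror: writes go to every replica, reads use only the first."""

    def __init__(self, local_dir: str, remote_dir: str):
        # Stand-ins for the local and remote underlying filesystems.
        self.replicas = [local_dir, remote_dir]

    def write(self, name: str, data: bytes) -> None:
        # Every write is applied to both replicas to keep them in sync.
        for root in self.replicas:
            with open(os.path.join(root, name), "wb") as f:
                f.write(data)

    def read(self, name: str) -> bytes:
        # Reads only need to go against one replica.
        with open(os.path.join(self.replicas[0], name), "rb") as f:
            return f.read()
```

Pointing the two directories at different mount points gives a crude feel for
the behaviour; a real implementation would also have to decide what to do
when one of the two writes fails.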

Have you considered that you may be reinventing the wheel?
Take a look at the 9P/IL protocol for Plan 9, and also the kORBit project on
SourceForge.
/joh
--
C++ is just for people who's lacking a development organisation.



Linux-cluster: generic cluster infrastructure for Linux
Archive:       http://mail.nl.linux.org/linux-cluster/

From owner-linux-cluster@nl.linux.org Fri Mar  2 20:33:08 2001
Received: by humbolt.nl.linux.org id <S92230AbRCBTct>;
	Fri, 2 Mar 2001 20:32:49 +0100
Received: from gw.xkey.com ([206.86.100.52]:19205 "EHLO happy.xkey.com")
	by humbolt.nl.linux.org with ESMTP id <S92224AbRCBTcW>;
	Fri, 2 Mar 2001 20:32:22 +0100
Received: (from smtp@localhost) by happy.xkey.com
	id LAA15910 for <linux-cluster@nl.linux.org>; Fri, 2 Mar 2001 11:32:16 -0800
Received: from hpti8.fsl.noaa.gov(137.75.132.228) by happy.xkey.com via smtp (V1.3)
	id sma015897; Fri Mar  2 11:32:07 2001
Received: (from lindahl@localhost)
	by localhost.hpti.com (8.11.0/8.11.0) id f22JWOE01683
	for linux-cluster@nl.linux.org; Fri, 2 Mar 2001 14:32:24 -0500
X-Authentication-Warning: localhost.hpti.com: lindahl set sender to lindahl@conservativecomputer.com using -f
Date:   Fri, 2 Mar 2001 14:32:24 -0500
From:   Greg Lindahl <lindahl@conservativecomputer.com>
To:     linux-cluster@nl.linux.org
Subject: Re: ETCP Project
Message-ID: <20010302143224.A1595@wumpus>
Mail-Followup-To: linux-cluster@nl.linux.org
References: <3A9ECCF8.14414C29@kasey.umkc.edu> <3A9ED127.26B129F1@unix.sh> <20010301183942.A1690@wumpus> <3A9EF590.E4AED8BE@oracle.com> <20010301203329.A1984@wumpus> <3A9F6C57.469B1375@arrowhead.se>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
User-Agent: Mutt/1.2.5i
In-Reply-To: <3A9F6C57.469B1375@arrowhead.se>; from josef.hook@arrowhead.se on Fri, Mar 02, 2001 at 10:48:07AM +0100
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

On Fri, Mar 02, 2001 at 10:48:07AM +0100, Josef Höök wrote:

> Have you considered that you may be reinventing the wheel?
> Take a look at the 9P/IL protocol for Plan 9, and also the kORBit project on
> SourceForge.

I'm sure there are similar things. However, I didn't tell you why I'm
getting paid to write one piece, which I'm sure isn't reinventing the
wheel. I don't really care what the protocol is, that's not interesting.

-- g

Linux-cluster: generic cluster infrastructure for Linux
Archive:       http://mail.nl.linux.org/linux-cluster/

From owner-linux-cluster@nl.linux.org Sun Mar  4 04:48:07 2001
Received: by humbolt.nl.linux.org id <S92181AbRCDDrs>;
	Sun, 4 Mar 2001 04:47:48 +0100
Received: from gw.xkey.com ([206.86.100.52]:40198 "EHLO happy.xkey.com")
	by humbolt.nl.linux.org with ESMTP id <S92178AbRCDDrZ>;
	Sun, 4 Mar 2001 04:47:25 +0100
Received: (from smtp@localhost) by happy.xkey.com
	id TAA09378 for <linux-cluster@nl.linux.org>; Sat, 3 Mar 2001 19:47:17 -0800
Received: from user-2ivej21.dialup.mindspring.com(165.247.76.65) by happy.xkey.com via smtp (V1.3)
	id sma009373; Sat Mar  3 19:47:13 2001
Received: (from lindahl@localhost)
	by localhost.hpti.com (8.11.0/8.11.0) id f243lYF02222
	for linux-cluster@nl.linux.org; Sat, 3 Mar 2001 22:47:34 -0500
X-Authentication-Warning: localhost.hpti.com: lindahl set sender to lindahl@conservativecomputer.com using -f
Date:   Sat, 3 Mar 2001 22:47:34 -0500
From:   Greg Lindahl <lindahl@conservativecomputer.com>
To:     Linux Cluster <linux-cluster@nl.linux.org>
Subject: running out of low ports
Message-ID: <20010303224734.A2175@wumpus>
Mail-Followup-To: Linux Cluster <linux-cluster@nl.linux.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

One thing I'd like in the kernel is a way to avoid needlessly running
out of low ports. If you want a low port, you have to loop calling
bind() to get the port before you can call connect(). As a result you
can never re-use a port that's in the TIME_WAIT state -- the kernel
doesn't know where you're headed, so it can't ensure that you won't
get a port that was talking to that same remote IP.

Because of this limit you can only make outgoing low-port connections
at a rate of 1024/TIME_WAIT_TIME or less. Which is pretty slow.
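
For readers who have not written this loop, a rough Python sketch of the
bind-before-connect pattern and of the resulting rate limit follows. The
function names are hypothetical, the candidate port list is a parameter so
the code can be tried without root, and the 60-second TIME_WAIT used in the
note below is an assumption (it is the value Linux uses; other systems
differ).

```python
import errno
import socket


def bind_first_free(sock, host, ports):
    """Try each candidate port in turn; return the port actually bound."""
    for port in ports:
        try:
            sock.bind((host, port))
            return sock.getsockname()[1]
        except OSError as e:
            # EADDRINUSE: the port is taken (possibly only by a TIME_WAIT entry).
            # EACCES: not privileged to bind a port below 1024.
            if e.errno not in (errno.EADDRINUSE, errno.EACCES):
                raise
    raise OSError(errno.EADDRINUSE, "no free port among the candidates")


def max_connect_rate(num_ports, time_wait_seconds):
    """Upper bound on new outgoing connections per second from num_ports ports."""
    return num_ports / time_wait_seconds
```

A privileged client would call bind_first_free(sock, "", range(1023, 0, -1))
and then connect(). With the figures above, max_connect_rate(1024, 60) comes
out at roughly 17 connections per second.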

-- g


Linux-cluster: generic cluster infrastructure for Linux
Archive:       http://mail.nl.linux.org/linux-cluster/

From owner-linux-cluster@nl.linux.org Sun Mar  4 11:57:06 2001
Received: by humbolt.nl.linux.org id <S92234AbRCDK4o>;
	Sun, 4 Mar 2001 11:56:44 +0100
Received: from swan.en-bio.COM.AU ([203.35.254.3]:31016 "EHLO
        swan.au.en-bio.com") by humbolt.nl.linux.org with ESMTP
	id <S92225AbRCDK41>; Sun, 4 Mar 2001 11:56:27 +0100
Received: from entigen.com (slip-32-103-30-22.il.us.prserv.net [32.103.30.22])
	by swan.au.en-bio.com (8.9.1a/8.9.1) with ESMTP id VAA05728
	for <linux-cluster@nl.linux.org>; Sun, 4 Mar 2001 21:56:04 +1100
Message-ID: <3AA21FF0.F723833E@entigen.com>
Date:   Sun, 04 Mar 2001 02:58:56 -0800
From:   SunnyvaleApartment #2 <phantom@entigen.com>
X-Mailer: Mozilla 4.76 [en] (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To:     linux-cluster@nl.linux.org
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

majordomo@nl.linux.org


Linux-cluster: generic cluster infrastructure for Linux
Archive:       http://mail.nl.linux.org/linux-cluster/

From owner-linux-cluster@nl.linux.org Mon Mar  5 20:43:18 2001
Received: by humbolt.nl.linux.org id <S92245AbRCETmk>;
	Mon, 5 Mar 2001 20:42:40 +0100
Received: from brutus.conectiva.com.br ([200.250.58.146]:28666 "EHLO
        brutus.conectiva.com.br") by humbolt.nl.linux.org with ESMTP
	id <S92193AbRCETmK>; Mon, 5 Mar 2001 20:42:10 +0100
Received: from localhost (riel@localhost)
	by brutus.conectiva.com.br (8.11.2/8.11.2) with ESMTP id f25JgSn07732
	for <linux-cluster@nl.linux.org>; Mon, 5 Mar 2001 16:42:28 -0300
X-Authentication-Warning: duckman.distro.conectiva: riel owned process doing -bs
Date:   Mon, 5 Mar 2001 16:42:28 -0300 (BRST)
From:   Rik van Riel <riel@conectiva.com.br>
X-X-Sender:  <riel@duckman.distro.conectiva>
To:     <linux-cluster@nl.linux.org>
Subject: Re: your mail
In-Reply-To: <3AA21FF0.F723833E@entigen.com>
Message-ID: <Pine.LNX.4.33.0103051641170.1409-100000@duckman.distro.conectiva>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

On Sun, 4 Mar 2001, SunnyvaleApartment #2 wrote:

> majordomo@nl.linux.org

*sigh*

I've contacted the guy offline and have added yet another
regexp to my majordomo.cf to make sure this won't be happening
again ...

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


Linux-cluster: generic cluster infrastructure for Linux
Archive:       http://mail.nl.linux.org/linux-cluster/

From owner-linux-cluster@nl.linux.org Wed Mar 14 21:54:24 2001
Received: by humbolt.nl.linux.org id <S92471AbRCNUxv>;
	Wed, 14 Mar 2001 21:53:51 +0100
Received: from hilbert.umkc.edu ([134.193.4.60]:19461 "HELO tesla.umkc.edu")
	by humbolt.nl.linux.org with SMTP id <S92468AbRCNUxP>;
	Wed, 14 Mar 2001 21:53:15 +0100
Received: (qmail 28569 invoked from network); 14 Mar 2001 20:52:06 -0000
Received: from nicol1.umkc.edu (HELO kasey.umkc.edu) (david@134.193.4.62)
  by hilbert.umkc.edu with SMTP; 14 Mar 2001 20:52:06 -0000
Message-ID: <3AAFD9F6.CDC26C1F@kasey.umkc.edu>
Date:   Wed, 14 Mar 2001 20:52:06 +0000
From:   "David L. Nicol" <david@kasey.umkc.edu>
Organization: University of Missouri - Kansas City   supercomputing infrastructure
X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.4.0 i586)
X-Accept-Language: en
MIME-Version: 1.0
To:     "linux-cluster@nl.linux.org" <linux-cluster@nl.linux.org>
Subject: O'Reilly Seeks Participants for 2nd P2P Conference
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

CALL FOR PARTICIPATION
The O'Reilly Peer-to-Peer Conference
Omni Shoreham Hotel, Washington, DC
September 17-20, 2001

PROPOSALS DUE: April 2, 2001

OVERVIEW

O'Reilly & Associates is pleased to announce its second Peer-to-Peer
and Web Services conference, an event exploring the technical,
business, and legal dimensions of the fast-growing Peer-to-Peer and Web
Services spaces.

Individuals and companies interested in making presentations, giving a
tutorial, or participating in panel discussions are invited to submit
proposals.

SUBJECT MATTER

Because the Peer-to-Peer and Web Services spaces are still relatively
unformed, we're casting the net widely. Any innovative application that
harnesses the power of distributed computers, users, services, or
devices, and the technical, business, or legal issues raised by such
applications, are appropriate subjects for this conference.

While the conference will consist of various tracks informed by the
subject matter of the submissions, presentations are expected to lean
more toward the technical or business/legal side. Technical
presentations should be of interest to developers and administrators of
Internet applications and infrastructure. Business/legal focused
presentations should appeal to entrepreneurs, venture capitalists,
technical strategists, lawmakers and law-breakers.

PROPOSALS

Proposed talks should be 20, 30, or 60 minutes long. If you are
interested in participating in or moderating panel discussions or
otherwise contributing to the conference, please do make this known
along with your preferred technical or business slant. If you have an
idea for a particularly provocative group of panelists that you'd love
to see square off, feel free to send in your suggestions.

LIGHTNING TALKS

Lightning talks give you a whirlwind tour of companies, projects (both
completed and not), research, experiments, and interesting ideas in the
Peer-to-Peer and Web Services spaces. Each Lightning Talk session gives
a dozen presenters an opportunity to give a 5-minute elevator pitch.

There are three Lightning Talk tracks:

- Technical
- Business
- "Wobbly Bits"

The last track is a space for unfinished, unpolished, possibly
abandoned, in-need-of-help, and other "wobbly" projects.  Presenters
should talk about what they've learned, what they've solved or
overcome, ongoing issues and tribulations, the current state of their
project, what bits are needed, and so on.

Sessions will be wrapped up with a panel discussion.

Presentations should be informative, creative, and/or entertaining.

DETAILS

For further information, topic examples, and proposal details and
instructions, please visit

http://conferences.oreilly.com/p2p/call_fall.html.

If you have any questions, feel free to send email to
p2pconf@oreilly.com.


Linux-cluster: generic cluster infrastructure for Linux
Archive:       http://mail.nl.linux.org/linux-cluster/

From owner-linux-cluster@nl.linux.org Sat Mar 17 02:09:04 2001
Received: by humbolt.nl.linux.org id <S92254AbRCQBIX>;
	Sat, 17 Mar 2001 02:08:23 +0100
Received: from hilbert.umkc.edu ([134.193.4.60]:38663 "HELO tesla.umkc.edu")
	by humbolt.nl.linux.org with SMTP id <S92231AbRCQBHz>;
	Sat, 17 Mar 2001 02:07:55 +0100
Received: (qmail 55199 invoked from network); 17 Mar 2001 01:06:43 -0000
Received: from nicol1.umkc.edu (HELO kasey.umkc.edu) (david@134.193.4.62)
  by hilbert.umkc.edu with SMTP; 17 Mar 2001 01:06:43 -0000
Message-ID: <3AB2B8A3.13385A04@kasey.umkc.edu>
Date:   Fri, 16 Mar 2001 19:06:43 -0600
From:   "David L. Nicol" <david@kasey.umkc.edu>
Organization: University of Missouri - Kansas City   supercomputing infrastructure
X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.0 i586)
X-Accept-Language: en
MIME-Version: 1.0
To:     "linux-cluster@nl.linux.org" <linux-cluster@nl.linux.org>
Subject: available resource declaration language(s)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list


Various resources can be shared: some are architecture-dependent
(idle CPU), some are architecture-independent (extra memory for a
network-speed swap device, available hard disk for a balanced SAN), and
some, such as PVM-based distribution, are not commoditizable outside of
their own terms. (PVM requires that the worker nodes have functioning C
compilers, for compiling the pieces that will run on themselves, once.)

Could the way the clustered machines find out about each other be
standardized?

Mosix uses a peer-to-peer architecture in which each node periodically
queries a peer selected at random from its list of peers; what architectures
do other projects use?
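
As a starting point for comparing discovery methods, the random-peer query
can be simulated in a few lines of Python. This is a toy model, not the
Mosix algorithm: the function name is hypothetical, every node is assumed to
query exactly one uniformly chosen peer per round and to copy everything that
peer knew at the start of the round, and no messages are lost.

```python
import random


def rounds_until_known(n_nodes, seed=0, max_rounds=10_000):
    """Return how many rounds pass before every node knows about all nodes."""
    if n_nodes < 2:
        return 0
    rng = random.Random(seed)
    # known[i] is the set of node ids whose state node i has learned so far.
    known = [{i} for i in range(n_nodes)]
    for round_no in range(1, max_rounds + 1):
        # All queries in a round see the state from the end of the last round.
        snapshot = [set(k) for k in known]
        for i in range(n_nodes):
            peer = rng.randrange(n_nodes - 1)
            if peer >= i:
                peer += 1  # never query ourselves
            known[i] |= snapshot[peer]
        if all(len(k) == n_nodes for k in known):
            return round_no
    raise RuntimeError("did not converge within max_rounds")
```

Running it for a range of cluster sizes and seeds gives a feel for how the
number of rounds grows with the number of nodes; a ring or broadcast model
could be written against the same interface for comparison.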

Has anyone done any serious simulations of the efficiency of various discovery
methods? For instance, it is easy to imagine a virtual ring architecture in
which each node shares everything it knows about all other nodes in a larger
packet that is sent around the ring, and a node can only initiate a resource
request when it holds the token; or broadcast-based architectures in which a
node advertises its surplus resources with a periodic broadcast packet, and
nodes wishing to use a resource then begin a negotiation.

Thoughts?  Pointers to masters' theses?








-- 
                      David Nicol 816.235.1187 dnicol@cstp.umkc.edu
If God had meant us to compute securely, He'd have given
us more prime numbers! -- Casey Schaufler


Linux-cluster: generic cluster infrastructure for Linux
Archive:       http://mail.nl.linux.org/linux-cluster/

From owner-linux-cluster@nl.linux.org Sat Mar 17 12:21:56 2001
Received: by humbolt.nl.linux.org id <S92263AbRCQLVN>;
	Sat, 17 Mar 2001 12:21:13 +0100
Received: from 197-MADR-X13.libre.retevision.es ([62.83.2.197]:21486 "EHLO
        carlos.mosix.net") by humbolt.nl.linux.org with ESMTP
	id <S92258AbRCQLUp>; Sat, 17 Mar 2001 12:20:45 +0100
Received: from carlos by carlos.mosix.net with local (Exim 3.12 #1 (Debian))
	id 14eFdM-0000Po-00
	for <linux-cluster@nl.linux.org>; Sat, 17 Mar 2001 13:16:36 +0100
Subject: Re: available resource declaration language(s)
From:   carlos manzanedo <cvsrep@wanadoo.es>
To:     linux-cluster@nl.linux.org
In-Reply-To: <3AB2B8A3.13385A04@kasey.umkc.edu>
Content-Type: text/plain
X-Mailer: Evolution 0.5.1 (Developer Preview)
Date:   17 Mar 2001 11:16:36 -0100
Mime-Version: 1.0
Message-Id: <E14eFdM-0000Po-00@carlos.mosix.net>
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

As you said on the kernel list, I also think we should discuss how to
standardize the way that clusters affect the kernel and how nodes find
each other. But I don't see any such discussion on the cluster list that
Rick created.
Maybe people from the Mosix team could contribute to the discussion and
start by explaining what they think about how to standardize the different
clusters in the kernel.


> 
> Various resources that can be shared; some are architecture-dependent
> (idle CPU) and some are independent (extra memory for network-speed 
> swap device, available hard disk for balanced SAN) and some, such as
> PVM-based distribution, are not commoditizable outside of their own terms.
> (PVM requires the worker nodes have functioning C compilers, for compiling
> the pieces that will run on themselves, once.)
> 
> Could the way the clustered machines find out about each other be
> standardized?
> 
> Mosix uses a peer-to-peer architecture in which each node periodically
> queries a peer selected at random from its list of peers; what architectures
> do other projects use?
> 
> Has anyone done any serious simulations of the efficiency of various discovery
> methods? For instance, it is easy to imagine a virtual ring architecture in
> which each node shares everything it knows about all other nodes in a larger
> packet which is sent around the ring and a node can only initiate a resource
> request when it has the token, for instance; or broadcast-based architectures
> in which a node advertises its surplus resources with a periodic broadcast packet,
> and nodes wishing to use the resource would begin a negotiation.
> 
> Thoughts?  Pointers to masters' theses?

Linux-cluster: generic cluster infrastructure for Linux
Archive:       http://mail.nl.linux.org/linux-cluster/

From owner-linux-cluster@nl.linux.org Sat Mar 17 13:26:13 2001
Received: by humbolt.nl.linux.org id <S92261AbRCQMZi>;
	Sat, 17 Mar 2001 13:25:38 +0100
Received: from usw-dsl-225.102.denco.rmi.net ([166.93.225.102]:33010 "EHLO
        laptop.linux-ha.org") by humbolt.nl.linux.org with ESMTP
	id <S92254AbRCQMZK>; Sat, 17 Mar 2001 13:25:10 +0100
Received: from unix.sh (unknown [127.0.0.1])
	by laptop.linux-ha.org (Postfix on SuSE Linux 7.0 (i386)) with ESMTP
	id CC03217984; Sat, 17 Mar 2001 05:24:52 -0700 (MST)
Message-ID: <3AB35794.77389DB8@unix.sh>
Date:   Sat, 17 Mar 2001 05:24:52 -0700
From:   Alan Robertson <alanr@unix.sh>
Organization: Linux-HA
X-Mailer: Mozilla 4.73 [en] (X11; I; Linux 2.2.16 i686)
X-Accept-Language: en
MIME-Version: 1.0
To:     "David L. Nicol" <david@kasey.umkc.edu>
Cc:     linux-cluster <linux-cluster@nl.linux.org>
Subject: Re: available resource declaration language(s)
References: <3AB2B8A3.13385A04@kasey.umkc.edu>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

"David L. Nicol" wrote:
> 
> Various resources that can be shared; some are architecture-dependent
> (idle CPU) and some are independent (extra memory for network-speed
> swap device, available hard disk for balanced SAN) and some, such as
> PVM-based distribution, are not commoditizable outside of their own terms.
> (PVM requires the worker nodes have functioning C compilers, for compiling
> the pieces that will run on themselves, once.)
> 
> Could the way the clustered machines find out about each other be
> standardized?

There are several meanings to this statement.  I had one initial
interpretation of it the first time I saw it, and now, after rereading your
article, I have another one.

In high-availability clusters, the emphasis is on longer-term (more static)
knowledge about the systems and their resources.  This kind of knowledge has
a lifetime of days, months or years.  The term "resource" in an HA cluster
typically has this meaning.  A resource might be an IP address, or a web
server, or a NIC.

For load balancing purposes (which appears to be the purpose of this query),
one needs more dynamic information.
 
> Mosix uses a peer-to-peer architecture in which each node periodically
> queries a peer selected at random from its list of peers; what architectures
> do other projects use?

[I'll offer an answer to this below]

> Has anyone done any serious simulations of the efficiency of various discovery
> methods? For instance, it is easy to imagine a virtual ring architecture in
> which each node shares everything it knows about all other nodes in a larger
> packet which is sent around the ring and a node can only initiate a resource
> request when it has the token, for instance; or broadcast-based architectures
> in which a node advertises its surplus resources with a periodic broadcast packet,
> and nodes wishing to use the resource would begin a negotiation.

I would suggest that one should try to be agnostic as to "how" this
information is collected/propagated in the cluster, and provide programs
that need this kind of information an API which would allow one to obtain
the information from the cluster in a uniform manner regardless of how it is
collected.  This API is much more important than any particular mechanism
which supports it.

Then one would not be wed to a particular architecture, but could in fact
use one of several different methods depending on other factors, the results
of current research, etc without making life hard on customers.  I'll get up
on my soap box at a later date on this score.

Now, having made an argument for having an implementation-neutral API for
accessing the information...  Here's what my code actually does...

Heartbeat (my low-level cluster membership/communication layer) sends
multicast keep-alive (heartbeat) packets every second or so.  These packets
are ASCII name/value pairs.  One of the values sent in every packet is the
content of /proc/loadavg.

The scheme is quite flexible, and one could add other information to each
packet quite easily.  These heartbeat packets are currently a bit larger
than 150 bytes each, including this information and the digital signature. 
Making them 250 bytes each would not be a significant extra burden - even in
a large cluster.
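
To illustrate the ASCII name/value idea, here is a small Python sketch. It
is NOT Heartbeat's actual wire format: the field names (type, node, seq,
load), the newline-separated name=value layout and the function names are
all assumptions made for the example, and the digital signature is left out.

```python
def read_loadavg(path="/proc/loadavg"):
    """Return the contents of /proc/loadavg as one line (Linux only)."""
    with open(path) as f:
        return f.read().strip()


def build_heartbeat(node, seq, loadavg):
    """Encode one heartbeat as newline-separated ASCII name=value pairs."""
    # Field names are hypothetical, chosen only for this illustration.
    fields = {"type": "hb", "node": node, "seq": str(seq), "load": loadavg}
    return "".join(f"{name}={value}\n" for name, value in fields.items()).encode("ascii")


def parse_heartbeat(packet):
    """Decode a packet produced by build_heartbeat back into a dict."""
    pairs = {}
    for line in packet.decode("ascii").splitlines():
        name, _, value = line.partition("=")
        pairs[name] = value
    return pairs
```

Adding another value, such as free memory, is one more entry in the fields
dict, which is the kind of flexibility described above.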

There is a paper on heartbeat design here:
	http://linux-ha.org/comm/HBdesign.pdf

and a talk on its APIs here:
	http://linux-ha.org/heartbeat/LWCE-NYC-2001/

The heartbeat cluster membership/communications layer is not specific to
high-availability clusters, but can in fact serve for other types of
clusters as well.  This is why I send this information around, even though
it isn't of any particular use to a straight failover cluster.

	-- Alan Robertson
	   alanr@unix.sh

Linux-cluster: generic cluster infrastructure for Linux
Archive:       http://mail.nl.linux.org/linux-cluster/

From owner-linux-cluster@nl.linux.org Sat Mar 17 13:32:48 2001
Received: by humbolt.nl.linux.org id <S92257AbRCQMcK>;
	Sat, 17 Mar 2001 13:32:10 +0100
Received: from imladris.infradead.org ([194.205.184.45]:54285 "EHLO
        infradead.org") by humbolt.nl.linux.org with ESMTP
	id <S92256AbRCQMbk>; Sat, 17 Mar 2001 13:31:40 +0100
Received: from [200.206.140.111] (helo=atenea.orcero.org)
	by infradead.org with esmtp (Exim 3.20 #2)
	id 14eFrv-0004yD-00
	for linux-cluster@nl.linux.org; Sat, 17 Mar 2001 12:31:39 +0000
Received: from localhost (localhost.localdomain [127.0.0.1])
	by atenea.orcero.org (8.11.0/8.11.0) with ESMTP id f2H9S3e01094;
	Sat, 17 Mar 2001 06:28:03 -0300
Date:   Sat, 17 Mar 2001 06:28:03 -0300 (BRT)
From:   David Santo Orcero <irbis@orcero.org>
To:     "David L. Nicol" <david@kasey.umkc.edu>
cc:     <linux-cluster@nl.linux.org>
Subject: Re: available resource declaration language(s)
In-Reply-To: <3AB2B8A3.13385A04@kasey.umkc.edu>
Message-ID: <Pine.LNX.4.30.0103170610350.988-100000@atenea.orcero.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list



 Hello, all!

> methods? For instance, it is easy to imagine a virtual ring architecture in
> which each node shares everything it knows about all other nodes in a larger
> packet which is sent around the ring and a node can only initiate a resource
> request when it has the token, for instance; or broadcast-based architectures

 This idea will work fine on small networks, but I strongly doubt that
something like this scales well. If you have only one token, you have some
technical problems; the three biggest ones are:

1) What will happen if a node hangs while it holds the packet? (It will
hang the whole parallel process in the cluster; you will need a complex
negotiation rule to create a new token, as in token ring networks.)

2) What will happen if a malicious/erroneous node sends a second token
into the network? (With a P2P protocol, a malicious/erroneous node has
fewer possibilities to damage the network; anyway, the Mosix solution is
also far from being safe.)

3) If the network is really BIG (500, 600 nodes), the delay to get the
token will be a problem. Somebody can say "buy a faster network", but it
is better not to force the user to spend more bucks because we use a worse
solution.
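
The delay in point 3 can be estimated with simple arithmetic. The Python
sketch below assumes an otherwise idle ring in which each node holds and
forwards the token for a fixed time; the function name and the per-hop time
used in the note are assumptions for illustration, not measurements.

```python
def token_wait_seconds(n_nodes, per_hop_seconds):
    """Return (mean, worst-case) wait for the token on an idle ring."""
    # Time for the token to make one full trip around the ring.
    rotation = n_nodes * per_hop_seconds
    # The token is equally likely to be anywhere, so the mean wait is half a trip.
    return rotation / 2, rotation
```

With 600 nodes and an assumed 1 ms per hop, token_wait_seconds(600, 0.001)
gives a mean wait of about 0.3 s and a worst case of about 0.6 s before a
node may even start a resource request.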


 Can the three problems be solved in a broadcast net, as proposed? No! If
you have enough nodes, you will flood the network; that is why I am
really sure that using broadcast features will not work.

 Personally I think that, independent of using the Mosix arch or a
completely different thing, the architecture and the protocols must be
peer-to-peer, and all negotiations must be distributed, using a random
query as the method of beginning a new P2P negotiation, exactly as Mosix
does. This will give us a scalable and failure-proof method to negotiate
the sharing of resources and to exchange information with other nodes.


 Yours:

David


---------------------------
http://www.orcero.org/irbis
    irbis@orcero.org
---------------------------

Linux-cluster: generic cluster infrastructure for Linux
Archive:       http://mail.nl.linux.org/linux-cluster/

From owner-linux-cluster@nl.linux.org Sat Mar 17 14:15:41 2001
Received: by humbolt.nl.linux.org id <S92258AbRCQNOw>;
	Sat, 17 Mar 2001 14:14:52 +0100
Received: from usw-dsl-225.102.denco.rmi.net ([166.93.225.102]:46578 "EHLO
        laptop.linux-ha.org") by humbolt.nl.linux.org with ESMTP
	id <S92257AbRCQNO3>; Sat, 17 Mar 2001 14:14:29 +0100
Received: from unix.sh (unknown [127.0.0.1])
	by laptop.linux-ha.org (Postfix on SuSE Linux 7.0 (i386)) with ESMTP
	id A279317BA1; Sat, 17 Mar 2001 06:14:12 -0700 (MST)
Message-ID: <3AB36324.D75A3888@unix.sh>
Date:   Sat, 17 Mar 2001 06:14:12 -0700
From:   Alan Robertson <alanr@unix.sh>
Organization: Linux-HA
X-Mailer: Mozilla 4.73 [en] (X11; I; Linux 2.2.16 i686)
X-Accept-Language: en
MIME-Version: 1.0
To:     David Santo Orcero <irbis@orcero.org>
Cc:     "David L. Nicol" <david@kasey.umkc.edu>, linux-cluster@nl.linux.org
Subject: Re: available resource declaration language(s)
References: <Pine.LNX.4.30.0103170610350.988-100000@atenea.orcero.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

David Santo Orcero wrote:
> 
>  Hello, all!


> 3) If the network is really BIG -500, 600 nodes- the delay to get the
> token will be a problem. Somebody can say -buy a faster network-; but it
> is better not to force to the user spend more bucks because we use a worse
> solution.
> 
>  Can the three problems be solved in a broadcast net, as proposed? No! If
> you have enough nodes, you will flood the network; that is why I am
> really sure that using broadcast features will not work.

This is a common misconception.  I've done the calculations on this, and
they don't support your assertion.  See the heartbeat design paper
referenced earlier:
	http://linux-ha.org/comm/HBdesign.pdf

Paraphrasing from that paper:
For a 1000 node system and a 150 byte heartbeat packet, and a 1 second
heartbeat interval, the bandwidth is approximately 1.2% of the bandwidth
available on an unswitched 100 Mbit network.

If you double the packet size, it would rise to 2.5%.  This is pretty small
for such a large cluster.  I would hope that such a large cluster would have
faster networking anyway ;-)
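
The arithmetic behind these figures can be checked with a minimal sketch.
It counts payload bits only and ignores Ethernet/IP/UDP framing and any
collisions; that simplification and the function name are assumptions of
this sketch, not something taken from the paper.

```python
def heartbeat_load_percent(nodes, packet_bytes, interval_s, link_bits_per_s):
    """Share of link capacity used when every node sends one packet per interval."""
    # Total bits put on the wire per second by all nodes together.
    bits_per_second = nodes * packet_bytes * 8 / interval_s
    return 100.0 * bits_per_second / link_bits_per_s


# 1000 nodes, 150-byte heartbeat, 1 second interval, 100 Mbit link.
print(heartbeat_load_percent(1000, 150, 1.0, 100e6))
# Same cluster with the packet size doubled.
print(heartbeat_load_percent(1000, 300, 1.0, 100e6))
```

Under these payload-only assumptions the doubled case comes out slightly
below the 2.5% quoted above, which may include per-packet overhead that the
sketch leaves out.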

	-- Alan Robertson
	   alanr@unix.sh


From owner-linux-cluster@nl.linux.org Sat Mar 17 15:26:48 2001
Received: by humbolt.nl.linux.org id <S92257AbRCQO0V>;
	Sat, 17 Mar 2001 15:26:21 +0100
Received: from [200.206.140.111] ([200.206.140.111]:13828 "EHLO
        atenea.orcero.org") by humbolt.nl.linux.org with ESMTP
	id <S92201AbRCQO0A>; Sat, 17 Mar 2001 15:26:00 +0100
Received: from localhost (localhost.localdomain [127.0.0.1])
	by atenea.orcero.org (8.11.0/8.11.0) with ESMTP id f2HBO9e01243;
	Sat, 17 Mar 2001 08:24:09 -0300
Date:   Sat, 17 Mar 2001 08:24:09 -0300 (BRT)
From:   David Santo Orcero <irbis@orcero.org>
To:     Alan Robertson <alanr@unix.sh>
cc:     "David L. Nicol" <david@kasey.umkc.edu>,
        <linux-cluster@nl.linux.org>
Subject: Re: available resource declaration language(s)
In-Reply-To: <3AB36324.D75A3888@unix.sh>
Message-ID: <Pine.LNX.4.30.0103170745370.1225-100000@atenea.orcero.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list



 Hello, All:

> >  Can the three problems be solved on a broadcast net, as proposed? No!
> > If you have enough nodes, you will flood the network; that is why I am
> > quite sure that using broadcast features will not work.
>
> This is a common misconception.  I've done the calculations on this, and

 No, this is not a misconception. It is my own experience developing
parallel software. OK, all of it is at user level, but the network cable
can't tell the difference between a packet generated by a kernel hacker
and a packet generated by a developer of scientific software.

 You are transmitting a short packet and waiting to see whether a NACK
arrives. Maybe you never need to establish a TCP connection, or even a
simple UDP transmission; ICMP packets are enough. Between this and a full
exchange of information about the system (uptime, system performance,
number of processes, number of empty slots in system tables, free memory
and so on) there is a HUGE difference. Test it. No matter how small your
packet is, you will have to use UDP or TCP. TCP is impossible due to
kernel table limitations, and UDP is a lot bigger than ICMP.



> For a 1000 node system and a 150 byte heartbeat packet, and a 1 second
> heartbeat interval, the bandwidth is approximately 1.2% of the bandwidth
> available on an unswitched 100 Mbit network.

 Did you calculate this mathematically, or did you run the test on
a cluster? The results may be completely different, due to collisions.
There is no such thing as "the amount of information that a channel can
transmit"; rather, a packet on the network occupies physical space, and
while one packet is traveling, the rest must stop. And on the most
widespread networks they don't stop, and they collide. Well, we can tell
the user "you can't use Ethernet to do clustering", but this will leave
most people out of the game.

> If you double the packet size, it would rise to 2.5%.  This is pretty small

 You are calculating mathematically, dividing the number of bits that
you transmit by the peak bit rate that the network can deliver! If you
double the packet size, the network usage on cheap networks _never_
multiplies by 2, because of collisions! As a clear example
that anybody can test, if you send 3Mb per second from one node and
another 3Mb per second from another node, it is not true that the
information will arrive at a third node at 6Mb/s! In fact, you will have
lots of collisions on the channel.


 Let's assume a network that allows broadcast of node information
and a full exchange of information between your 1000 nodes.
Remember that the MAINTENANCE of the data collected in a non-P2P solution
will also be a problem: O(n^2). Let's assume that you have a more
efficient algorithm, O(n). You will have the 1000 nodes constantly sending
information, you constantly sending information, and constant
modification of the table. Maybe it would be a good way to make
Linux as efficient as Amoeba.

  If you do not broadcast, and you do P2P with random polling, we will send
few packets per second on your 1000-node network, and we will load
the kernel with an O(k) algorithm.


 Yours:

David


---------------------
http://www.orcero.org
  irbis@orcero.org
---------------------



From owner-linux-cluster@nl.linux.org Sat Mar 17 17:30:33 2001
Received: by humbolt.nl.linux.org id <S92291AbRCQQaE>;
	Sat, 17 Mar 2001 17:30:04 +0100
Received: from cs.huji.ac.il ([132.65.16.10]:16093 "EHLO cs.huji.ac.il")
	by humbolt.nl.linux.org with ESMTP id <S92287AbRCQQ3r>;
	Sat, 17 Mar 2001 17:29:47 +0100
Received: from mos227.cs.huji.ac.il ([132.65.173.227] ident=exim)
	by cs.huji.ac.il with esmtp (Exim 3.20 #1)
	id 14eJZs-0005g8-00; Sat, 17 Mar 2001 18:29:16 +0200
Received: from amnon by mos227.cs.huji.ac.il with local (Exim 3.15 #1)
	id 14eJZs-0001uK-00; Sat, 17 Mar 2001 18:29:16 +0200
To:     cvsrep@wanadoo.es, linux-cluster@nl.linux.org
Subject: Re: available resource declaration language(s)
Cc:     amnon@cs.huji.ac.il
Message-Id: <E14eJZs-0001uK-00@mos227.cs.huji.ac.il>
From:   "Prof. Amnon Barak" <amnon@cs.huji.ac.il>
Date:   Sat, 17 Mar 2001 18:29:16 +0200
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

Hi all:

"David L. Nicol" <david@kasey.umkc.edu> wrote:

> Mosix uses a peer-to-peer architecture in which each node periodically
> queries a peer selected at random from its list of peers;

In MOSIX we researched and made extensive simulations of many information
dissemination schemes. We ended up with the following algorithms:
1. Each node continuously monitors its internal resources.
2. Once every unit of time (1 sec.) each node sends a message to
   either a random node (within the predefined cluster range) or
   to a node with which it communicated recently or to a historical
   node (with which it communicated in the not too distant past).
3. Each node maintains a (small revolving) "window to the world" by
   keeping the most recently arrived information messages. This ensures
   that when a node needs information about other nodes it looks in its
   own cache and does not need to send any inquiry messages.
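
The three steps above can be illustrated with a rough sketch. It is not
MOSIX code: the class and function names, the single "load" figure, the
window size and the 50/50 choice between a random peer and a recently
heard one are all assumptions made for illustration, and the "historical
node" case is left out.

```python
import random
from collections import OrderedDict


class Node:
    """One cluster node keeping a small revolving window of peer info."""

    def __init__(self, node_id, peers, window_size=8, rng=None):
        self.node_id = node_id
        self.peers = [p for p in peers if p != node_id]
        self.window_size = window_size
        self.window = OrderedDict()   # peer id -> most recently received load
        self.load = 0.0
        self.rng = rng or random.Random()

    def monitor(self, load):
        # Step 1: track local resources (here a single load figure).
        self.load = load

    def pick_target(self):
        # Step 2: a node we heard from recently, or else a random peer.
        if self.window and self.rng.random() < 0.5:
            return self.rng.choice(list(self.window))
        return self.rng.choice(self.peers)

    def receive(self, sender_id, load):
        # Step 3: keep only the newest window_size entries.
        self.window.pop(sender_id, None)
        self.window[sender_id] = load
        while len(self.window) > self.window_size:
            self.window.popitem(last=False)

    def least_loaded_known(self):
        # Answered from the local cache; no inquiry messages are sent.
        if not self.window:
            return None
        return min(self.window, key=self.window.get)


def tick(nodes):
    """One unit of time: every node sends exactly one message."""
    sent = 0
    for node in nodes.values():
        target = node.pick_target()
        nodes[target].receive(node.node_id, node.load)
        sent += 1
    return sent
```

Calling `tick()` repeatedly on a dict of `Node` objects keeps the message
count at one per node per unit of time while each window stays bounded.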

   As a result, MOSIX supports:
1. Scalability without overflowing the LAN (each node sends exactly
   one message each unit of time), no broadcasts;
2. A cluster with "sparse" node numbers, e.g. when a subset of nodes fail;
3. "Dynamic" clusters in which nodes leave or join at any time.
4. Flood prevention, e.g. when a new node joins an overloaded cluster.
   This is accomplished by the "gradual information dissemination" which
   slows down the propagation of info about the idle (or less loaded) new
   node, so only a small subset of other nodes attempt to approach the
   new node.
 
There are many more "hidden" algorithms that we researched and tested to
support process migration for load-balancing, e.g. machines with different
speeds or different free memory, etc. 

-Amnon
 The MOSIX team.


From owner-linux-cluster@nl.linux.org Sun Mar 18 06:57:44 2001
Received: by humbolt.nl.linux.org id <S92256AbRCRF5L>;
	Sun, 18 Mar 2001 06:57:11 +0100
Received: from usw-dsl-225.102.denco.rmi.net ([166.93.225.102]:5111 "EHLO
        laptop.linux-ha.org") by humbolt.nl.linux.org with ESMTP
	id <S92179AbRCRF42>; Sun, 18 Mar 2001 06:56:28 +0100
Received: from unix.sh (localhost [127.0.0.1])
	by laptop.linux-ha.org (Postfix on SuSE Linux 7.0 (i386)) with ESMTP
	id CD82B17BA6; Sat, 17 Mar 2001 22:56:14 -0700 (MST)
Message-ID: <3AB44DFE.5FC6CFAB@unix.sh>
Date:   Sat, 17 Mar 2001 22:56:14 -0700
From:   Alan Robertson <alanr@unix.sh>
Organization: Linux-HA
X-Mailer: Mozilla 4.73 [en] (X11; I; Linux 2.2.16 i686)
X-Accept-Language: en
MIME-Version: 1.0
To:     David Santo Orcero <irbis@orcero.org>
Cc:     "David L. Nicol" <david@kasey.umkc.edu>, linux-cluster@nl.linux.org
Subject: Re: available resource declaration language(s)
References: <Pine.LNX.4.30.0103170745370.1225-100000@atenea.orcero.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

David Santo Orcero wrote:
> 

[snip] 

>  You are transmitting a short packet and waiting to see whether a NACK
> arrives. Maybe you never need to establish a TCP connection, or even a
> simple UDP transmission; ICMP packets are enough. Between this and a full
> exchange of information about the system (uptime, system performance,
> number of processes, number of empty slots in system tables, free memory
> and so on) there is a HUGE difference. Test it. No matter how small your
> packet is, you will have to use UDP or TCP. TCP is impossible due to
> kernel table limitations, and UDP is a lot bigger than ICMP.

What would make you think about TCP?  Certainly neither I, nor the article
mentioned it, nor is it in the code, and I certainly wouldn't recommend it.

Cluster membership already requires this keepalive data.  You just piggyback
this other stuff on top of it.  Zero additional packets are sent for this
data.  I certainly wouldn't recommend the approach you are talking about
either ;-).    I have *one* connection per machine for all control packets -
TOTAL.  Not "n" connections per machine.  TCP is stupid for this for lots of
reasons - like being O(N**2), etc.  The overhead in kernel connection tables
doesn't go up with the number of machines in the cluster AT ALL.  It is
constant - O(1).

It uses a multicast protocol.  You can run this protocol on top of udp
broadcast, udp multicast, or (for really small clusters) serial ports, or
whatever unreliable transport you want to run it on.  The way it works, is
that all control data is multiplexed across a single multicast channel -
which minimizes the overhead.  This has other advantages that I don't have
time to go into here. 

Read the article.  Read the email you responded to ;-).  Better yet, try the
code.  I'd be delighted to hear how it works for you.

Too bad Wombat (Peter Badovinatz) is in Australia on vacation, so he can't
comment on it any time soon..  This is pretty similar to the proven approach
used by IBM's Phoenix clustering software.  They deploy clusters of 500 or
more machines quite successfully.  I'm sure he'll be quite disappointed to
learn it doesn't work ;-)

> > For a 1000 node system and a 150 byte heartbeat packet, and a 1 second
> > heartbeat interval, the bandwidth is approximately 1.2% of the bandwidth
> > available on an unswitched 100 Mbit network.
> 
>  Did you calculate this mathematically, or did you run the test on
> a cluster? The results may be completely different, due to collisions.
> There is no such thing as "the amount of information that a channel can
> transmit"; rather, a packet on the network occupies physical space, and
> while one packet is traveling, the rest must stop. And on the most
> widespread networks they don't stop, and they collide. Well, we can tell
> the user "you can't use Ethernet to do clustering", but this will leave
> most people out of the game.

Collisions are rarely a problem in a properly configured full-duplex
switched network. A switch is cheap.  For example, the 24-port 100-mbit
full-duplex switch on my home network cost about $300 USD.  If you have a
cluster - buy a switch.  Even an expensive switch costs less than a single
node.  Without a switch you can't really put together any kind of cluster.

> > If you double the packet size, it would rise to 2.5%.  This is pretty small
> 
>  You are calculating mathematically, dividing the number of bits that
> you transmit by the peak bit rate that the network can deliver! If you
> double the packet size, the network usage on cheap networks _never_
> multiplies by 2, because of collisions! As a clear example
> that anybody can test, if you send 3Mb per second from one node and
> another 3Mb per second from another node, it is not true that the
> information will arrive at a third node at 6Mb/s! In fact, you will have
> lots of collisions on the channel.

Why aren't you running full-duplex switches?  They *are* cheap.

I certainly understand that you can't run any Ethernet channel full blast. 
No problem.  But you can run it *lots* faster than 1.2% full.  Or, for the
smaller clusters you were talking about, 0.6% full.  If adding 0.6% to the
network load causes you a problem, your network is too close to the edge,
and will be in trouble in a few days when the load grows, even if all this
traffic is removed.

>  Let's assume a network that allows broadcast of node information
> and a full exchange of information between your 1000 nodes.
> Remember that the MAINTENANCE of the data collected in a non-P2P solution
> will also be a problem: O(n^2).

Each machine receives "n" updates per period of time.  Updating an in-memory
table is pretty cheap.  If you code it right, the time to update an entry
for a node is constant, so the total overhead is O(n).
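
A minimal sketch of such a table, assuming a plain dictionary keyed by
node name; the class name, the `dead_after_s` timeout and the explicit
`now` parameter are illustrative choices, not taken from the heartbeat
code.

```python
import time


class MembershipTable:
    """In-memory table of last-heard times, keyed by node name."""

    def __init__(self, dead_after_s=3.0):
        self.dead_after_s = dead_after_s
        self.last_seen = {}   # node name -> timestamp of latest heartbeat

    def on_heartbeat(self, node, now=None):
        # A dict store takes constant time on average, so handling the
        # n heartbeats that arrive in one period costs O(n) in total.
        self.last_seen[node] = time.monotonic() if now is None else now

    def dead_nodes(self, now=None):
        # One linear scan per check, which is also O(n).
        now = time.monotonic() if now is None else now
        return sorted(name for name, seen in self.last_seen.items()
                      if now - seen > self.dead_after_s)
```

Passing `now` explicitly makes the behaviour easy to check by hand; left
out, the table falls back to the monotonic clock.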

> Let's assume that you have a more
> efficient algorithm, O(n).

Good assumption. See above.

> You will have the 1000 nodes constantly sending
> information, you constantly sending information, and constant
> modification of the table. Maybe it would be a good way to make
> Linux as efficient as Amoeba.

What an amazing paragraph!

In all cases you have 1000 nodes sending updates constantly.  Each is doing
1 update per unit time.  In all cases you have machines updating the tables
constantly.  In my case, each machine performs "n" in-memory updates per
unit time.

How will you know which machines are working and which aren't unless you
have some kind of keepalive or heartbeat?  This function (cluster
membership) is a necessary function, unless unreliable clusters are the only
kind you are interested in.

If your machines are not constantly sending information, then they're idle. 
Sending information is not evil.  All methods under discussion send AT LEAST
one update per unit time.  That's what heartbeat does - one update packet
per unit time.

>   If you do not broadcast, and you do P2P with random polling, we will
> send few packets per second on your 1000-node network, and we will load
> the kernel with an O(k) algorithm.

Of course it multicasts. (?!?).

If you change from multicast to unicast, you don't decrease the number of
packets sent, but poor implementations can easily increase it ;-).  What you
do *clearly* change is the number of packets *received* by each node. 
That's the improvement that you get from MOSIX's method.

Receiving fewer packets is nice.  On the other hand, the implementation
complexity is higher, the latency on receiving changes in information from
nodes is higher, and you will have much greater difficulty telling quickly
and reliably if a node leaves the cluster unexpectedly.  This latter piece
is the single most important property for a high-availability cluster.
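
The sent-versus-received distinction can be put into numbers with a small
sketch. It assumes no packet loss, exactly one packet per node per period,
and unicast targets chosen uniformly at random, so that each node receives
one packet on average; the function name is hypothetical.

```python
def per_period_counts(n, scheme):
    """Return (packets sent in total, packets received per node) per period."""
    if scheme == "multicast":
        # One multicast per node, seen by every other node.
        return n, n - 1
    if scheme == "unicast":
        # One unicast per node to a single peer; one arrival per node on average.
        return n, 1
    raise ValueError("unknown scheme: %r" % (scheme,))


for scheme in ("multicast", "unicast"):
    print(scheme, per_period_counts(1000, scheme))
```

The number of packets sent is the same in both cases; only the per-node
receive count differs.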

============ Now to the important part of this note ;-)   ==============

The thing I emphasized the most in my initial post was that whatever method
the applications use to get this data must be standardized through a single
agnostic API.  This discussion points out *clearly* why I believe this quite
passionately.

The Mosix method has some nice properties.  The method I use has different
nice properties.  But neither method has all the nice properties at once. 
One causes fewer packet receptions but has high latency, poor membership
properties, and more complex code.  One is simpler, has lower latency, good
membership properties, but causes more packet receptions.

One method is wonderful for some environments, the other wonderful for other
environments.  Each works very well in certain niches, and the niches they
work best in aren't the same.  This is normal.  It's fine.  In fact, it's
probably good!

The most important conclusion I draw from this interchange is that we MUST
create a framework into which we can plug various methods, and have the
client applications not care at all.  If we create such a framework, then
the technologies can fight it out, and the winner will always be the user. 
And when someone comes out with an even better method for doing this, or one
that serves a particular niche better, then we can just plug it in, and get
the benefits immediately.
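
One possible shape for such a framework, as a sketch only: every name here
(`ClusterInfoBackend`, `register_backend`, `open_cluster`, the "load" key)
is hypothetical and does not correspond to an existing API. A backend could
wrap a multicast heartbeat, a MOSIX-style exchange or anything else; the
client code only sees the two abstract methods.

```python
from abc import ABC, abstractmethod


class ClusterInfoBackend(ABC):
    """Hypothetical plug-in interface; not an existing API."""

    @abstractmethod
    def members(self):
        """Return the names of nodes currently believed to be up."""

    @abstractmethod
    def resources(self, node):
        """Return a dict of resource figures for one node."""


class StaticBackend(ClusterInfoBackend):
    """Stand-in backend fed from a fixed dict, for illustration only."""

    def __init__(self, table):
        self._table = dict(table)

    def members(self):
        return sorted(self._table)

    def resources(self, node):
        return dict(self._table[node])


_backends = {}


def register_backend(name, factory):
    # Each information-gathering method registers itself under a name.
    _backends[name] = factory


def open_cluster(name, *args, **kwargs):
    # Clients pick a backend by name (e.g. from configuration) and from
    # then on use only the ClusterInfoBackend methods.
    return _backends[name](*args, **kwargs)


def least_loaded(cluster):
    """Client code: works unchanged with any registered backend."""
    return min(cluster.members(), key=lambda n: cluster.resources(n)["load"])
```

Swapping the method then means registering a different factory; callers of
`least_loaded()` would not need to change.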

	-- Alan Robertson
	   alanr@unix.sh


From owner-linux-cluster@nl.linux.org Sun Mar 18 15:36:52 2001
Received: by humbolt.nl.linux.org id <S92322AbRCROgV>;
	Sun, 18 Mar 2001 15:36:21 +0100
Received: from [200.206.140.111] ([200.206.140.111]:10759 "EHLO
        atenea.orcero.org") by humbolt.nl.linux.org with ESMTP
	id <S92318AbRCROfk>; Sun, 18 Mar 2001 15:35:40 +0100
Received: from localhost (localhost.localdomain [127.0.0.1])
	by atenea.orcero.org (8.11.0/8.11.0) with ESMTP id f2IBXfe03368;
	Sun, 18 Mar 2001 08:33:41 -0300
Date:   Sun, 18 Mar 2001 08:33:41 -0300 (BRT)
From:   David Santo Orcero <irbis@orcero.org>
To:     Alan Robertson <alanr@unix.sh>
cc:     "David L. Nicol" <david@kasey.umkc.edu>,
        <linux-cluster@nl.linux.org>
Subject: Re: available resource declaration language(s)
In-Reply-To: <3AB44DFE.5FC6CFAB@unix.sh>
Message-ID: <Pine.LNX.4.30.0103180716510.3147-100000@atenea.orcero.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list



 Hello, all!

 (first of all: yes, I read your paper. Yes, I read your message. Anyway,
I can see from your answer that we are talking about two completely
different things; HA and HP are two different worlds)


> What would make you think about TCP?  Certainly neither I, nor the article
> mentioned it, nor is it in the code, and I certainly wouldn't recommend it.


 It is a protocol that is used in high-performance cluster applications.
You did not talk about it; I brought it up only to keep it out of the
conversation. We agree on that point, so there is no problem there.

> Read the article.  Read the email you responded to ;-).  Better yet, try the
> code.  I'd be delighted to hear how it works for you.

 Yes, I read it, and your full paper. And I think that we are making
different assumptions. You are thinking of HA with a switch. In this case,
your approach is PERFECT. No further discussion.

 I am talking about HP clusters, and with low budgets. This is another,
completely different world, with different needs.

> Collisions are rarely a problem in a properly configured full-duplex
> switched network. A switch is cheap.  For example, the 24-port 100-mbit
> full-duplex switch on my home network cost about $300 USD.  If you have a

 (this is only to explain my position. Yours will be different)

  First problem. A switch costs more than $500 USD here in Brazil, and in
the north-northwest it is more expensive. We have a 68% import tax here.
And switches cannot be found in all regions of Brazil.

 Second problem. It is not one switch for the entire cluster. And we are
not talking about latest-generation Digital nodes. Here the best
MIPS/$ ratio is the K6-II. Really, there is not much else
to buy (maybe in the south, but you cannot find latest-generation
processors in all of Brazil). That means more nodes, and that means more
than one switch.

 Switches were discarded in the first budget. Maybe in another country, or
with another kind of problem. Yes, we can afford some switches, but that is
the cost of one disk node or two diskless nodes.


> Why aren't you running full-duplex switches?  They *are* cheap.

  It depends on where you are; see the rationale below.

> traffic by .6% of busy causes you a problem, your network is too close to
> the edge, and will be in trouble in a few days when the load grows even if
> all this traffic is removed.


 Well, here is the point. All my nodes are overloaded, and my network is at
the edge. At night, I also load the teaching computer labs.

(Rationale) In my research field, protein structure, the first takes
all. It doesn't matter what you do; if somebody sends the structure
to the PDB before you, you will not be published. That means that the work
of a group of nearly a dozen researchers goes in the trash, along with all
the money spent on reagents and so on. If it happens at some point during
the year, some people in the group may lose their grants. Some research
groups in some countries (US, Canada) have really giant clusters, and
budgets three orders of magnitude bigger than my whole university's. It is
like racing a Beetle against an F1 car. And sometimes we win with our
Beetle.

 That means that most of the money goes to processors and memory, and we
must use our resources at 100%. 98% is not enough. And that is the
battlefield of HP.

 Maybe your battlefield, HA, is completely different, but a plain broadcast
solution can't work for HP. If you have an HP solution with
broadcasting, you have a better one with P2P. At least, that is my own
experience with HP clusters. Yours may be different.

> Each machine receives "n" updates per period of time.  Updating an in-memory
> table is pretty cheap.  If you code it right, the time to update an entry
> for a node is constant, so the total overhead is O(n).


 Yes, if your work is only to see whether a node is working, it is O(n).
But if you have to find the best node that fits a set of constraints to
send a piece of work to, it is not as easy as updating a table.

> How will you know which machines are working and which aren't unless you
> have some kind of keepalive or heartbeat?  This function (cluster
> membership) is a necessary function, unless unreliable clusters are the only
> kind you are interested in.

 On this particular point, you are right. Most of the nodes of my cluster
are unreliable because of the owner of the machine, who can switch it off
at any time. I resolve this at application level, and I think that this is
a problem to be solved at application level. But most HP clusters
are in the same situation: non-dedicated hardware. The budget is always a
problem; most of us spent our whole budget on a core cluster, and
after that we try to use pieces of CPU time on non-dedicated machines.

 Anyway, I think that in an HP cluster it is not so important to know at
application level who is working and who is not. Having this be
transparent is marvellous, and anybody who has had to configure a Beowulf
PVM cluster knows why. ;-)

 On HP clusters, you are not sending heartbeats. You are sending
information about the nodes and LOTS of information about the problem that
you are solving.


> If your machines are not constantly sending information, then they're idle.

 My machines are never idle.

> and reliably if a node leaves the cluster unexpectedly.  This latter piece
> is the single most important property for a high-availability cluster.


 This sentence is what showed me that while I am playing soccer, you
are playing basketball. The rules are different. ;-)

> The thing I emphasized the most in my initial post was that whatever method
> applications use to get this data must be standardized through a single
> agnostic API.  This discussion points out *clearly* why I believe this quite
> passionately.


 I am beginning to think that we will need two different APIs, which will
be different kernel options. One for HA clusters and the other for HP
clusters. They are SO different (you are showing me this) that I find it
quite difficult to do an HP+HA API.


> The most important conclusion I draw from this interchange is that we MUST
> create a framework into which we can plug various methods, and have the
> client applications not care at all.  If we create such a framework, then
> the technologies can fight it out, and the winner will always be the user.

 Perfect. I am not going to talk about HA, but HP; and I think that in
that case the framework should follow these guidelines (this is my
proposal). I will use as reference the four things that I have used most:
MPI, PVM, Mosix and Beowulf. The four are completely different things,
but I will not talk about implementation, only features; it is a wish list.

1) The cluster has to have a semantics. PVM has a semantics, MPI has a
semantics, Mosix has a semantics. Maybe the Mosix one is better: the whole
cluster is shown to the user as an SMP machine. Mosix does this, thus it
is possible. Maybe one of the hot points of the discussion is deciding
which semantics is better for an HP cluster.

2) There should be an efficient method to send a task, from the beginning,
to the least loaded node of the cluster. PVM has this, Mosix does not: in
Mosix the task can migrate after being launched, but its kernel part will
be executed on the launch node.

3) It should be portable between different Linux architectures. Mosix is
not, the others are. (For me, it does not matter; but I know groups that
would find this great.)

4) The network should be as transparent as we can make it. Mosix is great
at this, PVM and MPI do a good job, and Beowulf does nothing.

5) It must allow cheap hardware to run efficiently. This rules out
broadcast protocols, sorry. ;-)

6) Migrating running tasks is great. Mosix does a good job here, but not a
perfect one: code using sockets and shared memory cannot migrate.

 I think that fault tolerance must be resolved at user level. HP has
lots of completely different software, and a general solution can be an
unacceptable overhead (remember that in HP we are always at the limit of
the machines).

 Well, it is only my opinion. I am only a developer of HP applications and
maintainer of a really weird cluster (with dedicated and non-dedicated
nodes), but maybe it gives some insight to somebody.


 Yours:

David


---------------------
http://www.orcero.org
  irbis@orcero.org
---------------------



From owner-linux-cluster@nl.linux.org Sun Mar 18 16:23:16 2001
Received: by humbolt.nl.linux.org id <S92303AbRCRPWq>;
	Sun, 18 Mar 2001 16:22:46 +0100
Received: from c004-h015.c004.sfo.cp.net ([209.228.14.102]:28051 "HELO
        c004.sfo.cp.net") by humbolt.nl.linux.org with SMTP
	id <S92301AbRCRPWT>; Sun, 18 Mar 2001 16:22:19 +0100
Received: (cpmta 21526 invoked from network); 18 Mar 2001 07:22:12 -0800
Received: from adsl-151-203-49-173.bostma.adsl.bellatlantic.net (HELO jdarcy6986nk) (151.203.49.173)
  by smtp.namezero.com (209.228.14.102) with SMTP; 18 Mar 2001 07:22:12 -0800
X-Sent: 18 Mar 2001 15:22:12 GMT
Message-ID: <075901c0afbe$8ed896e0$bd7b9fa8@lss.emc.com>
From:   "Jeff Darcy" <jeff@tambreet.com>
To:     "Alan Robertson" <alanr@unix.sh>,
        "David Santo Orcero" <irbis@orcero.org>
Cc:     "David L. Nicol" <david@kasey.umkc.edu>,
        <linux-cluster@nl.linux.org>
References: <Pine.LNX.4.30.0103170745370.1225-100000@atenea.orcero.org> <3AB44DFE.5FC6CFAB@unix.sh>
Subject: Re: available resource declaration language(s)
Date:   Sun, 18 Mar 2001 10:17:32 -0500
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 5.50.4133.2400
X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4133.2400
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

> If you have a
> cluster - buy a switch.

Better yet, buy two so you don't have a SPOF.


A few comments on the larger issue of how to distribute information...

Broadcast is anathema to many network administrators, and in some networks
it's simply not available, so arguments about X% of total network bandwidth
are meaningless.  Having every node send to N peers is pretty non-scalable
too.  A ring, tree, mesh, torus, or N-dimensional cube is likely to be much
more efficient, and the difficulty of maintaining any of these structures
can be surprisingly low.  At a theoretical level I'm a little leery of
MOSIX-like diffusion models that do not provide a *guarantee* that
information will get from X to Y in a timely manner, but as a practical
matter it seems to work quite well and it certainly scales.

Should resource information be piggybacked on heartbeats?  Probably not,
IMO.  They have different requirements for reliability, bounded delivery
time, etc.  Heartbeats should remain reasonably small and must be processed
expeditiously, whereas resource advertisements might be arbitrarily large
and are not time-critical.  Having a receiver delay handling of heartbeat N
because it's still processing the resource information from heartbeat N-1
would be unacceptable, and you just know that many receiver implementations
would end up doing just that.

There's enough in common between the two problems that maybe there's some
common mechanism that could be used for both, perhaps using a QoS/priority
notion to deal with the differences.  But just dumping the resource info on
top of the heartbeats strikes me as a mistake.
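
A rough sketch of the QoS/priority notion, assuming a single receive queue
in which heartbeats always overtake resource advertisements; the message
kinds, the class name and the payload format are invented for illustration.

```python
import heapq
import itertools

HEARTBEAT, RESOURCE = 0, 1   # lower value is handled first


class ReceiveQueue:
    """Priority queue that serves heartbeats before resource messages."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # keeps arrival order within a kind

    def push(self, kind, sender, payload=None):
        # The unique sequence number also guarantees that payloads are
        # never compared when two entries have the same kind.
        heapq.heappush(self._heap, (kind, next(self._seq), sender, payload))

    def pop(self):
        kind, _, sender, payload = heapq.heappop(self._heap)
        return kind, sender, payload

    def __len__(self):
        return len(self._heap)
```

With this ordering a backlog of large resource messages cannot delay a
heartbeat that is already queued, although a message that is currently
being processed still has to finish first.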




From owner-linux-cluster@nl.linux.org Sun Mar 18 20:14:01 2001
Received: by humbolt.nl.linux.org id <S92306AbRCRTNW>;
	Sun, 18 Mar 2001 20:13:22 +0100
Received: from gw.xkey.com ([206.86.100.52]:65287 "EHLO happy.xkey.com")
	by humbolt.nl.linux.org with ESMTP id <S92243AbRCRTMt>;
	Sun, 18 Mar 2001 20:12:49 +0100
Received: (from smtp@localhost) by happy.xkey.com
	id LAA29369; Sun, 18 Mar 2001 11:12:46 -0800
Received: from ip179.frederick.md.pub-ip.psi.net(38.14.105.179) by happy.xkey.com via smtp (V1.3)
	id sma029364; Sun Mar 18 11:12:42 2001
Received: (from lindahl@localhost)
	by localhost.hpti.com (8.11.0/8.11.0) id f2IJDTw08001;
	Sun, 18 Mar 2001 14:13:29 -0500
X-Authentication-Warning: localhost.hpti.com: lindahl set sender to lindahl@conservativecomputer.com using -f
Date:   Sun, 18 Mar 2001 14:13:29 -0500
From:   Greg Lindahl <lindahl@conservativecomputer.com>
To:     "David L. Nicol" <david@kasey.umkc.edu>
Cc:     linux-cluster@nl.linux.org
Subject: Re: available resource declaration language(s)
Message-ID: <20010318141328.A6018@wumpus.hpti.com>
Mail-Followup-To: "David L. Nicol" <david@kasey.umkc.edu>,
	linux-cluster@nl.linux.org
References: <3AB2B8A3.13385A04@kasey.umkc.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <3AB2B8A3.13385A04@kasey.umkc.edu>; from david@kasey.umkc.edu on Fri, Mar 16, 2001 at 07:06:43PM -0600
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

On Fri, Mar 16, 2001 at 07:06:43PM -0600, David L. Nicol wrote:

> Could the way the clustered machines find out about each other be
> standardized?

I doubt it; existing protocols are quite diverse and attempts to
standardize it have not succeeded.

Legion, for example, has objects called "collections", which keep a
database of resource availability for a set of machines, updated by
both push and pull protocols. A machine may be in more than one
collection, and you can have meta-collections. This provides a very
flexible ability to create personalized collections for the hosts
you're allowed to use, etc etc.

Globus stores resource availability info in one or more LDAP
databases. I don't know if it's updated by push or pull.

In the clusters I build, resource availability is maintained by the
queue system (PBS), which periodically pulls the info from the
nodes. That's the traditional way to do it.
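
For illustration only, the pull model looks roughly like the Python
sketch below. It is not PBS code: the class, the field names and the
probe callables are all invented for this example, and a real queue
system would query a daemon on each node instead of calling a function.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    free_cpus: int
    load: float


class PullCollector:
    """Central table refreshed by asking each node for its status."""

    def __init__(self, probes):
        # probes: {node name: zero-argument callable returning NodeStatus}
        self.probes = probes
        self.table = {}

    def poll_once(self):
        """One polling round; a scheduler would call this periodically."""
        for name, probe in self.probes.items():
            try:
                self.table[name] = probe()
            except OSError:
                # An unreachable node is dropped from the availability table.
                self.table.pop(name, None)

    def available(self, cpus_needed):
        """Names of nodes that reported at least cpus_needed free CPUs."""
        return sorted(name for name, status in self.table.items()
                      if status.free_cpus >= cpus_needed)
```

A push design would invert this: each node would send its NodeStatus to
the collector on its own schedule.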

> Has anyone done any serious simulations of the efficiency of various
> discovery methods? For instance, it is easy to imagine a virtual
> ring architecture in which each node shares everything it knows
> about all other nodes in a larger packet which is sent around the
> ring

I suspect that other issues like reliability play a more important
role than efficiency.

> Thoughts?  Pointers to masters' theses?

I believe that both Legion and Globus have published papers about
their resource discovery and scheduling algorithms.

-- greg



From owner-linux-cluster@nl.linux.org Sun Mar 18 21:52:00 2001
Received: by humbolt.nl.linux.org id <S92304AbRCRUvZ>;
	Sun, 18 Mar 2001 21:51:25 +0100
Received: from perninha.conectiva.com.br ([200.250.58.156]:35588 "EHLO
        postfix.conectiva.com.br") by humbolt.nl.linux.org with ESMTP
	id <S92297AbRCRUuw>; Sun, 18 Mar 2001 21:50:52 +0100
Received: from burns.conectiva (burns.conectiva [10.0.0.4])
	by postfix.conectiva.com.br (Postfix) with SMTP id D1C3B16B11
	for <linux-cluster@nl.linux.org>; Sun, 18 Mar 2001 17:50:49 -0300 (EST)
Received: (qmail 12811 invoked by uid 0); 18 Mar 2001 20:50:07 -0000
Received: from dial11.ras.conectiva (HELO imladris.rielhome.conectiva) (root@10.0.8.11)
  by burns.conectiva with SMTP; 18 Mar 2001 20:50:07 -0000
Received: from localhost (riel@localhost)
	by imladris.rielhome.conectiva (8.11.2/8.11.2) with ESMTP id f2IKhjE13779;
	Sun, 18 Mar 2001 17:43:46 -0300
X-Authentication-Warning: imladris.rielhome.conectiva: riel owned process doing -bs
Date:   Sun, 18 Mar 2001 17:43:45 -0300 (BRST)
From:   Rik van Riel <riel@conectiva.com.br>
X-Sender: riel@imladris.rielhome.conectiva
To:     David Santo Orcero <irbis@orcero.org>
Cc:     Alan Robertson <alanr@unix.sh>,
        "David L. Nicol" <david@kasey.umkc.edu>, linux-cluster@nl.linux.org
Subject: Re: available resource declaration language(s)
In-Reply-To: <Pine.LNX.4.30.0103180716510.3147-100000@atenea.orcero.org>
Message-ID: <Pine.LNX.4.21.0103181738490.13050-100000@imladris.rielhome.conectiva>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

On Sun, 18 Mar 2001, David Santo Orcero wrote:

>  I am beginning to think that we will need two different APIs, that
> will be different kernel options. One is HA clusters and the other HP
> clusters. They are SO different -you are showing me this- that I find
> it quite difficult to do a HP+HA API.

No need to have a common API.  The important part (if we want
to avoid duplicate work) is sharing _components_. Whether they
are heartbeat, lock manager or data sharing/replication mechanisms
doesn't matter as long as we manage to avoid too much duplication
of effort.

The main point in this list is to give the clustering projects
a forum to share each other's components, instead of having each
of the projects reinvent their wheels in various incompatible
ways ;)

Btw, a #clustering channel on irc.openprojects.net has been
started, it could be a nice place to hang out ...

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/



From owner-linux-cluster@nl.linux.org Mon Mar 19 05:18:43 2001
Received: by humbolt.nl.linux.org id <S92284AbRCSESL>;
	Mon, 19 Mar 2001 05:18:11 +0100
Received: from usw-dsl-225.102.denco.rmi.net ([166.93.225.102]:38140 "EHLO
        laptop.linux-ha.org") by humbolt.nl.linux.org with ESMTP
	id <S92282AbRCSERm>; Mon, 19 Mar 2001 05:17:42 +0100
Received: from unix.sh (unknown [127.0.0.1])
	by laptop.linux-ha.org (Postfix on SuSE Linux 7.0 (i386)) with ESMTP
	id D4F47178D8; Sun, 18 Mar 2001 21:17:24 -0700 (MST)
Message-ID: <3AB58854.20112D06@unix.sh>
Date:   Sun, 18 Mar 2001 21:17:24 -0700
From:   Alan Robertson <alanr@unix.sh>
Organization: Linux-HA
X-Mailer: Mozilla 4.73 [en] (X11; I; Linux 2.2.16 i686)
X-Accept-Language: en
MIME-Version: 1.0
To:     David Santo Orcero <irbis@orcero.org>
Cc:     "David L. Nicol" <david@kasey.umkc.edu>, linux-cluster@nl.linux.org
Subject: Re: available resource declaration language(s)
References: <Pine.LNX.4.30.0103180716510.3147-100000@atenea.orcero.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

David Santo Orcero wrote:
> 
>  Hello, all!


[snip]

>  I am beginning to think that we will need two different APIs, that will be
> different kernel options. One is HA clusters and the other HP clusters. They
> are SO different -you are showing me this- that I find it quite difficult to
> do a HP+HA API.

I would suggest that we not give up quickly on the idea of having common
components and APIs.  I believe that some implementations of components will
be better for some applications and configurations than others, but in many
cases they serve the same purpose - but in different ways.

And, just as you pointed out, my view isn't sufficient for everyone, and I'm
sure you would agree that your view isn't sufficient for everyone either.

For example, we've talked about membership.  Most (all?) cluster
applications need some form of membership services.  But different
implementations of membership services provide different characteristics in
their implementation.  By analogy, one might call them quality of service
(QOS).  Using QOS as an analogy, most applications need networking, but some
need low latency, and others need predictable packet times, and others need
high bandwidth.

But, most use TCP/IP for the transport, in spite of their different needs. 
This analogy is stretched a little thin, but there are similarities.

In our case, your cluster might need a very low bandwidth solution, and mine
might need quick discovery of dead nodes.  But - we both need to be able to
tell what machines are in the cluster, and what ones are out of it.

So, if we design a framework which would allow us to plug in different
loadable modules to provide these services, then one could assemble a
cluster out of one's favorite components - and create a solution which
solves one's problem better than any fixed solution can.

Identifying the right components and designing a sufficiently flexible and
lightweight and general framework (APIs, base software, etc.) is not simple,
unfortunately.
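
As a rough sketch of what "plug in different modules" could mean for
membership, consider the Python below. Every name in it is hypothetical
(no existing project's API is being described), and the two plug-ins are
deliberately simplistic: one generates no traffic, the other trades
heartbeat traffic for quicker detection of dead nodes.

```python
import time
from abc import ABC, abstractmethod


class MembershipService(ABC):
    """The only thing client code is allowed to depend on."""

    @abstractmethod
    def members(self) -> set:
        ...


class StaticMembership(MembershipService):
    """Low-bandwidth choice: membership is a fixed list, no messages."""

    def __init__(self, nodes):
        self._nodes = set(nodes)

    def members(self):
        return set(self._nodes)


class HeartbeatMembership(MembershipService):
    """Quick-detection choice: a node is a member while heartbeats are recent."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self._last_seen = {}

    def heartbeat(self, node):
        self._last_seen[node] = self.clock()

    def members(self):
        now = self.clock()
        return {node for node, seen in self._last_seen.items()
                if now - seen <= self.timeout}


def schedulable_nodes(membership: MembershipService):
    """Client code: works unchanged with either plug-in."""
    return sorted(membership.members())
```

The client function never learns which implementation it was handed,
which is the property the framework would need to preserve.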

> > The most important conclusion I draw from this interchange is that we MUST
> > create a framework into which we can plug various methods, and have the
> > client applications not care at all.  If we create such a framework, then
> > the technologies can fight it out, and the winner will always be the user.
> 
>  Perfect. I am not going to talk about HA, but HP; and I think that in
> that case the framework would have the following guidelines -it is my
> proposal-. I will use as reference the four things that I have used most
> -MPI, PVM, Mosix and Beowulf-. The four are completely different things,
> but I will not talk about implementation, but features; it is a wish list.

> 
> 1) The cluster has to have a semantics. PVM has a semantics, MPI has a
> semantics, Mosix has a semantics. Maybe the Mosix one is better -the whole
> cluster is shown to the user as an SMP machine-. Mosix does this, thus it
> is possible. Maybe one of the hot points of the discussion is deciding
> which semantics is better for an HP cluster.
> 
> 2) There should be an efficient method to send a task from the beginning to
> the least loaded node of the cluster. PVM has this, Mosix does not -in
> Mosix the task can migrate after being launched, but its kernel part will
> be executed on the launch node-.

A cluster batch scheduler presumably could be a help here...

> 3) It should be portable between different Linux architectures. Mosix is
> not, the others are. (For me, it does not matter; but I know groups that
> will find this great).
> 
> 4) The network should be as transparent as we can make it. Mosix is great
> for this, PVM and MPI do a good job, and Beowulf does nothing.
> 
> 5) It must allow cheap hardware to run efficiently. This rules out
> broadcast protocols, sorry. ;-)

I would state this differently.  It must be possible to assemble a set of
components that allows it to work efficiently on cheap hardware.  I would
also argue that it must be possible to assemble a set of components that
allow it to take advantage of clusters with more bandwidth.

There is also a class of applications (like weather prediction) where the
system needs to be HA/HPC.  The US weather bureau wants to perform a set of
calculations and always have it finish on time, including automatically
completing successfully when nodes fail in the middle.  If they need to buy
more hardware for redundancy, they will.  There are other examples as well.
> 
> 6) Migrating running tasks is great. Mosix does a good job here, but not
> a perfect one -sockets and shared memory code cannot migrate-.

For HA, automatic process migration in the kernel is a hindrance - not a
help.  It makes it difficult to figure out what has failed, and to restart
it on still-working nodes.  MOSIX's current implementation is what I'd call
low-availability: For a 2-node cluster, having one node fail can cause all
processes on both nodes to die.  This is not good for HA.

It also makes performance unpredictable.  If you're short on cycles (like
you describe yourself), then this can be a big problem. 
Application-directed restarts are much harder, but often better performing
as well.  If you have more human than technological resources, this might be
a better choice.

Nevertheless, there is a class of services which all clusters have in
common.

The following examples come to mind:
You need control communication, you need membership, you need high-bandwidth
communication, reset services, etc.

	-- Alan Robertson
	   alanr@unix.sh


From owner-linux-cluster@nl.linux.org Mon Mar 19 14:13:03 2001
Received: by humbolt.nl.linux.org id <S92308AbRCSNMa>;
	Mon, 19 Mar 2001 14:12:30 +0100
Received: from [200.206.140.111] ([200.206.140.111]:51976 "EHLO
        atenea.orcero.org") by humbolt.nl.linux.org with ESMTP
	id <S92277AbRCSNMA>; Mon, 19 Mar 2001 14:12:00 +0100
Received: from localhost (localhost.localdomain [127.0.0.1])
	by atenea.orcero.org (8.11.0/8.11.0) with ESMTP id f2JAA6X02399;
	Mon, 19 Mar 2001 07:10:06 -0300
Date:   Mon, 19 Mar 2001 07:10:06 -0300 (BRT)
From:   David Santo Orcero <irbis@orcero.org>
To:     Alan Robertson <alanr@unix.sh>
cc:     "David L. Nicol" <david@kasey.umkc.edu>,
        <linux-cluster@nl.linux.org>
Subject: Re: available resource declaration language(s)
In-Reply-To: <3AB58854.20112D06@unix.sh>
Message-ID: <Pine.LNX.4.30.0103190620130.2371-100000@atenea.orcero.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list



 Hello, Alan!

> I would suggest that we not give up quickly on the idea of having common
> components and APIs.  I believe that some implementations of components will

 I am not giving up on the idea; I think that some subsystems can be
common, and some cannot. I think that the APIs would be different, but
we can reuse lots of code between the different projects.

 Anyway, maybe the most important part is that the projects should not
be mutually incompatible. That means that somebody could enable the HA and
HP kernel options at the same time. At the moment, applying the patches of
two different projects simultaneously is a real headache -when it works-.

> And, just as you pointed out, my view isn't sufficient for everyone, and I'm
> sure you would agree that your view isn't sufficient for everyone either.

 I agree completely with this. My view is very focused on HP clusters
running CPU-bound processes; that is not the only use of clusters, but I
think that it is an important one. HA, network-bound and disk-bound
workloads are also very important, but CPU-bound HP is my own use, and the
one that I know best. I am doing some brainstorming so that all of us
-including me- get a broader view of the whole thing. When we have a clear
common objective, I would also like to contribute some code. :-) In the
end, all of this discussion will finish in coding. ;-)

> For example, we've talked about membership.  Most (all?) cluster
> applications need some form of membership services.  But different


 Maybe this is one of the points where we can find a common solution, and
reuse lots of code. IMHO our common goals are:

1) A system call to explicitly include a node in the cluster or exclude it.
2) The kernel structure holding the list of nodes in the cluster, with
metrics of the quality of each node.
3) The routines that manipulate that kernel structure.
4) For each node, the best IP address to which packets carrying kernel
cluster messages should be sent in order to reach that node. This is very
interesting on complex topologies; in my own experience it helps, because
I use several network cards attached to each control node, and there is
more than one way to reach some nodes of the cluster. I would like NFS
packets and kernel cluster messages to use different paths; at present
Mosix goes crazy with some topologies. Hacking the routing tables does not
work, because we want the kernel packets to take a different route. This
can also help you, since you can have two network cards per node, with
one of them used only for HA messages; that way you prevent a malicious
node from filling the channel and blocking HA messages.
5) Routines to drop a node that goes down; this includes freeing the
kernel resources attached to the remote node.
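
 A user-space Python model of items 1) to 5) might look like the sketch
below. It is only an illustration of the shape of the data: the class and
method names are invented here, a real implementation would be a kernel
structure written in C, and the "quality" value stands in for whatever
metrics are finally agreed on.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    cluster_addr: str      # address used for kernel cluster messages (item 4)
    quality: float = 1.0   # quality metric of the node (item 2)
    resources: list = field(default_factory=list)  # held on the node's behalf


class NodeTable:
    """Model of the shared node list and its manipulation routines (item 3)."""

    def __init__(self):
        self._nodes = {}

    def include(self, name, cluster_addr, quality=1.0):
        """Item 1: explicitly add a node to the cluster."""
        self._nodes[name] = Node(name, cluster_addr, quality)

    def hold(self, name, resource):
        """Record a resource that must be freed if the node disappears."""
        self._nodes[name].resources.append(resource)

    def route_for(self, name):
        """Item 4: address to use for cluster messages to this node."""
        return self._nodes[name].cluster_addr

    def best(self):
        """Name of the node with the highest quality metric."""
        return max(self._nodes.values(), key=lambda node: node.quality).name

    def drop(self, name):
        """Items 1 and 5: remove a node and return the resources to free."""
        node = self._nodes.pop(name, None)
        return [] if node is None else list(node.resources)
```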

 All these things can be common to all clustering projects, and we can take
any existing implementation and use it. IMHO there will be, anyway, things
that will be different and must be chosen when the kernel is configured:

1) The policy for detecting when a remote node must be dropped. We can
offer a choice between an HA policy -your code is perfect for that- and a
lazy policy, in which a node is dropped when you repeatedly try to contact
it to send it work and it does not answer.
2) Whether node transparency is allowed (Mosix code?). This includes a
common PID table -some Beowulf code exists for this, and PVM has some
great ideas-, process migration, and launching new processes on the least
loaded node. This is not good for HA, so it should be something that you
must activate via /proc.
3) The policy for forwarding IP packets to a node: whether or not to use
the IP forward address field. The IP forward field could also be enabled
via /proc, and the address could also be chosen via /proc.
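
 The lazy policy in point 1) can be sketched in a few lines of Python.
This is an assumption-laden illustration, not code from any project: the
class name and the threshold of three consecutive failures are invented,
and "answered" stands for whatever acknowledgement the real protocol
would use.

```python
class LazyDropPolicy:
    """Drop a node only after repeated failed attempts to send it work."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self._failures = {}

    def record(self, node, answered):
        """Record one contact attempt; return True if the node should be dropped."""
        if answered:
            # Any answer resets the count: only consecutive failures matter.
            self._failures[node] = 0
            return False
        self._failures[node] = self._failures.get(node, 0) + 1
        return self._failures[node] >= self.max_failures
```

 No extra traffic is generated: failures are noticed only when there is
work to send, which is why detection is slower than with an HA heartbeat.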


> (QOS).  Using QOS as an analogy, most applications need networking, but some
> need low latency, and others need predictable packet times, and others need
> high bandwidth.


 You are right. What about my comment before?

> In our case, your cluster might need a very low bandwidth solution, and mine
> might need quick discovery of dead nodes.  But - we both need to be able to
> tell what machines are in the cluster, and what ones are out of it.

 You are completely right. Anyway, lots of things -file locking, internal
clustering structures- can be common. The rest is a matter of choosing
CuQOS -cluster quality of service- ;-) parameters.

> > the least loaded node of the cluster. PVM has this, Mosix does not -in
> > Mosix the task can migrate after being launched, but its kernel part will
> > be executed on the launch node-.
>
> A cluster batch scheduler presumably could be a help here...

 Yes; but in point (2) above we must include this ability, transparent to
the user, giving SMP semantics to the whole kernel; this must be optional,
activated via /proc, because HA people would not like it; you would want
to know exactly where each piece of software is running.

> > 5) It must allow cheap hardware to run efficiently. This rules out
> > broadcast protocols, sorry. ;-)
>
> I would state this differently.  It must be possible to assemble a set of
> components that allows it to work efficiently on cheap hardware.  I would
> also argue that it must be possible to assemble a set of components that
> allow it to take advantage of clusters with more bandwidth.

 You are right. This could also be a CuQOS option. :-) I think that this
is included in the proposal above.

> There is also a class of applications (like weather prediction) where the
> system needs to be HA/HPC.  The US weather bureau wants to perform a set of

 Well, if you have enough bucks, it is possible to do HA and HPC at the
same time. It is only a matter of activating all the CuQOS options
above. ;-)

> > 6) Migrating running tasks is great. Mosix does a good job here, but not
> > a perfect one -sockets and shared memory code cannot migrate-.
>
> For HA, automatic process migration in the kernel is a hindrance - not a
> help.  It makes it difficult to figure out what has failed, and to restart


 You are completely right here. But for HP it is heaven; I must admit
that, for me, there was a before and an after of discovering process
migration. That is why I think that migration should be in the
hypothetical clustering kernel as a kernel option.

> It also makes performance unpredictable.  If you're short on cycles (like
> you describe yourself), then this can be a big problem.

 This is not exact. In Mosix you are sure that you are going to use the
resources of the system at the highest level. Prof. Amnon has a
mathematical proof of this. Yes, you cannot be sure that one particular
process will get some particular performance, but if the overall use is
the best possible, you have a high probability of running faster. ;-)


> Application-directed restarts are much harder, but often better performing
> as well.  If you have more human than technological resources, this might be

 Sorry, but no. I have tested it. In the worst case -a program created
artificially to make Mosix's work difficult- migration is nearly as good
as in the best case, and the difference is very small -some minutes on a
job of a little more than a week-. In real cases, Mosix clearly
outperforms PVM with a directed node assignment policy. (Tested with
several different molecular modeling packages, and several different QM
packages.)

 Hope that my opinion helps. :-)

 Yours:


 David

---------------------
http://www.orcero.org
  irbis@orcero.org
---------------------



From owner-linux-cluster@nl.linux.org Mon Mar 19 21:12:19 2001
Received: by humbolt.nl.linux.org id <S92325AbRCSULj>;
	Mon, 19 Mar 2001 21:11:39 +0100
Received: from hilbert.umkc.edu ([134.193.4.60]:11016 "HELO tesla.umkc.edu")
	by humbolt.nl.linux.org with SMTP id <S92322AbRCSUKo>;
	Mon, 19 Mar 2001 21:10:44 +0100
Received: (qmail 91540 invoked from network); 19 Mar 2001 20:09:28 -0000
Received: from nicol1.umkc.edu (HELO kasey.umkc.edu) (david@134.193.4.62)
  by hilbert.umkc.edu with SMTP; 19 Mar 2001 20:09:28 -0000
Message-ID: <3AB6676A.7EC89120@kasey.umkc.edu>
Date:   Mon, 19 Mar 2001 14:09:14 -0600
From:   "David L. Nicol" <david@kasey.umkc.edu>
Organization: University of Missouri - Kansas City   supercomputing infrastructure
X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.0 i586)
X-Accept-Language: en
MIME-Version: 1.0
To:     David Santo Orcero <irbis@orcero.org>
CC:     linux-cluster@nl.linux.org
Subject: Re: available resource declaration language(s)
References: <Pine.LNX.4.30.0103170610350.988-100000@atenea.orcero.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

David Santo Orcero wrote:
 
> >  For instance, it is easy to imagine a virtual ring architecture 
> 
>  This idea will work fine on little networks, but I strongly doubt that
> something like this scales well. If you have only one token, you have some
> technical problems.

The technical problems are not insurmountable.  Who said the virtual ring
must have only one token, or there must be only one virtual ring?
 
> 1) What will happen if a node hangs while it holds the packet? (it will
> hang the whole parallel process in the cluster; you will need a complex
> negotiation rule to create a new token, as in token ring networks)

like, if you need a resource and you have not received a token recently
you send out a token.  The implementation of a robust token ring architecture
is not only tractable, but well documented.
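
A minimal Python sketch of that regeneration rule follows. The class name
and the explicit "now" arguments are invented for illustration, and the
sketch covers only the decision to send a fresh token; it does nothing
about the duplicate-token case raised in the next point.

```python
class TokenWatcher:
    """Decide when a node should put a fresh token on the ring."""

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout
        self.last_token = now

    def token_seen(self, now):
        """Call whenever a token passes through this node."""
        self.last_token = now

    def should_regenerate(self, now, want_resource):
        """True if we need a resource and no token has arrived recently."""
        return want_resource and (now - self.last_token) > self.timeout
```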
 
> 2) What will happen if a malicious/erroneous node sends a second token
> to the network? (with a P2P protocol, a malicious/erroneous node has
> fewer possibilities to damage the network; anyway, the Mosix solution is
> also far from being safe)

if the route of the token can alter, or fall back on p2p methods, there can
be error recovery
 
> 3) If the network is really BIG -500, 600 nodes- the delay to get the
> token will be a problem. Somebody can say -buy a faster network-; but it
> is better not to force the user to spend more bucks because we use a worse
> solution.

so the network divides into teams of ten, based on subnet (I hope all 500
nodes are not on the same LAN segment) and a token is passed w/in each team,
and the team spontaneously elects a Decurion to present the team to the other
49 Decurions, via p2p or virtual token ring architecture at that level.

For one instance of full presentation among 10 nodes, with p2p there are 90
communications required (not counting the omphalotic case) and with VTR there
are ten communications required.
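
The arithmetic can be checked with a few lines of Python. The function
names are invented for this sketch, and the two-level figure is an
extrapolation of the teams-of-ten idea above: it counts a single pass
around each ring and assumes the node count divides evenly into teams.

```python
def p2p_messages(n):
    """Every node tells every other node directly: n * (n - 1) messages."""
    return n * (n - 1)


def ring_messages(n):
    """One token carrying everything makes one lap of the ring: n hops."""
    return n


def two_level_ring_messages(nodes, team_size):
    """One lap inside every team, plus one lap among the team leaders."""
    teams = nodes // team_size
    return teams * ring_messages(team_size) + ring_messages(teams)


print(p2p_messages(10))                 # 90
print(ring_messages(10))                # 10
print(p2p_messages(500))                # 249500
print(two_level_ring_messages(500, 10)) # 550
```

The counts say nothing about latency or about what happens when a hop
fails; they only compare the number of messages sent.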
 
>  Can the three problems be solved in a broadcast net, as proposed? No! If
> you have enough nodes, you will flood the network; that is why I am
> really sure that using broadcast features will not work.

Broadcast scales better than p2p -- it is p2p that floods a large network.
Rings and broadcasts both scale better than peer-to-peer, because there
is less redundant information streaming around. 

>  Personally I think that, independent of using the Mosix arch or some
> completely different thing, the architecture and the protocols must be
> peer-2-peer, and all negotiations must be distributed, using a random
> query as the method of beginning a new P2P negotiation; exactly as Mosix
> does. This will give us a scalable and failure-proof method to negotiate
> the sharing of resources and to exchange information with other nodes.

How about a peer-to-peer abstraction over an arbitrary discovery mechanism,
in which data has a variable TTL, similar to, or even built on top of, DNS?
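
To make "data with a variable TTL" concrete, here is a small Python
sketch of a record store whose entries expire the way cached DNS records
do. It is an illustration only: the class and method names are invented,
nothing here speaks the DNS protocol, and the caller supplies the current
time explicitly to keep the example deterministic.

```python
class TTLRecordStore:
    """Resource records that silently expire after their time-to-live."""

    def __init__(self):
        self._records = {}

    def put(self, key, value, ttl, now):
        """Store a record that stays valid for ttl seconds from now."""
        self._records[key] = (value, now + ttl)

    def get(self, key, now):
        """Return the record, or None if it is missing or has expired."""
        entry = self._records.get(key)
        if entry is None:
            return None
        value, expires = entry
        if now >= expires:
            # Stale data is discarded; the caller must rediscover it.
            del self._records[key]
            return None
        return value
```

Fast-changing facts such as load would get a short TTL, and slow-changing
ones such as CPU count a long one.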

type RR into http://www.rfc-editor.org/cgi-bin/rfcsearch.pl to see what there
is available in DNS Resource Records.  That kind of thing might be better as
an IETF standard than a Linux extension...  It's Monday and I can move mountains :)



From owner-linux-cluster@nl.linux.org Mon Mar 19 22:31:26 2001
Received: by humbolt.nl.linux.org id <S92318AbRCSVao>;
	Mon, 19 Mar 2001 22:30:44 +0100
Received: from hilbert.umkc.edu ([134.193.4.60]:26891 "HELO tesla.umkc.edu")
	by humbolt.nl.linux.org with SMTP id <S92316AbRCSVaZ>;
	Mon, 19 Mar 2001 22:30:25 +0100
Received: (qmail 92551 invoked from network); 19 Mar 2001 21:29:10 -0000
Received: from nicol1.umkc.edu (HELO kasey.umkc.edu) (david@134.193.4.62)
  by hilbert.umkc.edu with SMTP; 19 Mar 2001 21:29:10 -0000
Message-ID: <3AB67A17.FE5D52EF@kasey.umkc.edu>
Date:   Mon, 19 Mar 2001 15:28:56 -0600
From:   "David L. Nicol" <david@kasey.umkc.edu>
Organization: University of Missouri - Kansas City   supercomputing infrastructure
X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.0 i586)
X-Accept-Language: en
MIME-Version: 1.0
To:     Alan Robertson <alanr@unix.sh>
CC:     David Santo Orcero <irbis@orcero.org>,
        "linux-cluster@nl.linux.org" <linux-cluster@nl.linux.org>
Subject: RE Re: available resource declaration language(s)
References: <Pine.LNX.4.30.0103180716510.3147-100000@atenea.orcero.org> <3AB58854.20112D06@unix.sh>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list


> > 2) There should be an efficient method to send a task from the beginning to
> > the least loaded node of the cluster. PVM has this, Mosix does not -in
> > Mosix the task can migrate after being launched, but its kernel part will
> > be executed on the launch node-.
> A cluster batch scheduler presumably could be a help here...

mosix is compatible with the farm-to-least-loaded-node mechanisms, not
exclusive of them; their web pages used to proudly proclaim that mosix would
supercharge a distribution mechanism for even greater throughput.

> For HA, automatic process migration in the kernel is a hindrance - not a
> help.  It makes it difficult to figure out what has failed, and to restart
> it on still-working nodes.  MOSIX's current implementation is what I'd call
> low-availability: For a 2-node cluster, having one node fail can cause all
> processes on both nodes to die.  This is not good for HA.

Processes on a node that suddenly dies appear, in Mosix, to have been
given a kill -9 signal.  If your worker child processes can handle getting
kill-9ed without the whole house of cards collapsing, Mosix can provide
higher availability than the situation you describe.
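
One way to survive kill-9ed workers is to have the parent put the lost
task back on its queue. The Python below is a sketch under stated
assumptions, not Mosix code: run_task is a hypothetical callable that
raises ChildProcessError when its worker disappears, and the limit of
three attempts is arbitrary.

```python
from collections import deque


def run_with_requeue(tasks, run_task, max_attempts=3):
    """Run every task, re-queuing any whose worker was killed mid-run."""
    pending = deque((task, 1) for task in tasks)
    results, abandoned = {}, []
    while pending:
        task, attempt = pending.popleft()
        try:
            results[task] = run_task(task)
        except ChildProcessError:
            # The worker vanished (as after kill -9): retry, do not abort.
            if attempt < max_attempts:
                pending.append((task, attempt + 1))
            else:
                abandoned.append(task)
    return results, abandoned
```

This only works if tasks are safe to run twice, since a worker may die
after doing part of its job.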


-- 
                      David Nicol 816.235.1187 dnicol@cstp.umkc.edu
  He who says it's impossible shouldn't interrupt the one doing it.



From owner-linux-cluster@nl.linux.org Mon Mar 19 22:54:51 2001
Received: by humbolt.nl.linux.org id <S92314AbRCSVyS>;
	Mon, 19 Mar 2001 22:54:18 +0100
Received: from alanina.df.ibilce.unesp.br ([200.145.203.25]:41746 "EHLO
        andalucia2.orcero.org") by humbolt.nl.linux.org with ESMTP
	id <S92286AbRCSVxu>; Mon, 19 Mar 2001 22:53:50 +0100
Received: from localhost (irbis@localhost)
	by andalucia2.orcero.org (8.10.1/8.9.3) with ESMTP id f2JJScO05040;
	Mon, 19 Mar 2001 16:28:38 -0300
Date:   Mon, 19 Mar 2001 16:28:38 -0300 (BRT)
From:   David Santo Orcero <irbis@orcero.org>
X-Sender: irbis@andalucia2.trantor
To:     "David L. Nicol" <david@kasey.umkc.edu>
cc:     linux-cluster@nl.linux.org
Subject: Re: available resource declaration language(s)
In-Reply-To: <3AB6676A.7EC89120@kasey.umkc.edu>
Message-ID: <Pine.LNX.4.21.0103191555001.4694-100000@andalucia2.trantor>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list


 Hello, David!

> > something like this scales well. If you have only one token, you have some
> > technical problems.
> 
> The technical problems are not insurmountable.  Who said the virtual ring
> must have only one token, or there must be only one virtual ring?

 The problem is at the physical layer and the link layer; I was talking
about collisions. Using a virtual ring is not going to help you much,
because... err... it is virtual. I tried this for communication between
subpopulations in a GA -once again, that is software at user level, but it
makes no difference to the physical layer of the network- and you get that
kind of problem. And with two tokens things could be worse if one token
runs faster than the other. Yes, it is possible: the two tokens travel at
the same velocity on the medium, but you also have the propagation delay
in the nodes. If a node is busy and delays forwarding the packet, you
will have a collision, and the two tokens will be lost.

> like, if you need a resource and you have not received a token recently
> you send out a token.  The implementation of a robust token ring architecture
> is not only tractable, but well documented.

 Yes, I know. They are so well documented that I had to study those
protocols for my degree, and they are not trivial. You will have to
implement nearly the full 802.5 over UDP to make this work, and that is
not an easy job. Please download and read the full IEEE 802.5 before
continuing this discussion.

> if the route of the token can alter, or fall back on p2p methods, there can
> be error recovery

 It is not so easy. Read a token ring protocol.

> so the network divides into teams of ten, based on subnet (I hope all 500
> nodes are not on the same LAN segment) and a token is passed w/in each team,
> and the team spontaneously elects a Decurion to present the team to the other
> 49 Decurions, via p2p or virtual token ring architecture at that
> level.

 Yes, I suppose that nobody has 500 nodes on the same LAN segment,
basically because it is not possible on most networks -there are limits
on the maximum length of an Ethernet cable, and a minimum distance between
nodes; this is not theoretical, test it-.

 There is only one problem with your solution: you cannot
say to a user: "hey, guy, if you wanna use linux for clustering you gotta
use this topology". Basically, because the guy who is installing Linux
does not always have enough power to recable all the non-dedicated
nodes.

 Anyway, you work with Legion, don't you?

> Broadcast scales better than p2p -- it is p2p that floods a large network.

 :-?

 Where did you get that piece of information? p2p really floods a
network? I think that you are talking about Gnutella. Gnutella does not
scale well, but that is not a problem of the p2p part. The problem is that
it broadcasts the information about the songs. What is worse, it does not
really use the IP broadcast facilities; it uses a flooding algorithm to
broadcast the song queries. And if you use a flooding algorithm, you
flood the network.

 p2p protocols are those that involve a connection between two peers only.
That is why they scale well: because, err... they involve only two peers.
They have problems, which have been raised here and which make them
unsuitable for HA, but I cannot understand where the flooding problem is.
Maybe you have heard something about Gnutella, but that is not a p2p
problem; the problem is using a flooding algorithm to transmit
information.
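
 To show how much traffic a flooding algorithm generates, here is a small
Python simulation. It is a generic textbook-style flood, not Gnutella's
actual protocol: each node forwards a query, the first time it sees it,
to every neighbour except the one it came from. The function name and the
adjacency-dictionary format are invented for the sketch.

```python
from collections import deque


def flood_message_count(graph, origin):
    """Count the messages sent when origin floods one query through graph.

    graph maps each node to the list of its neighbours.
    """
    seen = {origin}
    queue = deque([(origin, None)])
    sent = 0
    while queue:
        node, sender = queue.popleft()
        for neighbour in graph[node]:
            if neighbour == sender:
                continue  # never send the query back where it came from
            sent += 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, node))
    return sent


# Five fully connected nodes: one flooded query costs 16 messages.
full = {n: [m for m in range(5) if m != n] for n in range(5)}
print(flood_message_count(full, 0))  # 16
```

 A direct exchange between two peers costs one query and one answer,
whatever the size of the network.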

> Rings and broadcasts both scale better than peer-to-peer, because there
> is less redundant information streaming around. 


 8-O


 I am not going to discuss this. Test it. Then we discuss.

 Yours:

David


 
---------------------
http://www.orcero.org
  irbis@orcero.org
---------------------



From owner-linux-cluster@nl.linux.org Mon Mar 19 23:04:40 2001
Received: by humbolt.nl.linux.org id <S92200AbRCSWEA>;
	Mon, 19 Mar 2001 23:04:00 +0100
Received: from emcmail.lss.emc.com ([168.159.48.78]:22508 "EHLO emc.com")
	by humbolt.nl.linux.org with ESMTP id <S92166AbRCSWDl>;
	Mon, 19 Mar 2001 23:03:41 +0100
Received: from emc.com (lub1012.lss.emc.com [168.159.39.12])
	by emc.com (8.10.1/8.10.1) with ESMTP id f2JLwXx15686;
	Mon, 19 Mar 2001 16:58:33 -0500 (EST)
Message-ID: <3AB68113.8030809@emc.com>
Date:   Mon, 19 Mar 2001 16:58:43 -0500
From:   Ric Wheeler <ric@emc.com>
Reply-To: ric@emc.com
User-Agent: Mozilla/5.0 (X11; U; Linux 2.2.10 i686; en-US; m18) Gecko/20010131 Netscape6/6.01
X-Accept-Language: en
MIME-Version: 1.0
To:     "David L. Nicol" <david@kasey.umkc.edu>
CC:     Alan Robertson <alanr@unix.sh>,
        David Santo Orcero <irbis@orcero.org>,
        "linux-cluster@nl.linux.org" <linux-cluster@nl.linux.org>
Subject: Re: RE Re: available resource declaration language(s)
References: <Pine.LNX.4.30.0103180716510.3147-100000@atenea.orcero.org> <3AB58854.20112D06@unix.sh> <3AB67A17.FE5D52EF@kasey.umkc.edu>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list


In MOSIX, you can do what we do with our 80 node cluster, which is to
use the load gathering infrastructure to select the least loaded node
and create the process in place on that node (mexec() does this).  Other
clusters that I have worked on call this type of thing "remote spawn"
(instead of fork and migrate).
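
The "select the least loaded node, then create the process there" idea can
be sketched as follows. This is only an illustration under assumptions: it
does not call or reproduce mexec() (its signature is not given in this
thread), the load table and the `remote_spawn` helper are hypothetical, and
nothing is actually started on a remote machine.

```python
def least_loaded(loads):
    """Return the name of the node with the smallest reported load."""
    return min(loads, key=loads.get)


def remote_spawn(loads, command, cost=1.0):
    """Choose a node before the process exists and record the placement.

    A real system would start `command` on the chosen node; this sketch
    only bumps the load table so that repeated calls spread out.
    """
    node = least_loaded(loads)
    loads[node] += cost
    return node, command


# Hypothetical load figures gathered from three nodes.
loads = {"node1": 0.9, "node2": 0.2, "node3": 0.5}
placements = [remote_spawn(loads, ["make"])[0] for _ in range(3)]
print(placements)
```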

ric

David L. Nicol wrote:

>>> 2) It should be a efficient method to send a task from the beginning to
>>> the least loaded node  of the cluster. PVM have this, Mosix have not -in
>>> Mosix the task can migrate after being launched, but its kernel part will
>>> be executed on the launch node-.
>> 
>> A cluster batch scheduler presumably could be a help here...
> 
> 
> mosix is compatible with the farm-to-least-loaded-node mechanisms, not
> exclusive of them; their web pages used to proudly proclaim that mosix would
> supercharge a distribution mechanism for even greater throughput.
> 







From owner-linux-cluster@nl.linux.org Mon Mar 19 23:10:45 2001
Received: by humbolt.nl.linux.org id <S92237AbRCSWKO>;
	Mon, 19 Mar 2001 23:10:14 +0100
Received: from gw.xkey.com ([206.86.100.52]:13838 "EHLO happy.xkey.com")
	by humbolt.nl.linux.org with ESMTP id <S92200AbRCSWJ4>;
	Mon, 19 Mar 2001 23:09:56 +0100
Received: (from smtp@localhost) by happy.xkey.com
	id OAA10583 for <linux-cluster@nl.linux.org>; Mon, 19 Mar 2001 14:09:54 -0800
Received: from ip30.frederick.md.pub-ip.psi.net(38.14.105.30) by happy.xkey.com via smtp (V1.3)
	id sma010579; Mon Mar 19 14:09:52 2001
Received: (from lindahl@localhost)
	by localhost.hpti.com (8.11.0/8.11.0) id f2JMAYS03106
	for linux-cluster@nl.linux.org; Mon, 19 Mar 2001 17:10:34 -0500
X-Authentication-Warning: localhost.hpti.com: lindahl set sender to lindahl@conservativecomputer.com using -f
Date:   Mon, 19 Mar 2001 17:10:34 -0500
From:   Greg Lindahl <lindahl@conservativecomputer.com>
To:     linux-cluster@nl.linux.org
Subject: Re: available resource declaration language(s)
Message-ID: <20010319171034.A3089@wumpus.hpti.com>
Mail-Followup-To: linux-cluster@nl.linux.org
References: <Pine.LNX.4.30.0103170610350.988-100000@atenea.orcero.org> <3AB6676A.7EC89120@kasey.umkc.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <3AB6676A.7EC89120@kasey.umkc.edu>; from david@kasey.umkc.edu on Mon, Mar 19, 2001 at 02:09:14PM -0600
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

On Mon, Mar 19, 2001 at 02:09:14PM -0600, David L. Nicol wrote:

> Broadcast scales better than p2p -- it is p2p that floods a large network.

Generalizations are always wrong. Your statement is not true for
networks like Myrinet that have to fake broadcasts, and it's also not
necessarily true for highly switched ethernet networks -- there it
depends on the details of what's broadcast. N tiny broadcast packets
take up a lot more bandwidth than 1 larger packet containing all the
info. And then there's congestion...
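
A quick worked calculation of the "N tiny packets versus 1 larger packet"
point, using the standard Ethernet framing figures (8 bytes preamble and
start delimiter, 14 bytes header, 4 bytes FCS, 12 bytes inter-frame gap,
payload padded to at least 46 bytes). The item count and size are made-up
numbers, and IP/UDP headers are ignored, so treat it as a rough sketch:

```python
# Standard Ethernet framing figures, in bytes.
PREAMBLE_SFD = 8
HEADER = 14
FCS = 4
INTERFRAME_GAP = 12
MIN_PAYLOAD = 46
MAX_PAYLOAD = 1500


def wire_bytes(payload):
    """Bytes of link time used by one frame carrying `payload` bytes."""
    if payload > MAX_PAYLOAD:
        raise ValueError("payload does not fit in one frame")
    padded = max(payload, MIN_PAYLOAD)   # short frames are padded
    return PREAMBLE_SFD + HEADER + padded + FCS + INTERFRAME_GAP


def compare(n_items, item_size):
    """Wire cost of n small frames versus one frame holding all items."""
    separate = n_items * wire_bytes(item_size)
    combined = wire_bytes(n_items * item_size)
    return separate, combined


# Hypothetical example: 64 status items of 16 bytes each.
print(compare(64, 16))
```

With these assumed numbers the 64 separate frames occupy roughly five times
the link time of the single combined frame.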

-- g


From owner-linux-cluster@nl.linux.org Mon Mar 19 23:44:44 2001
Received: by humbolt.nl.linux.org id <S92192AbRCSWoE>;
	Mon, 19 Mar 2001 23:44:04 +0100
Received: from alanina.df.ibilce.unesp.br ([200.145.203.25]:5651 "EHLO
        andalucia2.orcero.org") by humbolt.nl.linux.org with ESMTP
	id <S92166AbRCSWng>; Mon, 19 Mar 2001 23:43:36 +0100
Received: from localhost (irbis@localhost)
	by andalucia2.orcero.org (8.10.1/8.9.3) with ESMTP id f2JKID105558;
	Mon, 19 Mar 2001 17:18:13 -0300
Date:   Mon, 19 Mar 2001 17:18:13 -0300 (BRT)
From:   David Santo Orcero <irbis@orcero.org>
X-Sender: irbis@andalucia2.trantor
To:     Ric Wheeler <ric@emc.com>
cc:     "David L. Nicol" <david@kasey.umkc.edu>,
        Alan Robertson <alanr@unix.sh>,
        "linux-cluster@nl.linux.org" <linux-cluster@nl.linux.org>
Subject: Re: RE Re: available resource declaration language(s)
In-Reply-To: <3AB68113.8030809@emc.com>
Message-ID: <Pine.LNX.4.21.0103191711410.4694-100000@andalucia2.trantor>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list



 Hello, Ric!

> 
> In MOSIX, you can do what we do with our 80 node cluster which is to
> use the load gathering infrastructure to select the least loaded node 
> and create it in place on that node (mexec() does this).  Other
> clusters that I have worked on call this type of thing "remote spawn"
> (instead of fork and migrate).

 Mosix and the things that you do are complementary. Mosix can't migrate a
process BEFORE executing it, and you can't migrate after executing. This
means that if you do 100 forks in Mosix, you will have the 100 processes
running at their home node when they run at kernel level. On the other
hand, if you have 20 processes on 10 nodes and on 5 nodes all processes
have finished, you cannot migrate the 10 running processes that overload 5
nodes to the 5 idle nodes. That is what mosix does well. I think that the
two approaches are complementary, because each one does things that the
other does not do.

 Yours:

David


---------------------
http://www.orcero.org
  irbis@orcero.org
---------------------



From owner-linux-cluster@nl.linux.org Tue Mar 20 04:23:55 2001
Received: by humbolt.nl.linux.org id <S92178AbRCTDX1>;
	Tue, 20 Mar 2001 04:23:27 +0100
Received: from perninha.conectiva.com.br ([200.250.58.156]:63759 "EHLO
        postfix.conectiva.com.br") by humbolt.nl.linux.org with ESMTP
	id <S92166AbRCTDXJ>; Tue, 20 Mar 2001 04:23:09 +0100
Received: from burns.conectiva (burns.conectiva [10.0.0.4])
	by postfix.conectiva.com.br (Postfix) with SMTP id 3147416BDB
	for <linux-cluster@nl.linux.org>; Tue, 20 Mar 2001 00:23:06 -0300 (EST)
Received: (qmail 31698 invoked by uid 0); 20 Mar 2001 03:22:25 -0000
Received: from dial10.ras.conectiva (HELO imladris.rielhome.conectiva) (root@10.0.8.10)
  by burns.conectiva with SMTP; 20 Mar 2001 03:22:25 -0000
Received: from localhost (IDENT:riel@localhost [127.0.0.1])
	by imladris.rielhome.conectiva (8.11.2/8.11.2) with ESMTP id f2K36OR23642;
	Tue, 20 Mar 2001 00:06:24 -0300
Date:   Tue, 20 Mar 2001 00:06:24 -0300 (BRST)
From:   Rik van Riel <riel@conectiva.com.br>
X-Sender: riel@imladris.rielhome.conectiva
To:     David Santo Orcero <irbis@orcero.org>
Cc:     Alan Robertson <alanr@unix.sh>,
        "David L. Nicol" <david@kasey.umkc.edu>, linux-cluster@nl.linux.org
Subject: Re: available resource declaration language(s)
In-Reply-To: <Pine.LNX.4.30.0103190620130.2371-100000@atenea.orcero.org>
Message-ID: <Pine.LNX.4.21.0103200004460.13050-100000@imladris.rielhome.conectiva>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

On Mon, 19 Mar 2001, David Santo Orcero wrote:

> > I would suggest that we not give up quickly on the idea of having common
> > components and APIs.  I believe that some implementations of components will
> 
>  It is not giving up the idea; I thing that there is some subsystems
> that may be common, and some not. I thing that the API would be
> different, but we can rehuse losts of code between the different
> projects.

That would only be true if you have a monolithic piece of
clustering software.

When you actually "plug in" every part on a component basis,
the APIs used between components can be equal for both types
of clusters...

(eg.  "grab lock" or "send message")
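
A minimal sketch of what "plugging in" components behind common APIs could
look like. The interface names, method signatures and the toy
implementations are all assumptions made for illustration; none of them
come from heartbeat or any existing cluster project.

```python
from abc import ABC, abstractmethod


class LockManager(ABC):
    @abstractmethod
    def grab_lock(self, name: str) -> bool: ...

    @abstractmethod
    def release_lock(self, name: str) -> None: ...


class Messenger(ABC):
    @abstractmethod
    def send_message(self, node: str, payload: bytes) -> None: ...


class LocalLockManager(LockManager):
    """Trivial single-process implementation, for illustration only."""

    def __init__(self):
        self._held = set()

    def grab_lock(self, name):
        if name in self._held:
            return False
        self._held.add(name)
        return True

    def release_lock(self, name):
        self._held.discard(name)


class LoopbackMessenger(Messenger):
    """Keeps messages in a list instead of sending them anywhere."""

    def __init__(self):
        self.outbox = []

    def send_message(self, node, payload):
        self.outbox.append((node, payload))


def take_over(locks: LockManager, net: Messenger, resource: str, peer: str) -> bool:
    """Cluster logic written only against the two interfaces."""
    if not locks.grab_lock(resource):
        return False
    net.send_message(peer, b"took over " + resource.encode())
    return True
```

The point of the sketch is that `take_over` does not care which lock
manager or transport it is given, so an HA cluster and an HP cluster could
supply different implementations behind the same calls.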

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/



From owner-linux-cluster@nl.linux.org Tue Mar 20 05:32:51 2001
Received: by humbolt.nl.linux.org id <S92252AbRCTEcW>;
	Tue, 20 Mar 2001 05:32:22 +0100
Received: from usw-dsl-225.102.denco.rmi.net ([166.93.225.102]:9715 "EHLO
        laptop.linux-ha.org") by humbolt.nl.linux.org with ESMTP
	id <S92237AbRCTEcC>; Tue, 20 Mar 2001 05:32:02 +0100
Received: from unix.sh (unknown [127.0.0.1])
	by laptop.linux-ha.org (Postfix on SuSE Linux 7.0 (i386)) with ESMTP
	id 9342317930; Mon, 19 Mar 2001 21:31:45 -0700 (MST)
Message-ID: <3AB6DD31.D6F5B9C9@unix.sh>
Date:   Mon, 19 Mar 2001 21:31:45 -0700
From:   Alan Robertson <alanr@unix.sh>
Organization: Linux-HA
X-Mailer: Mozilla 4.73 [en] (X11; I; Linux 2.2.16 i686)
X-Accept-Language: en
MIME-Version: 1.0
To:     Rik van Riel <riel@conectiva.com.br>
Cc:     David Santo Orcero <irbis@orcero.org>,
        "David L. Nicol" <david@kasey.umkc.edu>, linux-cluster@nl.linux.org
Subject: Re: available resource declaration language(s)
References: <Pine.LNX.4.21.0103200004460.13050-100000@imladris.rielhome.conectiva>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

Rik van Riel wrote:
> 
> On Mon, 19 Mar 2001, David Santo Orcero wrote:
> 
> > > I would suggest that we not give up quickly on the idea of having common
> > > components and APIs.  I believe that some implementations of components will
> >
> >  It is not giving up the idea; I thing that there is some subsystems
> > that may be common, and some not. I thing that the API would be
> > different, but we can rehuse losts of code between the different
> > projects.
> 
> That would only be true if you have a monolithic piece of
> clustering software.
> 
> When you actually "plug in" every part on a component basis,
> the APIs used between components can be equal for both types
> of clusters...
> 
> (eg.  "grab lock" or "send message")

Rik!  You stole my thunder!	 ;-)

What we (wombat and I) have had in mind is a clustering framework into
which various components can be plugged.  This is an extension of the
architecture which heartbeat currently uses internally.

[I had talked to Rik about it on IRC].

	-- Alan Robertson
	   alanr@unix.sh


From owner-linux-cluster@nl.linux.org Tue Mar 20 05:42:55 2001
Received: by humbolt.nl.linux.org id <S92230AbRCTEmZ>;
	Tue, 20 Mar 2001 05:42:25 +0100
Received: from cs.huji.ac.il ([132.65.16.10]:21749 "EHLO cs.huji.ac.il")
	by humbolt.nl.linux.org with ESMTP id <S92192AbRCTElw>;
	Tue, 20 Mar 2001 05:41:52 +0100
Received: from mos218.cs.huji.ac.il ([132.65.173.218] ident=mail)
	by cs.huji.ac.il with esmtp (Exim 3.20 #1)
	id 14fDxh-0000gI-00
	for linux-cluster@nl.linux.org; Tue, 20 Mar 2001 06:41:37 +0200
Received: from amnons by mos218.cs.huji.ac.il with local (Exim 3.16 #1)
	id 14fDxg-0006QQ-00
	for linux-cluster@nl.linux.org; Tue, 20 Mar 2001 06:41:36 +0200
Subject: MOSIX objectives
To:     linux-cluster@nl.linux.org
Date:   Tue, 20 Mar 2001 06:41:36 +0200 (IST)
X-Mailer: ELM [version 2.5 PL3]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-Id: <E14fDxg-0006QQ-00@mos218.cs.huji.ac.il>
From:   Amnon Shiloh <amnons@cs.huji.ac.il>
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

Let me present MOSIX's view of clustering objectives:

The early versions of MOSIX, many years ago, consisted of a true
Single-System Image: although each process had its initial root,
there was one file-system, connecting the roots via the "/.../" directory,
where each node's file-systems could be accessed via "/.../m{node}/".
The "stat"/"fstat" system-calls were extended to provide the node-number
(as well as the device and inode numbers), there was no home-node, and by
using "chroot" one could completely disassociate from any particular
node.  The process-IDs were 31 bits, to allow for 15-bit node-numbers.
The "sync" system-call presented some problem as well.
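
As an illustration of how a 15-bit node number can fit inside a 31-bit
process-ID, here is a small packing sketch. The layout (node number in the
high bits, a 16-bit per-node part in the low bits) is an assumption made
for the example; the text above only states the two widths.

```python
NODE_BITS = 15
LOCAL_BITS = 16          # assumption: the remaining 31 - 15 bits
LOCAL_MASK = (1 << LOCAL_BITS) - 1


def make_pid(node, local):
    """Combine a node number and a per-node id into one 31-bit value."""
    if not 0 <= node < (1 << NODE_BITS):
        raise ValueError("node number does not fit in 15 bits")
    if not 0 <= local <= LOCAL_MASK:
        raise ValueError("local id does not fit in 16 bits")
    return (node << LOCAL_BITS) | local


def split_pid(pid):
    """Recover (node, local) from a combined process-ID."""
    return pid >> LOCAL_BITS, pid & LOCAL_MASK


pid = make_pid(3, 1234)
print(pid, split_pid(pid))
```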

The above design required massive changes to the Unix kernel (about 60%
of the kernel code was modified) as well as some changes to user-mode
code, then the whole user-mode source-tree had to be recompiled.

This was still possible in the days of Sys-V.2, where all the utilities
took about 10MB. Today nobody would imagine doing this again for all the
Terabytes of Linux user-mode code.

Yes - it is possible to design a nice Unix-like SSI operating-system,
but it wouldn't be Linux, and someone will have to review and possibly
modify all user-land applications. The effect on the kernel would also
be far more massive, and not being able to rely on main-stream drivers,
someone will have to follow up with the hardware-drivers of about 1000
different devices, different buses, chip-sets, memory, APM, IRQs, etc.
Unless one is happy to run their applications on Sys V.2, only Linux
can carry this weight.

In 1992, we "relaxed" the idea of SSI in favor of remaining 100% compatible
(source and binary) with the underlying operating-system. The "new MOSIX"
is based on the "home-model" in which all the user's processes are connected
to and are seen as if they run on the home node. The result is a new MOSIX
kernel architecture which requires modifications of no more than 5% of the
kernel. It attempts to provide SMP functionalities in a scalable cluster.
In the first stage we developed a set of algorithms for efficient management
of the cluster-wide resources by process migration. Other projects that we
intend to develop include DSM and migratable sockets.

As for High-Availability, we think that it is a good idea - but not the
responsibility of the kernel.  It is best done in user-mode, but if
someone comes up with a good user-mode scheme that only requires a bit
of kernel assistance, we will happily try to help provide that support.

Similarly, we look at PVM, MPI and Beowulf as good tools for those
who care to invest more in programming.  They may provide improved
I/O (although MOSIX is closing the gap with DFSA) and initial-assignment
of very short-lived processes.  This is not in contradiction with
dynamic process-migration taking care of further adjustments.  In
fact, initial-assignment can be made to a set of fully-fledged nodes,
while the load can then be distributed to include a larger number of
diskless nodes.

There are many ways to use MOSIX: the kernel provides many flexible
alternatives providing higher-level schemes with automatic, semi-automatic
or manual process-migrations.

Amnon Shiloh -- the HUJI MOSIX group.



From owner-linux-cluster@nl.linux.org Tue Mar 20 05:44:48 2001
Received: by humbolt.nl.linux.org id <S92252AbRCTEoS>;
	Tue, 20 Mar 2001 05:44:18 +0100
Received: from usw-dsl-225.102.denco.rmi.net ([166.93.225.102]:30963 "EHLO
        laptop.linux-ha.org") by humbolt.nl.linux.org with ESMTP
	id <S92248AbRCTEn6>; Tue, 20 Mar 2001 05:43:58 +0100
Received: from unix.sh (unknown [127.0.0.1])
	by laptop.linux-ha.org (Postfix on SuSE Linux 7.0 (i386)) with ESMTP
	id 381F717947; Mon, 19 Mar 2001 21:43:43 -0700 (MST)
Message-ID: <3AB6DFFF.D3CDDFB4@unix.sh>
Date:   Mon, 19 Mar 2001 21:43:43 -0700
From:   Alan Robertson <alanr@unix.sh>
Organization: Linux-HA
X-Mailer: Mozilla 4.73 [en] (X11; I; Linux 2.2.16 i686)
X-Accept-Language: en
MIME-Version: 1.0
To:     Rik van Riel <riel@conectiva.com.br>
Cc:     David Santo Orcero <irbis@orcero.org>,
        "David L. Nicol" <david@kasey.umkc.edu>, linux-cluster@nl.linux.org
Subject: Re: available resource declaration language(s)
References: <Pine.LNX.4.21.0103181738490.13050-100000@imladris.rielhome.conectiva>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

Rik van Riel wrote:
> 
> On Sun, 18 Mar 2001, David Santo Orcero wrote:
> 
> >  I am begining to thing that we will need two different APIs, that
> > will be different kernel options. One is HA ckusters and other HP
> > clusters. They are SO different -you are showing me this- that I find
> > quite dificult to do a HP+HA API.
> 
> No need to have a common API.  The important part (if we want
> to avoid duplicate work) is sharing _components_. Whether they
> are heartbeat, lock manager or data sharing/replication mechanisms
> doesn't matter as long as we manage to avoid too much duplication
> of effort.

Rik and I talked about this on IRC...
Code sharing is greatly aided by common interfaces.  One can always do such
a thing without it, but it's always a mess, and rarely the kind of thing 
that an OSS developer would willingly do.

> The main point in this list is to give the clustering projects
> a forum to share each other's components, instead of having each
> of the projects reinvent their wheels in various incompatible
> ways ;)

This was, of course, my point as well.  Except that I'd maintain that
creating APIs that allow you to access various implementations in a common
way is a GoodThing(tm).

> Btw, a #clustering channel on irc.openprojects.net has been
> started, it could be a nice place to hang out ..

	-- Alan Robertson
	   alanr@unix.sh


From owner-linux-cluster@nl.linux.org Thu Mar 22 09:26:14 2001
Received: by humbolt.nl.linux.org id <S92194AbRCVIZk>;
	Thu, 22 Mar 2001 09:25:40 +0100
Received: from ah-svl-office.arrowhead.se ([195.22.75.66]:26357 "EHLO
        beregond.AD.ARROWHEAD.SE") by humbolt.nl.linux.org with ESMTP
	id <S92166AbRCVIZM>; Thu, 22 Mar 2001 09:25:12 +0100
Received: from arrowhead.se (IDENT:joh@beregond [127.0.0.1])
	by beregond.AD.ARROWHEAD.SE (8.11.0/8.11.0) with ESMTP id f2M8SnV06201
	for <linux-cluster@nl.linux.org>; Thu, 22 Mar 2001 09:28:50 +0100
Message-ID: <3AB9B7C1.E81CFBD7@arrowhead.se>
Date:   Thu, 22 Mar 2001 09:28:49 +0100
From:   Josef =?iso-8859-1?Q?H=F6=F6k?= <josef.hook@arrowhead.se>
Organization: Arrowhead
X-Mailer: Mozilla 4.75 [en] (X11; U; Linux 2.4.1-XFS i686)
X-Accept-Language: en
MIME-Version: 1.0
To:     linux-cluster@nl.linux.org
Subject: Cluster question
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

Anyone have an idea on which cluster system would be best suited for
running applications like OpenLDAP and Sendmail..


/joh



From owner-linux-cluster@nl.linux.org Thu Mar 22 12:01:30 2001
Received: by humbolt.nl.linux.org id <S92166AbRCVLAs>;
	Thu, 22 Mar 2001 12:00:48 +0100
Received: from sarajevo.idealx.com ([213.41.87.90]:51015 "EHLO
        sarajevo.idealx.com") by humbolt.nl.linux.org with ESMTP
	id <S92164AbRCVLAY>; Thu, 22 Mar 2001 12:00:24 +0100
Received: from calvin.UUCP (uucp@localhost)
	by sarajevo.idealx.com (8.10.1/8.10.1) with UUCP id f2MAx9N24740
	for linux-cluster@nl.linux.org; Thu, 22 Mar 2001 11:59:09 +0100 (CET)
Received: from tsm by calvin.ird.IDEALX.com with local (Exim 3.22 #1 (Debian))
	id 14g2nw-0000mK-00
	for <linux-cluster@nl.linux.org>; Thu, 22 Mar 2001 11:58:56 +0100
Date:   Thu, 22 Mar 2001 11:58:56 +0100
From:   Thierry Mallard <thierry.mallard@IDEALX.com>
To:     linux-cluster@nl.linux.org
Subject: Re: Cluster question
Message-ID: <20010322115856.D2405@IDEALX.com>
References: <3AB9B7C1.E81CFBD7@arrowhead.se>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
User-Agent: Mutt/1.3.15i
In-Reply-To: <3AB9B7C1.E81CFBD7@arrowhead.se>; from josef.hook@arrowhead.se on Thu, Mar 22, 2001 at 09:28:49AM +0100
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

On Thu, Mar 22, 2001 at 09:28:49AM +0100, Josef Höök wrote:
> Anyone have an idea on which cluster system would be best suited for
> running applications like OpenLDAP and Sendmail..

I think it depends on what part of clustering you mean : 

- high availability (don't need to deal with concurrency)
- load balancing 
- parallelized processing

Since it's my first post on the list, I may be off topic because possibly this
list is dedicated to, say, load-balancing (I don't know ;-)

Best regards,
-- 
Thierry Mallard                    | GnuPG key on pgp.ai.mit.edu
http://IDEALX.com                  | key 0xA3D021CB
http://thierry.mallard.com         | 




From owner-linux-cluster@nl.linux.org Thu Mar 22 13:01:10 2001
Received: by humbolt.nl.linux.org id <S92202AbRCVMAn>;
	Thu, 22 Mar 2001 13:00:43 +0100
Received: from ah-svl-office.arrowhead.se ([195.22.75.66]:27374 "EHLO
        beregond.AD.ARROWHEAD.SE") by humbolt.nl.linux.org with ESMTP
	id <S92175AbRCVMAS>; Thu, 22 Mar 2001 13:00:18 +0100
Received: from arrowhead.se (IDENT:joh@beregond [127.0.0.1])
	by beregond.AD.ARROWHEAD.SE (8.11.0/8.11.0) with ESMTP id f2MC3sV15998
	for <linux-cluster@nl.linux.org>; Thu, 22 Mar 2001 13:03:55 +0100
Message-ID: <3AB9EA2A.33EBDCE2@arrowhead.se>
Date:   Thu, 22 Mar 2001 13:03:54 +0100
From:   Josef =?iso-8859-1?Q?H=F6=F6k?= <josef.hook@arrowhead.se>
Organization: Arrowhead
X-Mailer: Mozilla 4.75 [en] (X11; U; Linux 2.4.1-XFS i686)
X-Accept-Language: en
MIME-Version: 1.0
To:     linux-cluster@nl.linux.org
Subject: Re: Cluster question
References: <3AB9B7C1.E81CFBD7@arrowhead.se> <20010322115856.D2405@IDEALX.com>
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

Thierry Mallard wrote:

> On Thu, Mar 22, 2001 at 09:28:49AM +0100, Josef Höök wrote:
> > Anyone have an idea on which cluster system would be best suited for
> > running applications like OpenLDAP and Sendmail..
>
> I think it depends on what part of clustering you mean :
>
> - high availability (don't need to deal with concurrency)

>
> - load balancing

>
> - parallelized processing
>

It would be parallelized processing that I'm interested in.
Any ideas on which cluster system would be best suited for that? And
I don't want to rewrite OpenLDAP for MPI stuff ..

:)

regards /joh



From owner-linux-cluster@nl.linux.org Fri Mar 23 23:47:44 2001
Received: by humbolt.nl.linux.org id <S92210AbRCWWrO>;
	Fri, 23 Mar 2001 23:47:14 +0100
Received: from 49-MADR-X46.libre.retevision.es ([62.83.25.49]:19438 "EHLO
        carlos.mosix.net") by humbolt.nl.linux.org with ESMTP
	id <S92202AbRCWWqk>; Fri, 23 Mar 2001 23:46:40 +0100
Received: from carlos by carlos.mosix.net with local (Exim 3.12 #1 (Debian))
	id 14gaMi-0000At-00
	for <linux-cluster@nl.linux.org>; Fri, 23 Mar 2001 23:49:04 +0100
Subject: global vision of the system
From:   carlos manzanedo <cvsrep@wanadoo.es>
To:     linux-cluster@nl.linux.org
Content-Type: text/plain
X-Evolution: 00000003-0000
Mime-Version: 1.0
X-Mailer: Evolution 0.5.1 (Developer Preview)
Date:   23 Mar 2001 21:49:03 -0100
Message-Id: <E14gaMi-0000At-00@carlos.mosix.net>
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list

	CARLOS AND WEWE

It is difficult for me to have a global vision of the system at the
current phase of development. Correct me if I'm wrong, please.
We have two behaviours; we are looking for the things that join those
parts of the system, and for now we have this:

 layer 1:
	heartbeat: I don't know a lot about it, I just began to read about
	it this week, but from what I have seen it is perfect for this
	layer.
	It will send keepalive messages (HA and HP) and it will manage
	the transmission of data to do whatever you want (dfs, HA, HP..)
	I think this is correct because of its capabilities in:
	membership, authentication
	very customizable and simple.

	There must be several servers to manage the packets of
	variables sent on a broadcast net (as Orcero said), for example
	one server for HP, another one for HA, and maybe one for a new
	distributed fs that someone will develop (it will manage info
	about files, not the files themselves).
This layer must be capable of changing latencies and variable types, be
very well designed to run 24x7, and (maybe) even offer QoS. It should also
be associated with a single net interface; that way we can have 2 clusters
(on the same nodes, each node with two net interfaces), one HA and the
other HP, without affecting the latencies of the beats.
Work that the layer will do:
	add nodes to the system (authentication)
	be aware when a node is shut down.
	membership

This layer is common to HA and HP.
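
A minimal sketch of the membership part of such a layer: nodes join when
their first keepalive arrives and are dropped when no keepalive has been
seen within a timeout. The class, its method names and the timeout value
are assumptions made for illustration; this is not how heartbeat itself is
implemented, and authentication is left out.

```python
class Membership:
    """Track cluster members from keepalive arrival times."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def beat(self, node, now):
        """Record a keepalive; return True if the node just joined."""
        is_new = node not in self.last_seen
        self.last_seen[node] = now
        return is_new

    def expire(self, now):
        """Drop and return the nodes whose keepalives are overdue."""
        dead = sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)
        for n in dead:
            del self.last_seen[n]
        return dead

    def members(self):
        return sorted(self.last_seen)


# Times are passed in explicitly so the example is deterministic.
m = Membership(timeout=3.0)
m.beat("a", 0.0)
m.beat("b", 0.0)
m.beat("a", 2.0)
print(m.expire(4.0), m.members())
```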

What is the best way to do this, user level or kernel level? I think it
will be good to leave it at user level (we could have several servers in
each node, maybe configured by text files, XML?). (In mosix that logical
layer hangs the kernel when you do an nmap to one node of the cluster.)

layer 2:
	An API that gives the upper layer the structures, hooks and
	functions necessary to make a customizable cluster.
	These will be general functions such as the way processes migrate
	(HA surely won't need that, which is why it will also be
	customizable), migratable sockets, maybe virtual shared memory,
	and every function a cluster may need. This layer will be at
	kernel level; the functions must be selected at compilation time
	(virtual shared memory and all these things). That way we can
	make a consistent layer on top of which everyone can develop an
	application that can change the algorithms that decide the
	migration policy, the location policy or whatever other
	policy.
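
The separation between a fixed mechanism and a replaceable policy can be
sketched like this. It is a user-level toy, not the kernel-level,
compile-time design described above, and the function names and the "move
one process when the gap is larger than 1" rule are assumptions:

```python
def balance_policy(loads):
    """Suggest moving one process from the busiest node to the idlest."""
    src = max(loads, key=loads.get)
    dst = min(loads, key=loads.get)
    return (src, dst) if loads[src] - loads[dst] > 1 else None


def never_migrate(loads):
    """Policy for a cluster that does not want migration at all."""
    return None


def run_round(loads, policy):
    """Mechanism: apply whatever single move the policy asks for."""
    decision = policy(loads)
    if decision is None:
        return dict(loads)
    src, dst = decision
    new_loads = dict(loads)
    new_loads[src] -= 1
    new_loads[dst] += 1
    return new_loads


print(run_round({"a": 4, "b": 0}, balance_policy))
print(run_round({"a": 4, "b": 0}, never_migrate))
```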

Layer 3:
Distributed application layer.


section of ascii art.

     |---------------------------------------------|
     |         APPLICATION                         |
     | Algorithms to manage                        |
     | migrations or whatever...                   |
     |                                             |
     |---------------------------------------------|
     |   API that provides general                 |
     |   resource management                       |
     |                                             |
     |---------------------------------------------|-|- - - - - - - - - - -|
     |     HEARTBEAT                               | | a way to customize  |
     |                                             | | Heartbeat behavior  |
     |_____________________________________________|_|_ _ _ _ _ _ _ _ _ _ _|

Here I am confused about the relationship between the resource
management layer and the heartbeat layer. There are two ways they can
interact: the first is the proposal I have described; the other one is:

the distributed application would receive all the information from the
heartbeat packets and take care of it; it applies the algorithms that
this specific application needs to solve its problem and then uses the
resource management API.
That way, a specific application may not need the heartbeat layer; it
only has to migrate the processes that the administrator chose in a GUI
environment ...

other ascii section

                      |--------------|
                      | APPLICATION  |
                      |______________|
                      /               \
                    /                   \
                  /                       \
      |--------------|           |---------------------|
      | HEARTBEAT    |           | RESOURCE MANAGEMENT |
      |______________|           |_____________________|



I've already thought about the name of the HP project...  could it be
ANT M? (have you seen the film ANT Z?) As Stallman said, the best part of
developing GNU is the program names. I also think that mosix is the
best approach to HP clusters, but it should be better designed or
coded, so why not call it ANT M (Ant m is Not True Mosix)? It also
has other meanings (the work that a group of ants can manage ...)

Suggestions are welcome.




From owner-linux-cluster@nl.linux.org Sat Mar 24 01:31:21 2001
Received: by humbolt.nl.linux.org id <S92202AbRCXAar>;
	Sat, 24 Mar 2001 01:30:47 +0100
Received: from hilbert.umkc.edu ([134.193.4.60]:40207 "HELO tesla.umkc.edu")
	by humbolt.nl.linux.org with SMTP id <S92193AbRCXAaC>;
	Sat, 24 Mar 2001 01:30:02 +0100
Received: (qmail 162780 invoked from network); 24 Mar 2001 00:28:30 -0000
Received: from nicol1.umkc.edu (HELO kasey.umkc.edu) (david@134.193.4.62)
  by hilbert.umkc.edu with SMTP; 24 Mar 2001 00:28:30 -0000
Message-ID: <3ABBEA31.99BB864D@kasey.umkc.edu>
Date:   Fri, 23 Mar 2001 18:28:33 -0600
From:   "David L. Nicol" <david@kasey.umkc.edu>
Organization: University of Missouri - Kansas City   supercomputing infrastructure
X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.0 i586)
X-Accept-Language: en
MIME-Version: 1.0
To:     "linux-cluster@nl.linux.org" <linux-cluster@nl.linux.org>
Subject: draft RFC: "umbrella"
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list


Just think how much fun the graphics people will have with penguins
with umbrellas.


1: 

The umbrella framework can be implemented as a mounted file system,
or as one or more daemons that monitor a set of files or pipes.
Although a more direct syscall-style interface may be available, the
portable interface, which a system must support to claim "compliance",
is entirely file based.

The framework can be inserted anywhere into the file system, for instance

	mount /opt/umbrella/initscript /umbrella -t umbrella

or
	/opt/umbrella/takedir /umbrella

would both amount to defining umbrella with a target directory of /umbrella.

Each service is either a wservice, which you write to, or an rservice,
which you read from.  Keeping every service one-directional allows fully
functional prototyping with FIFOs.
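The FIFO prototyping idea can be sketched as follows.  This is only an
illustration, not part of the draft: the service name "echo_service" and
the helper names are made up, and a real wservice would run the configured
program rather than a Python callback.

```python
import os
import tempfile
import threading


def run_wservice_once(fifo_path, handler):
    """Read everything one writer sends into the FIFO and pass it to handler."""
    with open(fifo_path, "r") as fifo:  # blocks until a writer opens the FIFO
        return handler(fifo.read())     # read() returns once the writer closes


def demo():
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "echo_service")  # hypothetical service name
    os.mkfifo(fifo_path)

    received = []
    service = threading.Thread(
        target=run_wservice_once, args=(fifo_path, received.append)
    )
    service.start()

    # Client side: "invoking" a wservice is nothing more than writing into it.
    with open(fifo_path, "w") as client:
        client.write("hello from a client\n")

    service.join()
    os.remove(fifo_path)
    os.rmdir(workdir)
    return received
```

An rservice prototype would be the mirror image: the service side opens the
FIFO for writing and the client reads from it.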


2:


At initialization time, umbrella creates a nodes/ directory within the
target directory, and a meta pipe within the target directory too.  Each
defined node will be represented as a directory under nodes, for instance
/umbrella/nodes/fred/ or /umbrella/nodes/3/ .  The contents of the per-node
directories are defined by the specific service, or by attribute=value commands.
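A minimal sketch of that layout, assuming the daemon-and-FIFO style of
implementation.  The function names are hypothetical, and storing each
attribute=value pair as a small file in the node directory is my own
assumption; the draft does not say how attributes are represented.

```python
import os
import stat
import tempfile


def init_umbrella(target):
    """Create the nodes/ directory and the meta pipe inside target."""
    os.makedirs(os.path.join(target, "nodes"), exist_ok=True)
    meta = os.path.join(target, "meta")
    if not os.path.exists(meta):
        os.mkfifo(meta)
    return meta


def is_pipe(path):
    """True if path is a FIFO."""
    return stat.S_ISFIFO(os.stat(path).st_mode)


def add_node(target, name, **attributes):
    """Represent a node as a directory under nodes/.

    Assumption: each attribute=value pair becomes a one-line file.
    """
    node_dir = os.path.join(target, "nodes", name)
    os.makedirs(node_dir, exist_ok=True)
    for key, value in attributes.items():
        with open(os.path.join(node_dir, key), "w") as f:
            f.write(value + "\n")
    return node_dir
```

For example, init_umbrella("/umbrella") followed by
add_node("/umbrella", "fred", arch="i586") would leave /umbrella/meta,
/umbrella/nodes/fred/ and /umbrella/nodes/fred/arch behind.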

3:

Umbrella configuration directives are given to umbrella by opening ./meta
for writing and writing them to it.  Security is by capability keys, which are
generated any time a change is made, or by permission bits, but capabilities
are preferred because permission bits are so easy to fake in cluster situations.

Initially defined directives include:


configure and deconfigure nodes:

	add <node-name> [keyfile=filename to write a key for this node to]
		[<attribute>=<value>].....

	del[ete] <node-name> [key given at node addition]


configure per-node services

The service programs will receive the complete path to the service as
their first command-line argument.  When a per-node service is 
configured, a <service-name>.lock file appears in the target directory
for optional flock synchronization.

	wser[vice] <service-name> <path to program to run when
				   service-name is invoked, by writing
				   into it>

	rser[vice] <service-name> <path to program to run when
				   service-name is invoked, by reading
				   from it>
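The optional flock synchronization on <service-name>.lock might look like
this from a client's point of view.  This is a sketch only: try_lock is a
hypothetical helper, and it opens the file in append mode so the example
also works when the lock file does not exist yet, whereas under umbrella
the file would already be there.

```python
import fcntl
import os
import tempfile


def try_lock(lock_path):
    """Try to take an exclusive flock on lock_path without blocking.

    Returns the open file object that holds the lock, or None if someone
    else already holds it.  Closing the file releases the lock.
    """
    f = open(lock_path, "a")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        f.close()
        return None
    return f
```

A caller that is willing to wait would drop LOCK_NB and let flock block
until the current holder closes its file.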


configure per-cluster services

These services appear in the target directory rather than each node directory.

	cwser[vice] <service-name> <path>
	crser[vice] <service-name> <path>

Extend the language of what you can echo > /umbrella/meta

	define <keyword> <path to handler>
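A rough sketch of how a daemon reading ./meta might parse these directives.
Two things here are my assumptions and not in the draft: that a bracketed
suffix such as del[ete] means any prefix from "del" up to "delete" is
accepted, and that only "add" takes attribute=value words.

```python
# (short form, full form) for each directive named in the draft.
DIRECTIVES = [
    ("add", "add"),
    ("del", "delete"),
    ("wser", "wservice"),
    ("rser", "rservice"),
    ("cwser", "cwservice"),
    ("crser", "crservice"),
    ("define", "define"),
]


def canonical_keyword(word):
    """Map an abbreviated keyword such as "wser" to its full name."""
    for short, full in DIRECTIVES:
        if full.startswith(word) and len(word) >= len(short):
            return full
    raise ValueError("unknown directive: %r" % word)


def parse_directive(line):
    """Split one line written to ./meta into (keyword, args, attributes)."""
    words = line.split()
    if not words:
        raise ValueError("empty directive")
    keyword = canonical_keyword(words[0])
    args, attributes = [], {}
    for word in words[1:]:
        # Assumption: only "add" carries attribute=value pairs.
        if keyword == "add" and "=" in word:
            name, _, value = word.partition("=")
            attributes[name] = value
        else:
            args.append(word)
    return keyword, args, attributes
```

Keywords registered later with "define" would have to be appended to the
table at run time; that part is left out here.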


Capability management:

Keys are random strings of alphanumerics.  

Resources are anything a system using umbrella capabilities wants
to name.

The result written by check is either "0\n\0" or "1\n\0"

	copy <key> <path to file you will write the new key with
			single-use ability to represent what <key>
			can do >

	new <resource_name> <path where new long-term key goes>

	check <resource_name> <key> <path where result goes>	
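An in-memory sketch of the three primitives.  Several details are
assumptions on my part: that "1" means the key is good and "0" means it is
not, that a single-use key is consumed by its first successful check, and
the key length of 32.  The class and method names are hypothetical, and
results are returned as strings here rather than written to a path.

```python
import secrets
import string

ALPHANUMERICS = string.ascii_letters + string.digits


def make_key(length=32):
    """Return a random string of alphanumerics."""
    return "".join(secrets.choice(ALPHANUMERICS) for _ in range(length))


class CapabilityTable:
    def __init__(self):
        self._keys = {}  # key -> (resource_name, single_use)

    def new(self, resource_name):
        """Create a long-term key for resource_name."""
        key = make_key()
        self._keys[key] = (resource_name, False)
        return key

    def copy(self, key):
        """Create a single-use key that can do what key can do."""
        resource_name, _ = self._keys[key]  # KeyError if key is unknown
        single = make_key()
        self._keys[single] = (resource_name, True)
        return single

    def check(self, resource_name, key):
        """Return "1\\n\\0" if key is valid for resource_name, else "0\\n\\0"."""
        entry = self._keys.get(key)
        if entry is None or entry[0] != resource_name:
            return "0\n\0"
        if entry[1]:
            del self._keys[key]  # single-use keys are consumed
        return "1\n\0"
```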


Project Coexistence:

	As long as multiple projects do not use the same
names for either resources or services, they can coexist under the
same umbrella.  For instance, someone who is porting MOSIX from
procfs to umbrella might want to change the name of the "status"
service to something like "mstatus".  It is hoped that as
different projects are ported to this framework, ones that provide
similar services, fully replaceable by each other, will give
their services the same names and the same access languages, for
drop-in replacement.



Have I left anything out, that really is common to everything?








-- 
                      David Nicol 816.235.1187 dnicol@cstp.umkc.edu
  He who says it's impossible shouldn't interrupt the one doing it.



From owner-linux-cluster@nl.linux.org Tue Mar 27 21:40:09 2001
Received: by humbolt.nl.linux.org id <S92231AbRC0TjV>;
	Tue, 27 Mar 2001 21:39:21 +0200
Received: from hilbert.umkc.edu ([134.193.4.60]:11536 "HELO tesla.umkc.edu")
	by humbolt.nl.linux.org with SMTP id <S92221AbRC0Tij>;
	Tue, 27 Mar 2001 21:38:39 +0200
Received: (qmail 225246 invoked from network); 27 Mar 2001 19:37:06 -0000
Received: from nicol1.umkc.edu (HELO kasey.umkc.edu) (david@134.193.4.62)
  by hilbert.umkc.edu with SMTP; 27 Mar 2001 19:37:06 -0000
Message-ID: <3AC0EBE2.3D3DBC7A@kasey.umkc.edu>
Date:   Tue, 27 Mar 2001 13:37:06 -0600
From:   "David L. Nicol" <david@kasey.umkc.edu>
Organization: University of Missouri - Kansas City   supercomputing infrastructure
X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.0 i586)
X-Accept-Language: en
MIME-Version: 1.0
To:     "linux-cluster@nl.linux.org" <linux-cluster@nl.linux.org>
Subject: Re: draft RFC: "umbrella"
References: <3ABBEA31.99BB864D@kasey.umkc.edu>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-cluster@nl.linux.org
Precedence: bulk
Return-Path: <owner-linux-cluster@nl.linux.org>
X-Orcpt: rfc822;linux-cluster-list


After thinking about this all weekend I realized two or three
places where it could be improved.  Also, I think a good first
couple of demonstrator items would be a communications layer, because
that is so much a part of the infrastructure that an explanation of
Why I Left It Out is required, and r* commands, which use that
communications layer but could use any other communications layer
that provides the same interface.

Why I Left Out A Communications Layer From Friday's Draft:

Because a general system needs to work with arbitrary communications systems!

Many are tied to IP addresses, for instance, but that breaks if part of your
cluster is behind a NAT firewall.  But if you are implementing an umbrella
module for compatibility with such a system, that is the communications layer
you would use.

I imagine a good communication layer might multiplex all communication between
any two nodes over one SOCK_STREAM that is between the two, or one for each level
of QOS, for instance a control channel and a bulk channel, like FTP, but reusing
the bulk channel.
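To make the multiplexing idea concrete, here is one possible framing for
carrying several logical channels over a single SOCK_STREAM.  The header
layout (16-bit channel id, 32-bit payload length, network byte order) is
purely an assumption for illustration; nothing in the draft fixes a wire
format.

```python
import struct

# Assumed header: channel id (16 bits) and payload length (32 bits).
HEADER = struct.Struct("!HI")


def encode_frame(channel, payload):
    """Prefix payload with its channel id and length."""
    return HEADER.pack(channel, len(payload)) + payload


def decode_frames(buffer):
    """Split buffer into complete frames.

    Returns (list of (channel, payload), leftover bytes).  The leftover
    is an incomplete frame to be retried after the next read from the
    stream.
    """
    frames = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        channel, length = HEADER.unpack_from(buffer, offset)
        end = offset + HEADER.size + length
        if end > len(buffer):
            break  # payload not fully received yet
        frames.append((channel, buffer[offset + HEADER.size:end]))
        offset = end
    return frames, buffer[offset:]
```

With one stream per QOS level instead, the same framing would still let a
bulk stream be reused for several transfers.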

With a multiplexed system that is hidden, a per-node open service, which opens
a new SOCK_STREAM to something on the other end, might be invoked with a syntax
that has two arguments, <service> and <path/filename> where <service> is some
information for the remote node as to what it is supposed to do with this new
connection and <path/filename> is a spot in the local file system where a unix
socket can be placed, for connection and use.


Thoughts?  How do you wrap a file name into a sockaddr, anyway?
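On the sockaddr question: in C the file name goes into the sun_path member
of a struct sockaddr_un (from <sys/un.h>) whose sun_family is AF_UNIX.  In
Python the address of an AF_UNIX socket is simply the path string and the
library fills in the structure.  A minimal sketch, with hypothetical helper
names:

```python
import os
import socket
import tempfile


def listen_on_path(path):
    """Create a unix stream socket bound to a spot in the file system."""
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)  # creates the socket file at path
    server.listen(1)
    return server


def connect_to_path(path):
    """Connect to the unix socket that lives at path."""
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client.connect(path)
    return client
```

Note that sun_path is a small fixed-size array (on the order of a hundred
bytes on Linux), so <path/filename> cannot be arbitrarily long.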





> Capability management:

I left out "duplicate for persistent use" and "revoke" which are
additional capability primitives.



