
re[2]: Clusterwide pids



 >>  * Albert D. Cahalan (acahalan@cs.uml.edu) wrote:
 >>  > Lars Marowsky-Brée writes:
 >>  > >    "Albert D. Cahalan" <acahalan@cs.uml.edu> said:
 >>  > 
 >>  > >>> This leaves us with the chicken and egg problem - how do you
 >>  > >>> boot a node which is - at the time of boot - unable to contact
 >>  > >>> the cluster?
 >>  > >>
 >>  > >> You don't. It is a mistake to design for this perversion.
 >>  > >
 >>  > > Uh. A node not being able to join the cluster is a perfectly
 >>  > > reasonable exception, and you want it to boot so that you can
 >>  > > fix it over the network. It makes sense not to start any cluster
 >>  > > services, that is true.
 >>  > 
 >>  > 1. boot the single node without joining the cluster
 >>  > 2. fix the node
 >>  > 3. reboot to join the cluster

 >>  sounds like a windows solution ;-)  seriously though, if you can boot
 >>  and run processes that are local only (no cluster yet, as you haven't
 >>  been able to join for whatever reason), how does this cause problems when
 >>  you join the cluster?  sure you join the cluster with some machine state
 >>  other than a fresh reset, but the state is relative to the resources you
 >>  can share with the cluster.  and the act of joining the cluster should
 >>  initialize any cluster specific state on the node, no?

I think you are missing the point.

If the goal is a cluster with totally transparent IPC capability across the cluster (SSI IPC), then you have to have a unique pid for each process in the cluster.

For instance:   kill nnn
must send a signal to exactly one process, not to several.

Setting aside the performance gains potentially available with SSI IPC, one of its main points is to allow administration of the cluster from any node.

For instance, if you have local pids for local processes and CPIDs only for cluster processes, then you have the difficult administrative task of telnetting to the appropriate node to kill a runaway local process.

Yes, you can modify kill to accept a node parameter, but IPC is used throughout lots of different admin tools, and supporting a node parameter in all of them would get extremely difficult.

In my opinion, it is better to have CPIDs for all processes, so that we can start thinking seriously about SSI IPC.
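
To make the "kill from any node" point concrete, here is a rough sketch in Python. It is only an illustration: the Node class, cluster_kill, and the 16-bit pid field are all made up for this example and do not come from any existing kernel or tool. The idea is that a CPID names exactly one (node, pid) pair, so the signal is routed to one process no matter where kill is run.

```python
import signal

PID_BITS = 16  # assumed: low 16 bits hold the per-node pid, the rest the node #


class Node:
    """A toy stand-in for one cluster node (hypothetical, for illustration)."""

    def __init__(self, number):
        self.number = number
        self.delivered = []  # (pid, signal) pairs this node has delivered

    def deliver(self, pid, sig):
        self.delivered.append((pid, sig))


def cluster_kill(cluster, cpid, sig):
    """Deliver sig to the one process named by cpid, whichever node owns it."""
    node_number = cpid >> PID_BITS
    pid = cpid & ((1 << PID_BITS) - 1)
    cluster[node_number].deliver(pid, sig)


cluster = {1: Node(1), 2: Node(2)}
# Run "kill" anywhere in the cluster; only node 2 sees the signal.
cluster_kill(cluster, (2 << PID_BITS) | 4242, signal.SIGTERM)
```

With local pids plus CPIDs only for cluster processes, the same call would be ambiguous for pid 4242 unless every tool also carried a node parameter.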

If you have processes running prior to joining the cluster and you want CPIDs, then you have 2 basic choices I can think of:

1) Use the local/cluster process paradigm.
2) Have the kernel assume that any pid with a node # of 0 is on the local node, and have it put the local node # into the CPID.
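
Choice 2 could look something like the sketch below. The field widths (16 bits each) and the names make_cpid, split_cpid, and resolve_cpid are assumptions made for illustration only; nothing here reflects an actual implementation.

```python
NODE_BITS = 16  # assumed width of the node # field
PID_BITS = 16   # assumed width of the per-node pid field
NODE_MASK = (1 << NODE_BITS) - 1
PID_MASK = (1 << PID_BITS) - 1


def make_cpid(node, pid):
    """Pack a node # and a per-node pid into one cluster-wide pid."""
    if not 0 <= node <= NODE_MASK:
        raise ValueError("node # out of range")
    if not 0 <= pid <= PID_MASK:
        raise ValueError("pid out of range")
    return (node << PID_BITS) | pid


def split_cpid(cpid):
    """Return (node, pid) for a cluster-wide pid."""
    return (cpid >> PID_BITS) & NODE_MASK, cpid & PID_MASK


def resolve_cpid(cpid, local_node):
    """Treat node # 0 as 'this node' and fill in the local node #."""
    node, pid = split_cpid(cpid)
    if node == 0:
        node = local_node
    return make_cpid(node, pid)
```

Note that this scheme only works if node # 0 is never handed out to a real node, which is a further assumption of the sketch.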

Far preferable to either of the above, in my mind, is to have two boot choices: standalone for maintenance, cluster for normal operation.

In the cluster boot situation there are again 2 basic choices:

1) Have a predetermined node #
2) Have the node # dynamically assigned, possibly by having a small 'cluster joining' app which is invoked prior to loading the kernel.  

In either case, if quorum is not available on the cluster at boot up, the kernel/app just sits and waits for more nodes to come alive.
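
The "sit and wait" behaviour might be sketched as below. This assumes quorum means a strict majority of the configured nodes, which I have not defined above, and the names has_quorum, wait_for_quorum, and count_alive are hypothetical.

```python
import time


def has_quorum(alive_nodes, total_nodes):
    """True when a strict majority of the configured nodes is alive (assumed rule)."""
    return alive_nodes > total_nodes // 2


def wait_for_quorum(count_alive, total_nodes, poll_seconds=1.0, sleep=time.sleep):
    """Block until quorum is reached, then return the alive-node count.

    count_alive is a caller-supplied function that reports how many nodes
    currently respond; how it finds that out is outside this sketch.
    """
    while True:
        alive = count_alive()
        if has_quorum(alive, total_nodes):
            return alive
        sleep(poll_seconds)
```

The same loop fits either choice: a predetermined node # or a small 'cluster joining' app would both call something like wait_for_quorum before going on.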

Greg Freemyer
Internet Engineer
Deployment and Integration Specialist
The Norcross Group
www.NorcrossGroup.com


Linux-cluster: generic cluster infrastructure for Linux
Archive:       http://mail.nl.linux.org/linux-cluster/