
Re: cluster list



>> Mine:
>>
>> Each node consists of a CPU, local RAM, optional flash ROM, L3 cache,
>> and a multi-purpose chip that does DMA between nodes. There isn't any
>> expansion opportunity on the node, or even a PCI chip.
>>
>> So stop assuming clusters have network cards! If I had one, it would
>> be an add-on device accessed via a dumb (non-CPU) bridge node. There
>> are some problems here too... which node(s) control(s) the device?
>> Can (should) node 42 tell the device to use data in node 33?
>
> Well, focus on that kind of special-purpose node, and how many people
> will test it, apart from you and your special-DMA-chip supplier?

I'd guess much of the code could be used on large Sun and SGI boxes.
Treating a 64-way system as SMP is going to hurt, even if it works.

Wasn't Larry McVoy proposing something along these lines? As I recall
it, the example was:

You have a NUMA box with 16 processors.
Processors are in groups of 4.
You run 4 kernels, each doing 4-way SMP.
You have kernel code to make the system appear unified.

The numbers could be from a big Intel box. (the 4-way Xeon limit)
The idea would work for Sun or SGI boxes. Depending on the
details, it ought to work on my hardware too.
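
A minimal sketch of the grouping in that example, assuming processors are
simply numbered 0..15 and assigned to kernels in contiguous blocks. The
function names are made up for illustration; nothing here comes from the
actual proposal.

```python
def partition_cpus(total_cpus, group_size):
    """Split CPU numbers 0..total_cpus-1 into contiguous groups, one per kernel."""
    if group_size <= 0 or total_cpus % group_size != 0:
        raise ValueError("total_cpus must be a positive multiple of group_size")
    return [list(range(start, start + group_size))
            for start in range(0, total_cpus, group_size)]


def kernel_for_cpu(cpu, group_size):
    """Index of the kernel instance that owns a given CPU (contiguous layout assumed)."""
    return cpu // group_size


# The 16-processor NUMA box from the example: 4 kernels, each doing 4-way SMP.
groups = partition_cpus(16, 4)
for kernel, cpus in enumerate(groups):
    print("kernel", kernel, "runs on CPUs", cpus)
```

The hard part, the kernel code that makes the four instances appear
unified, is of course not shown.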

(there is a summit meeting on this soon... wish I was invited)

> I think about nodes connected by some link; the easiest is ethernet
> (if the net is dedicated it does not need to use tcp), but the
> communication device should be something configurable, say 'node, start
> cluster membership on dev=xxx'.

Hardware really defines what you can and should be doing. For all
of the following, assume we want to run more than one kernel.
So "SMP" means a true SMP box that we split up for performance.
I use "mail" to mean data that gets queued on the receiver and
causes an interrupt, or data that is passed into the CPU to avoid
the need for queueing.

CPU-mapped access to memory on other nodes?
  No. (your Ethernet)
  Yes, for free. (large SMP system -- your memory is my memory)
  Yes, almost free. (huge-memory SMP system with a 32-bit CPU)
  Yes, slowly. (NUMA system)
  Yes, slowly, and you can't ignore it. (my system w/ huge memory)

DMA transfer access to memory on other nodes?
  No. (your Ethernet, typical SMP, and maybe some NUMA boxes)
  Yes. (my system, and maybe SCI networks or NUMA boxes)

Can "mail" a small (few bits or bytes) message to other nodes?
  Yes, but unreliable and with horrible latency. (your Ethernet)
  Yes, synchronously, w/o broadcast. (SMP IPI, maybe NUMA too)
  Yes, about 1000 may be queued before overflow. (my system)

Can "mail" a large (several bytes or kilobytes) message to other nodes?
  Yes, unreliably. (your Ethernet)
  No. (SMP, NUMA, my system... no network card to queue it!)

Can broadcast large "mail"?
  Yes, no problem. (your Ethernet)
  No. (SMP, NUMA, my system)

Can broadcast small "mail"?
  Yes, no problem. (your Ethernet)
  Yes. (SMP, NUMA)
  Yes, s-l-o-w-l-y. (my system)

Can broadcast DMA writes?
  No; no DMA. (your Ethernet, SMP, and maybe some NUMA boxes)
  Yes, slowly. (my system)

Can broadcast CPU-mapped access?
  No; no DMA access. (your Ethernet)
  No. (SMP and NUMA)
  Yes, s-l-o-w-l-y. (my system -- heh, don't try to broadcast read)

Take the above hardware differences, then multiply by 2 to the
power of the number of features (failover, process migration)
that people might want. Then consider that people work for
competing organizations with secrets and incompatible human
languages, and it should be no surprise that little work is shared.
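
To put a rough number on that, here is a small sketch of the arithmetic.
The count of 5 hardware variants is only an illustrative assumption (the
five cases listed under CPU-mapped access); the two features are the ones
named above.

```python
def configuration_count(hardware_variants, optional_features):
    """Hardware variants times every on/off combination of optional features."""
    return hardware_variants * 2 ** optional_features


# Assumed: 5 hardware variants; features: failover, process migration.
print(configuration_count(5, 2))

# Each additional optional feature doubles the total.
print(configuration_count(5, 3))
```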

> If you boot each node diskless, clusterfs can read /etc/something
> where the device is marked, and a map between 'address' (ip or
> whatever) and node number is given. In your case that map would
> be in flash, mine would be on nfs.

In your case, get rid of it.

Buy all your Ethernet cards from the same vendor, so only the last
three bytes of the MAC addresses differ. Those bytes are a 24-bit
node number. Assign IP addresses from the 10.x.y.z private-use area
according to node ID.

If your MAC is 0c:ad:13:00:02:05 then your node ID is 517 and
your IP address is 10.0.2.5.
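
A minimal sketch of that mapping, assuming a colon-separated MAC string.
The function names are invented for illustration, and this is user-space
Python rather than anything a kernel would run.

```python
import ipaddress


def node_id_from_mac(mac):
    """Take the last three bytes of a MAC address as a 24-bit node number."""
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError("expected six colon-separated hex bytes: %r" % mac)
    return (octets[3] << 16) | (octets[4] << 8) | octets[5]


def ip_from_node_id(node_id):
    """Place the 24-bit node number in the low three bytes of 10.x.y.z."""
    if not 0 <= node_id < (1 << 24):
        raise ValueError("node ID must fit in 24 bits")
    return ipaddress.IPv4Address((10 << 24) | node_id)


node = node_id_from_mac("0c:ad:13:00:02:05")
print(node, ip_from_node_id(node))
```

Note that node IDs 0 and 0xffffff would land on 10.0.0.0 and
10.255.255.255, so a real scheme would probably want to avoid them.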

Oh, such a hack... but for a wee bit of code, you get rid of all
the lookup tables. You can even kill ARP.

Linux-cluster: generic cluster infrastructure for Linux
Archive:       http://mail.nl.linux.org/linux-cluster/