Re: available resource declaration language(s)
David Santo Orcero wrote:
>
[snip]
> You are transmitting a short packet and waiting to see if a NACK arrives.
> Maybe you never even need to establish a TCP connection, nor even a simple
> UDP transmission; ICMP packets are enough. Between this and a full
> exchange of information about the system -uptime, system
> performance, number of processes, number of empty slots in system tables,
> free memory and so on- there is a HUGE difference. Test it. No matter how
> small your packet is, you will have to use UDP or TCP. TCP is impossible
> due to kernel table limitations, and UDP is a lot bigger than ICMP.
What would make you think about TCP? Neither I nor the article mentioned
it, it isn't in the code, and I certainly wouldn't recommend it.
Cluster membership already requires this keepalive data. You just piggyback
this other stuff on top of it. Zero additional packets are sent for this
data. I certainly wouldn't recommend the approach you are talking about
either ;-). I have *one* connection per machine for all control packets -
TOTAL. Not "n" connections per machine. TCP is stupid for this for lots of
reasons - like being O(N**2), etc. The overhead in kernel connection tables
doesn't go up with the number of machines in the cluster AT ALL. It is
constant - O(1).
It uses a multicast protocol. You can run this protocol on top of UDP
broadcast, UDP multicast, or (for really small clusters) serial ports, or
whatever unreliable transport you want to run it on. The way it works is
that all control data is multiplexed across a single multicast channel -
which minimizes the overhead. This has other advantages that I don't have
time to go into here.
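To make the piggybacking idea concrete, here is a minimal sketch in Python. It is not the actual heartbeat wire format: the field layout, the function names, and the socket setup are all assumptions made up for illustration. The point is only that a handful of status fields adds a few bytes to a packet that cluster membership has to send anyway.

```python
import socket
import struct

# Hypothetical layout (network byte order, no padding):
# node id, sequence number, uptime in seconds, load average x100,
# free memory in KiB, number of processes.
HEARTBEAT_FORMAT = "!IIIHIH"
HEARTBEAT_SIZE = struct.calcsize(HEARTBEAT_FORMAT)


def pack_heartbeat(node_id, seq, uptime_s, load_avg, free_kib, nprocs):
    """Build one keepalive packet with the status data riding along."""
    return struct.pack(HEARTBEAT_FORMAT, node_id, seq, uptime_s,
                       int(round(load_avg * 100)), free_kib, nprocs)


def unpack_heartbeat(data):
    """Decode a packet produced by pack_heartbeat()."""
    node_id, seq, uptime_s, load_x100, free_kib, nprocs = struct.unpack(
        HEARTBEAT_FORMAT, data)
    return {"node_id": node_id, "seq": seq, "uptime_s": uptime_s,
            "load_avg": load_x100 / 100, "free_kib": free_kib,
            "nprocs": nprocs}


def make_multicast_sender(ttl=1):
    """One UDP socket per machine for all control packets (assumed setup)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return sock


def send_heartbeat(sock, payload, group, port):
    """Send one packet to the multicast group; group and port are up to you."""
    sock.sendto(payload, (group, port))
```

With this made-up layout the whole status record is 20 bytes, so it fits comfortably inside a 150 byte heartbeat, and the number of sockets on each machine stays at one no matter how many nodes join.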
Read the article. Read the email you responded to ;-). Better yet, try the
code. I'd be delighted to hear how it works for you.
Too bad Wombat (Peter Badovinatz) is in Australia on vacation, so he can't
comment on it any time soon. This is pretty similar to the proven approach
used by IBM's Phoenix clustering software. They deploy clusters of 500 or
more machines quite successfully. I'm sure he'll be quite disappointed to
learn it doesn't work ;-)
> > For a 1000 node system and a 150 byte heartbeat packet, and a 1 second
> > heartbeat interval, the bandwidth is approximately 1.2% of the bandwidth
> > available on an unswitched 100 Mbit network.
>
> Did you calculate this mathematically, or did you run the test on
> a cluster? The results may be completely different, due to collisions.
> There is no such thing as "the amount of information that a channel can
> transmit"; rather, "a packet on the network uses physical space, and while
> one packet is traveling, the rest must stop". And on the most widespread
> networks they don't, and they collide. Well, we can tell the user
> "you can't use ethernet to do clustering", but this will leave most
> people out of the game.
Collisions are rarely a problem in a properly configured full-duplex
switched network. A switch is cheap. For example, the 24-port 100-mbit
full-duplex switch on my home network cost about $300 USD. If you have a
cluster - buy a switch. Even an expensive switch costs less than a single
node. Without a switch you can't really put together any kind of cluster.
> > If you double the packet size, it would rise to 2.5%. This is pretty small
>
> You are calculating mathematically, dividing the peak bit rate that
> can be obtained using the network by the number of bits that
> you transmit! If you double the packet size, the network usage on cheap
> networks _never_ multiplies by 2, due to collisions! As a clear example
> that anybody can test, if you send 3 Mb per second from one node and another
> 3 Mb per second from another node, it is not true that at a third node the
> information will arrive at 6 Mb/s! In fact, you will have lots of
> collisions on the channel.
Why aren't you running full-duplex switches? They *are* cheap.
I certainly understand that you can't run any ethernet channel full blast.
No problem. But you can run it *lots* faster than 1.2% full. Or, for the
smaller clusters you were talking about, 0.6% full. If adding 0.6% to the
load on the network causes you a problem, your network is too close to
the edge, and will be in trouble in a few days when the load grows even if
all this traffic is removed.
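For anyone who wants to redo the arithmetic behind the 1.2% and 0.6% figures, here is a small sketch in Python. The function name is made up for illustration, and the calculation deliberately counts payload bits only: it ignores Ethernet/IP/UDP framing and any collision losses, so treat the result as an estimate of offered load, not a measurement.

```python
def heartbeat_utilization(nodes, packet_bytes, interval_s, link_bits_per_s):
    """Fraction of the link used if every node sends one packet per interval.

    Payload bits only; framing overhead and collisions are not modeled.
    """
    bits_per_s = nodes * packet_bytes * 8 / interval_s
    return bits_per_s / link_bits_per_s


if __name__ == "__main__":
    link = 100_000_000  # 100 Mbit/s
    for nodes, size in [(1000, 150), (500, 150), (1000, 300)]:
        u = heartbeat_utilization(nodes, size, 1.0, link)
        print(f"{nodes} nodes, {size} bytes: {u:.1%}")
```

1000 nodes at 150 bytes come to 1.2 Mbit/s, which is 1.2% of a 100 Mbit link. Halving the node count halves it to 0.6%, and doubling the packet size doubles it to 2.4%, close to the roughly 2.5% quoted earlier.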
> Let's assume a network that allows broadcast of node information
> and a full exchange of information between your 1000 nodes.
> Remember that the MAINTENANCE of the data collected in a non-P2P solution
> will also be a problem: O(n^2).
Each machine receives "n" updates per period of time. Updating an in-memory
table is pretty cheap. If you code it right, the time to update an entry
for a node is constant, so the total overhead is O(n).
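As a rough illustration of what "code it right" can mean, here is a sketch in Python. The class and its field names are hypothetical, not taken from the heartbeat code. It keeps one entry per node in a hash table, so each incoming update is a single constant-time (on average) store, and "n" updates per period cost O(n) per machine.

```python
import time


class NodeTable:
    """In-memory status table: one entry per node, keyed by node id."""

    def __init__(self):
        self._entries = {}

    def update(self, node_id, status, now=None):
        """Record the latest status for one node: a single dict store."""
        stamp = time.monotonic() if now is None else now
        self._entries[node_id] = (stamp, status)

    def get(self, node_id):
        """Return (timestamp, status) for a node, or None if never seen."""
        return self._entries.get(node_id)

    def __len__(self):
        return len(self._entries)
```

Processing one period's worth of traffic from 1000 nodes is then just 1000 calls to update(), and repeated updates from the same node overwrite its entry instead of growing the table.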
> Let's assume that you have a more
> efficient algorithm, O(n).
Good assumption. See above.
> You will have the 1000 nodes constantly
> sending information, you constantly sending information, and
> constant modification of the table. Maybe a good solution would be to make
> Linux as efficient as Amoeba.
What an amazing paragraph!
In all cases you have 1000 nodes sending updates constantly. Each is doing
1 update per unit time. In all cases you have machines updating the tables
constantly. In my case, each machine performs "n" in-memory updates per
unit time.
How will you know which machines are working and which aren't unless you
have some kind of keepalive or heartbeat? This function (cluster
membership) is a necessary function, unless unreliable clusters are the only
kind you are interested in.
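The membership check itself can be very small. Here is a sketch in Python, assuming you keep the time each node was last heard from; the function name and the dead_after_s threshold are illustrative choices, not values from any real configuration.

```python
def find_dead_nodes(last_seen, now, dead_after_s):
    """Return ids of nodes whose last heartbeat is older than dead_after_s.

    last_seen maps node id -> time the last heartbeat arrived.
    """
    return sorted(node for node, stamp in last_seen.items()
                  if now - stamp > dead_after_s)


if __name__ == "__main__":
    seen = {"node-a": 100.0, "node-b": 97.5, "node-c": 90.0}
    # With a 1 second heartbeat and a 5 second threshold, only node-c
    # has been silent for too long at time 101.
    print(find_dead_nodes(seen, now=101.0, dead_after_s=5.0))
```

Without a regular packet from every node there is nothing to compare against, and a silent node looks the same as a dead one.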
If your machines are not constantly sending information, then they're idle.
Sending information is not evil. All methods under discussion send AT LEAST
one update per unit time. That's what heartbeat does - one update packet
per unit time.
> If you do not broadcast, and instead do P2P with random polling, we will
> send a few packets per second on your 1000-node network, and we will
> overload the kernel with an O(k) algorithm.
Of course it multicasts. (?!?).
If you change from multicast to unicast, you don't decrease the number of
packets sent, but poor implementations can easily increase it ;-). What you
do *clearly* change is the number of packets *received* by each node.
That's the improvement that you get from MOSIX's method.
Receiving fewer packets is nice. On the other hand, the implementation
complexity is higher, the latency on receiving changes in information from
nodes is higher, and you will have much greater difficulty telling quickly
and reliably if a node leaves the cluster unexpectedly. This latter piece
is the single most important property for a high-availability cluster.
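To put numbers on the sent-versus-received distinction, here is a small sketch in Python. It rests on a simplifying assumption that is mine, not a description of how MOSIX actually works: in the unicast case each node sends its update to "fanout" randomly chosen peers once per period. The function names are made up for illustration.

```python
def multicast_counts(n):
    """Per period: every node multicasts one packet, every other node hears it."""
    return {"sent_total": n, "received_per_node": n - 1}


def random_unicast_counts(n, fanout=1):
    """Per period: every node unicasts to `fanout` random peers (assumed model).

    received_per_node is an average, since the peers are picked at random.
    """
    return {"sent_total": n * fanout, "received_per_node": float(fanout)}


if __name__ == "__main__":
    print(multicast_counts(1000))
    print(random_unicast_counts(1000, fanout=1))
```

For 1000 nodes both schemes put 1000 packets per period on the wire when fanout is 1, but each node processes 999 packets under multicast and about 1 under random unicast. The price is that news about any given node takes several periods to spread.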
============ Now to the important part of this note ;-) ==============
The thing I emphasized the most in my initial post was that whatever method
the applications use to get this data must be standardized through a single
agnostic API. This discussion points out *clearly* why I believe this quite
passionately.
The Mosix method has some nice properties. The method I use has different
nice properties. But neither method has all the nice properties at once.
One causes fewer packet receptions but has high latency, poor membership
properties, and more complex code. One is simpler, has lower latency, good
membership properties, but causes more packet receptions.
One method is wonderful for some environments, the other wonderful for other
environments. Each works very well in certain niches, and the niches they
work best in aren't the same. This is normal. It's fine. In fact, it's
probably good!
The most important conclusion I draw from this interchange is that we MUST
create a framework into which we can plug various methods, and have the
client applications not care at all. If we create such a framework, then
the technologies can fight it out, and the winner will always be the user.
And when someone comes out with an even better method for doing this, or one
that serves a particular niche better, then we can just plug it in, and get
the benefits immediately.
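Here is one way such a framework could look, sketched in Python. Every name in it is hypothetical; no such API exists yet, which is the whole point of this note. The two backends are stubs that just hold a table in memory, standing in for a multicast-heartbeat collector and a random-poll collector.

```python
from abc import ABC, abstractmethod
from typing import Dict


class ResourceSource(ABC):
    """What client applications see: node name -> status fields."""

    @abstractmethod
    def node_status(self) -> Dict[str, Dict[str, float]]:
        ...


class MulticastHeartbeatSource(ResourceSource):
    """Stub backend: data would arrive piggybacked on heartbeat packets."""

    def __init__(self, table):
        self._table = dict(table)

    def node_status(self):
        return dict(self._table)


class RandomPollSource(ResourceSource):
    """Stub backend: data would arrive by polling randomly chosen peers."""

    def __init__(self, table):
        self._table = dict(table)

    def node_status(self):
        return dict(self._table)


def least_loaded(source: ResourceSource) -> str:
    """Client code: picks a node without knowing how the data was gathered."""
    status = source.node_status()
    return min(status, key=lambda name: status[name]["load"])
```

A scheduler written against least_loaded() works unchanged whichever backend is plugged in, so a new collection method only has to implement node_status() to be usable everywhere.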
-- Alan Robertson
alanr@unix.sh
Linux-cluster: generic cluster infrastructure for Linux
Archive: http://mail.nl.linux.org/linux-cluster/