I think that the best way of describing the Beowulf supercomputer architecture is to use an example which is very similar to the actual Beowulf, but familiar to most system administrators: a Unix computer laboratory with a server and a number of clients. To be more specific, I'll use the DEC Alpha undergraduate computer laboratory at the Faculty of Sciences, USQ as the example. The server computer is called beldin and the client machines are called scilab01, scilab02, scilab03, up to scilab20. All clients have a local copy of the Digital Unix 4.0 operating system installed, but get the user file space (/home) and /usr/local from the server via NFS (Network File System). Each client has an entry for the server and all the other clients in its /etc/hosts.equiv file, so any client can execute a remote shell (rsh) on any other. The server machine is a NIS server for the whole laboratory, so account information is the same across all the machines. A person can sit at the scilab02 console, log in, and have the same environment as if they had logged onto the server or scilab15. The reason all the clients have the same look and feel is that the operating system is installed and configured in the same way on all machines, and both the user's /home and /usr/local areas are physically on the server and accessed by the clients via NFS. For more information on NIS and NFS please read the NIS and NFS HOWTOs.
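As a rough illustration only (the file contents, mount options and layout below are assumptions, not the actual beldin configuration), the client-side setup described above might look something like this:

# /etc/hosts.equiv on every machine: each host trusts all the others,
# so rsh works between any pair of machines without a password
beldin
scilab01
scilab02
# ... and so on, up to scilab20

# /etc/fstab entries on a client: /home and /usr/local are NFS-mounted
# from the server
beldin:/home        /home        nfs   rw   0   0
beldin:/usr/local   /usr/local   nfs   rw   0   0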
Now that we have some idea about the system architecture, let us take a look at how we can utilise the available CPU cycles of the machines in the computer laboratory. Any person can log onto any of the machines and run a program in their home directory, but they can also spawn the same job on a different machine simply by executing a remote shell. For example, assume that we want to calculate the sum of the square roots of all integers between 1 and 10 inclusive. We write a simple program called sigmasqrt (please see source code) which does exactly that. To calculate the sum of the square roots of the numbers from 1 to 10 we execute:
[jacek@beldin sigmasqrt]$ time ./sigmasqrt 1 10
22.468278

real    0m0.029s
user    0m0.001s
sys     0m0.024s
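Purely as an illustration of what sigmasqrt computes (this is not the actual sigmasqrt source, which is listed with the source code, and it is far slower), an equivalent shell sketch might look like this:

#!/bin/sh
# Illustration only: print the sum of the square roots of all integers
# from $1 to $2 (the same calculation sigmasqrt performs).
awk -v a="$1" -v b="$2" 'BEGIN { for (i = a; i <= b; i++) s += sqrt(i); printf "%f\n", s }'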
The time command allows us to check the wall-clock time (the elapsed time) of running this job. As we can see, this example took only a small fraction of a second (0.029 s) to execute, but what if I want to add the square roots of the integers from 1 to 1 000 000 000? Let us try this, and again measure the wall-clock time.
[jacek@beldin sigmasqrt]$ time ./sigmasqrt 1 1000000000
21081851083600.559000

real    16m45.937s
user    16m43.527s
sys     0m0.108s
This time, the execution time of the program is considerably longer. The obvious question to ask is: what can we do to speed up the job? How can we change the way the job runs to minimize its wall-clock time? The obvious answer is to split the job into a number of sub-jobs and run these sub-jobs in parallel on all the computers. We could split the one big addition task into 20 parts, with each node calculating the square roots over one range and adding them up. When all nodes finish their calculations and return their results, the 20 partial sums can be added together to obtain the final solution. Before we run this job we create a named pipe which all the processes will use to write their results:
[jacek@beldin sigmasqrt]$ mkfifo output
[jacek@beldin sigmasqrt]$ ./prun.sh & time cat output | ./sum
[1] 5085
21081851083600.941000
[1]+  Done                    ./prun.sh

real    0m58.539s
user    0m0.061s
sys     0m0.206s
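The script prun.sh spawns the sub-jobs on the nodes (please see the source code for the actual script). Roughly, it might look like the following sketch; the node names follow the laboratory example above, but the exact range splitting and command options are assumptions:

#!/bin/sh
# Sketch only, not the actual prun.sh: split the range 1..1000000000 into
# 20 equal parts and run one part on each node via rsh. Because /home is
# NFS-mounted everywhere, the same path works on every node. Each sub-job
# writes its partial sum into the named pipe "output".
start=1
step=50000000                              # 1000000000 / 20
i=1
while [ $i -le 20 ]; do
    node=`printf "scilab%02d" $i`          # scilab01, scilab02, ... scilab20
    end=`expr $start + $step - 1`
    rsh $node "cd sigmasqrt && ./sigmasqrt $start $end" > output &
    start=`expr $end + 1`
    i=`expr $i + 1`
done
wait

The sum program only needs to read the twenty partial results from the pipe and add them up; conceptually it is no more than something like awk '{ s += $1 } END { printf "%f\n", s }' (again, the actual program is part of the source code).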
This time we get about 58.5 seconds. This is the time from starting the job until all the nodes have finished their computations and written their results into the pipe. The time does not include the final addition of the twenty numbers, but that takes only a very small fraction of a second and can be ignored. We can see that there is a significant improvement from running this job in parallel: the parallel job ran about 17 times faster (16m45.9s / 58.5s is roughly 17), which is very reasonable for a 20-fold increase in the number of CPUs. The purpose of the above example is to illustrate the simplest method of parallelising concurrent code. In practice such simple examples are rare, and different techniques (the PVM and MPI APIs) are used to achieve the parallelism.
The computer laboratory described above is a perfect example of a Cluster of Workstations (COW). So what is so special about Beowulf, and how is it different from a COW? The truth is that there is not much difference, but Beowulf does have a few unique characteristics. First of all, in most cases client nodes in a Beowulf cluster do not have keyboards, mice, video cards or monitors. All access to the client nodes is done via remote connections from the server node, a dedicated console node, or a serial console. Because there is no need for client nodes to access machines outside the cluster, nor for machines outside the cluster to access client nodes directly, it is common practice for the client nodes to use private IP addresses such as the 10.0.0.0/8 or 192.168.0.0/16 address ranges (RFC 1918 http://www.alternic.net/rfcs/1900/rfc1918.txt.html). Usually the only machine that is also connected to the outside world, using a second network card, is the server node. The most common way of using the system is to access the server's console directly, or to telnet or remotely log in to the server node from a personal workstation. Once on the server node, users can edit and compile their code, and also spawn jobs on all nodes in the cluster.

In most cases COWs are used for parallel computations at night and over weekends, when people do not actually use the workstations for everyday work, thus utilising idle CPU cycles. Beowulf, on the other hand, is a machine usually dedicated to parallel computing, and optimised for this purpose. Beowulf also gives a better price/performance ratio as it is built from off-the-shelf components and runs mainly free software. Beowulf also has more single system image features, which help users to see the Beowulf cluster as a single computing workstation.
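As an illustration only (the addresses and host names below are assumptions, not taken from any particular cluster), the internal /etc/hosts of such a Beowulf might look something like this, with only the server owning a second, externally visible interface:

# Hypothetical /etc/hosts for a small Beowulf using RFC 1918 addresses
192.168.1.1    node0     # server node; its second network card carries the public address
192.168.1.2    node1
192.168.1.3    node2
192.168.1.4    node3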