The Resurrection of bWatch

Jeff Layton

Bringing back a tool from the early days of Beowulf.

When the world was dominated by dinosaurs, a new beast arose from the depths: one that relied on the Earth's community to grow and thrive. It was Beowulf. No, not the Scandinavian warrior, but an approach to high-performance computing (HPC) that uses common x86 processors, conventional Ethernet networking, the Message Passing Interface (MPI) or Parallel Virtual Machine (PVM), and Linux.

With Beowulf, everything was new. Previous HPC systems all had proprietary tools to manage and monitor servers (nodes) in the cluster, so the tools for Beowulf clusters had to be developed, including those to monitor clusters. During the first few years of this Clusterian period of HPC, one of the key requirements was a tool to monitor nodes – something simple that would give the status of all nodes in the cluster.

A visual presentation of the load on the nodes (with the uptime command), node uptime, and maybe even some memory usage would let you determine the status of a cluster with a quick glance at the screen. At that time, lots of people were experimenting with and developing clusters, so many were homemade and small: A 60-node cluster was considered large, so you would easily be able to see all of the nodes on one screen with a little scrolling.

bWatch

Python was still in its early days, but Tcl/Tk was very popular. The high-level, general-purpose, interpreted, and dynamic Tool Command Language (Tcl) was created by John Ousterhout in the late 1980s. From the name you can probably tell that it was intended to be a language for writing tools for whatever platform you were using. (I will begrudgingly admit there are more platforms than Linux and other Unix-like operating systems.) Because it was interpreted, you didn’t have to change code and recompile constantly. You could just change the code and run.

Ousterhout also needed a tool for developing graphical user interfaces (GUIs) for Tcl applications, so about one year after starting Tcl, he created Tk, which has expanded beyond just Tcl with bindings to many other languages. Tk provides widgets commonly used for desktop applications such as buttons, menus, a canvas, text, labels, and so on.

To create something useful for Beowulf clusters, Jacek Radajewski used these two tools to write a simple monitoring tool, bWatch. Although it was simple in concept, a huge portion of the people using and developing Beowulf clusters adopted it when it was first released. The first time I used it, I was very excited because I could get a visual representation of the state of the cluster (Figure 1), rather than having to log in to each node or use a command-line tool to run parallel commands on all the nodes and parse through the data.

Figure 1: Sample screenshot of bWatch (Jacek Radajewski).

The bWatch concept is pretty simple: The idea is to gather information about a node by opening a shell on that node and running Linux commands. Originally, bWatch used the remote shell (rsh) to execute shell commands on another node. Although rsh is very insecure (it transmits everything without encryption), it was commonly used at the time, so bWatch used it.

bWatch shells to the node and runs the uptime command to find the current time on the node, the number of users, how long the node has been up, and the 1-, 5-, and 15-minute load averages on the node. Isn’t it amazing how much information is contained in this one command?
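As a rough illustration of how those fields might be pulled out of the uptime output, the following shell sketch defines two helpers. The function names load_averages and user_count are hypothetical and are not part of bWatch, and the sketch assumes the usual English-locale output format that ends in "load average: x, y, z".

```shell
#!/bin/sh
# Hypothetical helpers (not part of bWatch). Each reads one line of
# `uptime` output on stdin.

# Print the 1-, 5-, and 15-minute load averages separated by spaces.
load_averages() {
    # Keep only what follows "load average:", then drop the commas.
    sed -n 's/.*load average: *//p' | tr -d ','
}

# Print the number of logged-in users ("1 user" or "N users").
user_count() {
    sed -n 's/.* \([0-9][0-9]*\) users\{0,1\},.*/\1/p'
}
```

For example, ssh node1 uptime | load_averages would print the three load averages for a node called node1 (a placeholder host name).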

bWatch also runs the command

cat /proc/meminfo

to gather information about the memory being used. With the output from just this command and uptime, you can get an idea of the status of the node.
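To make the idea concrete, here is a minimal shell sketch that extracts a single value from /proc/meminfo-style text. The helper name meminfo_field is an assumption made for illustration; bWatch itself does this work in Tcl.

```shell
#!/bin/sh
# Hypothetical helper (not part of bWatch): print the numeric value of one
# field from /proc/meminfo-style text read on stdin.
# Usage: meminfo_field MemFree < /proc/meminfo
meminfo_field() {
    # Lines look like "MemFree:   8192000 kB"; match the name exactly.
    awk -v key="$1:" '$1 == key { print $2 }'
}
```

Run against a node, ssh node1 cat /proc/meminfo | meminfo_field MemTotal would print that node's total memory in kilobytes (node1 is again a placeholder). A field that does not exist simply produces no output.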

Running bWatch

One recent morning, I saw the word "bWatch" on a dry erase board in my home office. Evidently at some point in the last five years or so, I wrote that, but I couldn't remember exactly why. Regardless, that morning I decided to try to get bWatch running again on today's Linux and hardware – or at least give it a good try, even though my Tcl knowledge is gone.

The last version of bWatch posted is 1.1.0a with a date stamp of June 15, 2004, which makes it about 21 years old now. There's no chance this will work, right? I wanted to try it anyway, but I will probably have to make some changes along the way.

wishx and ssh

bWatch is written in Tcl, which I didn’t remember at all, but it’s fairly easy to read. The first line in the bWatch.tcl code is the shebang (#!/bin/<something>), which names the interpreter that runs the code. In the case of bWatch, it is #!/usr/bin/wishx (Windowing Shell), a Tcl and Tk interpreter. However, wishx has since given way to just plain wish, which I easily installed on my Ubuntu 24.04 system; then, I changed the interpreter from wishx to wish on the first line of the code. Easy enough.
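The same one-line change can be scripted. The sketch below is only one possible approach: set_shebang is a hypothetical helper, and it assumes you have kept a backup copy of the script and that wish is already installed.

```shell
#!/bin/sh
# Hypothetical helper (not part of bWatch): replace the shebang line of a
# script with a new interpreter path.
# Usage: set_shebang bWatch.tcl "$(command -v wish)"
set_shebang() {
    file=$1
    interp=$2
    # Refuse to touch files whose first line is not a shebang.
    head -n 1 "$file" | grep -q '^#!' || return 1
    tmp=$(mktemp) || return 1
    # Write the new first line, then the rest of the script unchanged.
    if { printf '#!%s\n' "$interp"; tail -n +2 "$file"; } > "$tmp"; then
        cat "$tmp" > "$file"    # overwrite contents, keep permissions
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}
```

Using command -v wish to find the interpreter avoids hard-coding a path that might differ between distributions.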

bWatch uses rsh, so I knew I had to change that as well. I searched through the code, and rsh is defined in the line

set command(rsh)                "/usr/bin/rsh"

so I changed the set command target to ssh:

set command(rsh)                "/usr/bin/ssh"

Everywhere else in the code, rsh itself is not used, only the command(rsh) variable.
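Because a GUI tool cannot answer a password prompt, it is reasonable to assume that each node must accept a non-interactive (for example, key-based) ssh login before bWatch can collect anything; that assumption is not stated in the bWatch documentation quoted here. A small shell sketch for checking it, with the hypothetical helper name check_hosts, could look like this:

```shell
#!/bin/sh
# Hypothetical helper (not part of bWatch): report which hosts accept a
# non-interactive ssh login.
# Usage: check_hosts node1 node2 node3
check_hosts() {
    for host in "$@"; do
        # BatchMode=yes makes ssh fail rather than prompt for a password;
        # ConnectTimeout keeps an unreachable node from stalling the loop.
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
            echo "$host ok"
        else
            echo "$host FAILED"
        fi
    done
}
```

Any host reported as FAILED would need its ssh access sorted out before it is worth adding to the list of monitored nodes.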

.bWatchrc.tcl File

The next issue was how to specify the node names to be used, which meant I had to check the README file, where I found that the list of nodes (hosts) is specified in a file named .bWatchrc.tcl in your home directory:

set listOfHosts {node1 node2 node3 node4 node5 node6 node7 node8}

I’m not sure what restrictions are imposed on node names, so I just used the output from the command hostname on all nodes. (I only used one node for this article.)
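If you have many nodes, typing that line by hand gets tedious. As a small convenience sketch (hosts_line is a hypothetical helper, and it assumes host names contain no whitespace or braces), the line can be generated from a list of names:

```shell
#!/bin/sh
# Hypothetical helper (not part of bWatch): print a Tcl "set listOfHosts"
# line for the host names given as arguments.
# Usage: hosts_line node1 node2 node3
hosts_line() {
    # "$*" joins the arguments with single spaces.
    printf 'set listOfHosts {%s}\n' "$*"
}
```

For a single machine, hosts_line "$(hostname)" prints a line that can be pasted into .bWatchrc.tcl.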

First Run

To install bWatch, use:

sudo make install

This command puts the code in /usr/local/bin, which is in my default $PATH; you might want to check that it is in yours, too. Next, I ran the command bWatch.tcl (Figure 2). Everything sort of looks good, but notice that the Shared Mem column has no values. Something is amiss.

Figure 2: First run of bWatch in 21 years on my systems.

When I ran bWatch, a wish console window popped up (Figure 3). The complete error message reads,

getSharedMemory(laytonjb-Precision-7680):child process exited abnormally

so something is going on with the getSharedMemory procedure.

Figure 3: wish console message.

Before digging into this error, I tried turning off data collection for getSharedMemory by clicking on the Options button, deselecting the Shared Memory option (Figure 4), then clicking Apply and OK to close the window.

Figure 4: Options selection box.

At the top of the main bWatch window, selecting Refresh causes bWatch to update the information for all nodes possible (Figure 5).

Figure 5: Refreshed bWatch.

Back to Shared Memory

Although missing just the shared memory information is probably not a big deal, it would be nice to get the data originally captured in bWatch. The procedure that gets the shared memory information in bWatch.tcl is:

proc getSharedMemory {host} {
    global command bWatchDir

    set errorCode \
        [catch {set sharedMemory \
            [exec cat $bWatchDir/bWatchMemInfo | grep MemShared]} errorMessage]
    # check if the command returned an error
    if {$errorCode} {
        set date [getDate]
        .console.messageList insert end \
            "$date :: getSharedMemory($host) : $errorMessage"
        return "error"
    } else {
        return $sharedMemory
    }
}

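Notice that it runs the command grep MemShared on the information gathered with cat /proc/meminfo. (You need to dig into the code a bit to find out that it stores the result from cat so it can be searched for the needed information.) When grep finds no match, it exits with a non-zero status, which is why the console reported that the child process exited abnormally.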

The output of cat /proc/meminfo from a modern Linux system won’t show MemShared. The closest output is Shmem, so you can just change the grep command to grep Shmem. If you do this and restart bWatch.tcl, you’ll see that the Shared Mem column reappears (Figure 6). Note that you might have to go back and re-select Shared Memory under Options to get it to show.
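One caveat: On recent kernels, /proc/meminfo can also contain the fields ShmemHugePages and ShmemPmdMapped, so an unanchored grep Shmem might return more than one line. How bWatch handles the extra lines is not examined here. If you want exactly one line, an anchored pattern is a simple option, sketched below with the hypothetical helper name shmem_line:

```shell
#!/bin/sh
# Hypothetical helper (not part of bWatch): print only the "Shmem:" line
# from /proc/meminfo-style text read on stdin.
# Usage: shmem_line < /proc/meminfo
shmem_line() {
    # ^ anchors the match to the start of the line and the colon ends the
    # field name, so ShmemHugePages and ShmemPmdMapped are skipped.
    grep '^Shmem:'
}
```

Inside bWatch.tcl, the equivalent change would be to use the pattern ^Shmem: in place of Shmem in the grep command.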

Figure 6: The reappearance of Shared Mem.

Summary

I remember using bWatch many years ago on my clusters, and I loved the simplicity of the GUI and its ease of configuration, along with the simple information it provided. After 21 years, I was surprised how easily I got bWatch working again. I think that shows the importance of keeping programming languages, GUI toolkits, and Linux interfaces (e.g., /proc/meminfo) stable, with a few small exceptions. I also like that you don’t have to be an admin to use bWatch. Even a humble user can gather this data.

Give bWatch a whirl and poke around with some of the options (which you can save to your .bWatchrc.tcl file). Although there really is no need, I think it would be interesting to port bWatch to Python, because that would probably be faster than my relearning Tcl/Tk.