Desktop Supercomputers: Past, Present, and Future

Desktop supercomputers give individual users control over compute power to run applications locally at will.

In two previous articles, I discussed the forces that led to the rise and development of desktop supercomputers, including the history of supercomputer processors and the open source tools and communities that drove it. Along the way, I used the phrase “desktop” to mean systems that can sit beside a desk (i.e., a “deskside” system). In this article, I review the first desktop supercomputers, focusing on what they offered, along with some comments about why they failed. However, this lack of commercial success does not mean that these systems were not influential, which is precisely why I want to discuss them.

After looking at these systems, I will talk about present-day desktop supercomputers. Even though you rarely see the phrase “desktop supercomputer” applied to current systems, a vibrant range of options exists. Being an engineer, I also like to think about future options in this space. Like many other people, I like performance that lets me solve problems faster, solve larger problems, or take new approaches to solving them.

Before jumping into all these discussions, it is important to review power, because desktop supercomputers need to be able to plug into standard power outlets in offices and homes. Without the ability to use standard outlets, the systems would reside in a data center, which would defeat the purpose of being a desktop system.

Power

Homes, offices, and labs are not data centers and have power and cooling limitations. Standard wall sockets in the US are 120V, and common amperage values in the home are 15 and 20A. A 15A circuit has a capability of 1,800W (120 × 15 = 1,800), and a 20A circuit has a capability of 2,400W.

The US National Electrical Code (NEC) rules and best practices state that the design wattage for a typical residence is 80% of the maximum. The maximum usable wattage values are then 1,920W (20A), 1,440W (15A), and 960W (10A). Twenty-amp circuits are common in offices and homes. However, in homes, these circuits are sometimes broken up (e.g., into two 10A circuits).
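Because these limits come up repeatedly with the systems discussed below, the arithmetic is worth spelling out. The short Python sketch below simply restates the rule of thumb (watts = volts × amps, derated to 80% for continuous loads) for the common US circuit sizes:

# Usable (continuous-load) wattage for common US 120V circuits,
# applying the NEC 80% rule of thumb: usable = volts * amps * 0.8
VOLTS = 120
DERATING = 0.8  # continuous-load design limit

for amps in (10, 15, 20):
    maximum = VOLTS * amps
    usable = maximum * DERATING
    print(f"{amps}A circuit: {maximum:,}W maximum, {usable:,.0f}W usable")

# Output:
# 10A circuit: 1,200W maximum, 960W usable
# 15A circuit: 1,800W maximum, 1,440W usable
# 20A circuit: 2,400W maximum, 1,920W usable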

Past Desktop Supercomputers

Desktop (deskside) supercomputers are not new, but three past systems were pivotal. Orion Multisystems released the first purpose-built desktop supercomputer, which used a revolutionary low-power processor and a custom processing board. The second system, the TyanPSC Typhoon, switched gears to use commodity processors and motherboards in a custom deskside chassis. A third system, from Cray, used commodity processors with custom motherboards and chassis. It focused on incorporating Cray’s high-performance computing (HPC) expertise into a personal system.

Orion Multisystems

Transmeta Corporation created a novel CPU with its own instruction set and relied on a software layer, called Code Morphing Software (CMS), to translate instructions from other CPUs into native Transmeta instructions. The goal of this approach was to design a very efficient, low-power processor free of the constraints of an existing instruction set architecture (ISA). The CMS comprised an interpreter, a runtime system, and a dynamic binary translator – all of it just software. Because supporting a different ISA required only new software, the cost should, in theory, have been much less than designing and fabricating a new CPU.

The first CPU from Transmeta was the Crusoe processor, launched in January 2000 with a CMS targeting the x86 instruction set. A 700MHz Crusoe CPU ran x86 applications at about the same speed as a 500MHz Pentium III CPU; however, it was less expensive and used less power than the Pentium III.

Transmeta’s second generation of CPUs, named Efficeon, was released in 2004 and also shipped with an x86 CMS. The Efficeon CMS tracked the Pentium 4 more closely, whereas Crusoe’s had tracked the Pentium III. The processor had an integrated memory controller and a HyperTransport I/O bus and started at a speed of 1.2GHz. Although test results are hard to come by, its performance was thought to be a bit less than a Pentium M’s, presumably with lower power consumption.

One of the first companies to utilize the Transmeta CPUs was Orion Multisystems, which in April 2004 introduced two multinode desktop/deskside systems. In a presentation, Orion clearly stated that these systems were designed for the individual engineer, not as a shared system.

The first system, the DT-12, was a true desktop system with 12 “nodes” on a single special motherboard. Each node was a Transmeta Efficeon processor connected to the other nodes over Gigabit Ethernet (GigE). Each node also had 512MB of memory, and the system had a 160GB hard drive (which was quite a bit of storage in 2004). The system cost between $20,000 and $30,000. As with other desktop systems, the DT-12 plugged into standard 120V outlets and used less than 200W of power. Orion also introduced a 96-node deskside system that used 1,500W of power in a chassis with eight DT-12 motherboards (Figure 1) for about $170,000 fully configured. Unfortunately, Orion shut down operations in February 2006.

Figure 1: Orion Multisystems desktop/deskside supercomputer (Joel Adams, Calvin College).

TyanPSC

Another key example of a desktop/deskside supercomputer is the TyanPSC Typhoon Personal Supercomputer (PSC), launched at the Computex Taipei show in June 2006. It used a “blade” configuration, wherein commodity motherboards slid into and out of a chassis (Figure 2). This first version of the TyanPSC, the Typhoon 600, had four motherboards, each in its own blade, in a deskside chassis on wheels. The chassis was 14 × 12.6 × 26.7 inches. Although that sounds compact, it was large for a deskside system.

Figure 2: TyanPSC Typhoon (from Tyan FTP site). The rollers on the bottom of the chassis helped to move it around.

The initial Typhoon 600 blades used Intel processors. Each blade had two sockets, so you could get as many as 16 cores in the original system with dual-core processors, as well as a single SATA drive per blade. It also supported up to 64GB of system memory, and the chassis had integrated networking with nine GigE ports: two for each blade and one for connecting externally. In total, the system was designed with a power limit of 1,400W (four 350W power supplies plugged into standard 120V wall sockets). The Typhoon ran Linux, but it could also run Microsoft Windows Compute Cluster Server 2003, which was Microsoft’s push into supercomputing and smaller scale HPC.

At Supercomputing 2006 (SC06), Tyan announced support for quad-core Intel processors, allowing up to 32 cores per system at a starting price of less than $15,000, and also announced InfiniBand capability. Tyan further developed the PSC with Opteron blades and with x16 PCIe slots on the blades for a high-performance visualization GPU.

Cray CX1

The Cray CX1 deskside system had a big impact on HPC. Cray, a company known primarily for large, centralized supercomputing systems, released this deskside supercomputer (Figure 3), focused on the individual user, in September 2008. The company took the knowledge and experience gained from its large systems and put it into a personal workstation.

Figure 3: Cray CX1 (image by permission of Digital Engineering 24/7 magazine).

The chassis supported up to eight blades, each a dual-socket, Intel-based motherboard with Xeon 5400, 5500, or 5600 series processors with two, four, or six cores. As with the previous systems, you could plug it into normal home or office outlets. It was designed for a 20A circuit that provided up to 1,920W, but it used a maximum of only 1,600W.

The blades had up to eight DDR3 DIMM slots, along with two 2.5-inch SATA hard drives (HDDs) and a single x16 PCIe port. You could also use a one-slot storage blade with four 3.5-inch HDDs that connected to another blade through the x16 PCIe slot, but it took the place of an adjacent compute blade. As with the previous two desktop/deskside systems, networking was built in: either GigE or DDR InfiniBand.

Cray also introduced a visualization or GPU compute blade that connected to another blade through the PCIe slot. You could use an Nvidia Quadro GPU for visualization, or you could connect up to four external S1070 GPU compute chassis, each housing up to four C1060 GPUs.

The Cray CX1 had some additional unique features, such as an active noise cancellation system and an integrated touchscreen control panel from which you could control each of the blades or check power consumption, processor temperatures, chassis fan speeds, and other system metrics. Another interesting feature of the CX1 was that, like the TyanPSC Typhoon, it was designed to run either Red Hat Enterprise Linux (RHEL) or Microsoft Windows. This, too, was during the time that Microsoft was making a push into the HPC world.

Why These Three Systems?

These three systems are important for different reasons. The Orion Multisystems desktop supercomputer was dead-on-target for a growing market of systems that provided more than the processing power of a single- or dual-socket motherboard in a power-efficient case. It combined a very innovative, low-power processor with a custom-built system board and chassis; however, the combination of a new, untried processor with custom boards and chassis at a high starting price ($20,000) did not allow the system to succeed.

The TyanPSC Typhoon tried to reduce pricing by using commodity two-socket motherboards and processors, with the chassis the only custom component. It broke ground in that it could run Microsoft Windows Compute Cluster Server 2003 in addition to Linux, the thought being that Windows support would appeal to a larger audience. The commodity components did reduce costs relative to the Orion Multisystems machines, bringing the introductory price down to around $15,000. However, it did not sell as well as hoped, partly because it was still too expensive.

The third system was remarkably interesting because it was built by a major supercomputing company with a massive amount of HPC experience. The Cray CX1 was a very well-built system that used all custom components except the processors, which allowed Cray to provide a completely integrated solution. It was also innovative in that it offered GPUs for either visualization or computing. Interestingly, the starting price was $25,000, which was not that much more than the TyanPSC, especially considering it was virtually a completely custom design. However, the CX1 did not survive long, again mostly because of the price.

Present

Since the Cray CX1, companies have offered two- and four-socket deskside workstations that can be used to run supercomputing applications. However, the early “four-way” (four-socket) systems tended to be expensive, and cache snoop traffic could occupy a significant portion of their interprocessor communication. Today’s four-ways have gotten better, but they emphasize large DIMM memory and storage drive capacity and are not really designed for HPC. Rather, they are designed for other workloads, such as databases or graphics processing.

NVIDIA DGX Station A100

Among large workstations, Nvidia has recently upped the ante by offering a deskside solution with very high-performance GPUs that focuses on artificial intelligence (AI) and HPC. AI applications have had a major effect on HPC workloads because they require massive amounts of compute to train models. Additionally, a large percentage of AI work, particularly deep learning (DL) training, is accomplished with interactive Jupyter notebooks.

As the demand for AI applications grows, more and more people are developing and training models. Learning how to do this is a perfect workload for a workstation that the user controls and that can easily run interactive notebooks. These systems are also powerful enough for training what are now considered small to medium-sized models.

Just a few years ago, Nvidia introduced an AI workstation that used its powerful V100 GPUs. The DGX Station had a 20-core Intel processor, 256GB of memory, three 1.92TB NVMe solid-state drives (SSDs), and four very powerful Tesla V100 GPUs, each with 16 or 32GB of memory. It used a maximum of 1,500W (20A circuit), and you could plug it into a standard 120V outlet. It had an additional graphics card, so you could plug in a monitor. This system sold for $69,000.

The DGX Station has now been replaced by the DGX Station A100, which includes the Nvidia A100 Tensor Core GPU. It uses an AMD Epyc 7742 CPU with 64 cores, along with 512GB of memory, 7.68TB of cache (NVMe drives), a video GPU for connecting to a monitor, and four A100 GPUs, each with 40 or 80GB of memory. This deskside system also uses up to 1,500W and standard power outlets (120V). The system price is $199,000.

These deskside systems are a kind of pinnacle of desktop/deskside supercomputers that use standard power outlets while providing a huge amount of processing capability for individuals; however, they are pushing the power envelope of standard 120V home and office power. Although powerful, they are also out of the price range of the typical HPC user.

The DGX Station A100 represents an important development because several AI-specific processors are currently being developed, including Google’s tensor processing unit (TPU) and field-programmable gate array (FPGA) designs. If any of these processors gains a foothold in the market, putting it in a workstation is an inexpensive way to get it into developers’ hands.

SBC Clusters

At the other end of the spectrum from the DGX Station A100 are single-board computers (SBCs), which have become a phenomenon in computing courtesy of the Raspberry Pi. SBCs contain a processor (CPU), memory, graphics processing (usually integrated with the CPU), network ports, and possibly some other I/O capability. Inexpensive SBCs cost less than $35, most cost less than $100, and some $10 boards even include a quad-core processor, an Ethernet port, and WiFi. Moreover, the general purpose I/O (GPIO) pins that many of these systems offer allow the SBCs to be expanded with additional features, such as NVMe drives, RAID cards, and a myriad of sensors.

Lately, SBCs have started to give low-end desktops a run for their money, but a constant feature is low power consumption, much lower than that of any desktop. For example, the Raspberry Pi 4 Model B uses only about 6.4W under extreme load.

The CPUs in these SBCs are mostly ARM architecture, but some use x86 or other processors. Many are 64-bit, and some have a reasonable amount of memory (e.g., 8GB or more). Others, such as the Nvidia Jetson Nano, have very high-performance GPUs for their low-power envelope. As you can imagine, given the low cost and low power, people have built clusters from SBCs. Like many other people, I built a cluster of Raspberry Pi 2 modules (Figure 4).

Figure 4: My Raspberry Pi cluster, minus the head node, which is external to the “tower.”

PicoCluster LLC has taken home-built SBC clusters, effectively Beowulf clusters, to the next level: They build cases and ancillary hardware for a variety of SBCs. Their Starter Kit includes a custom case for the SBCs, power, cooling fans, and networking; however, it does not include the SBCs or their SD cards. An Advanced Kit builds on the Starter Kit but adds the SBCs. Finally, an Assembled Cube is fully equipped with everything – SBCs, SD cards – and is burned in for four hours. This kind of system is perfect for a desktop supercomputer (Figure 5). Under load, the Jetson Nano uses a maximum of about 10W, so the Pico 10H uses a bit over 100W (including the switch and other small components).

Figure 5: PicoCluster 10H cluster of 10 Jetson Nano SBCs in an assembled cube (by permission of PicoCluster LLC for this article).

An add-on board, the Cluster HAT (Hardware Attached on Top), for a standard Raspberry Pi allows you to add up to four Raspberry Pi Zero boards to create a very small cluster. A while ago, I wrote an article about the first-generation Cluster HAT (Figure 6). Newer versions of the Cluster HAT hardware allow you to scale to a larger number of Cluster HATs per Raspberry Pi.

Figure 6: My ClusterHAT setup.

Limulus

Doug Eadline, a luminary in the Beowulf community, has applied Beowulf principles to desktop supercomputing with a system he calls Limulus. Limulus takes consumer-grade standard motherboards, processors, memory, drives, and cases and creates a deskside Beowulf. To do this, he designed mounting brackets and specific 3D-printed components that turn micro-ATX motherboards into blades mounted inside a standard case (Figure 7).

Figure 7: Limulus personal workstation (used with permission of Limulus Computing).

The Limulus systems also include built-in internal networking ranging from 1 to 25GigE and can include various storage options, including stateless compute nodes (i.e., no local storage). Warewulf, a computer cluster implementation toolkit, boots the cluster, and Eadline has coupled it with the Slurm scheduler so that if no jobs are in the queue, the compute nodes are turned off. Conversely, with jobs in the queue, compute nodes that are turned off are powered on and used to run jobs.
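To give a flavor of how this kind of power management can work, the following is a minimal sketch of a Slurm SuspendProgram written in Python. It is not the Limulus implementation: it assumes compute nodes that accept passwordless SSH from the head node, a matching ResumeProgram (e.g., wake-on-LAN), and the usual SuspendTime setting in slurm.conf.

#!/usr/bin/env python3
# Minimal sketch of a Slurm SuspendProgram. Slurm invokes it with a
# hostlist expression (e.g., "node[01-04]") naming nodes that have been
# idle longer than SuspendTime in slurm.conf.
import subprocess
import sys

def expand_hostlist(hostlist):
    # Let Slurm expand "node[01-04]" into individual hostnames.
    result = subprocess.run(
        ["scontrol", "show", "hostnames", hostlist],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()

if __name__ == "__main__":
    for node in expand_hostlist(sys.argv[1]):
        # Assumption: the head node has passwordless root SSH to each
        # compute node; a matching ResumeProgram would power the node
        # back on (e.g., with wake-on-LAN or IPMI).
        subprocess.run(["ssh", f"root@{node}", "poweroff"], check=False)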

Limulus comes in different flavors, and an HPC-focused configuration targets individual users, although a larger version is appropriate to support small workgroups. A Hadoop/Spark configuration comes with more drives, primarily spinning disks; a deep learning configuration (edge computing) can include up to two GPUs.

Limulus systems range in size and capability, and you can differentiate them by their cases. The thin case (Figure 7) usually accommodates four motherboards, whereas the large case can accommodate up to eight motherboards (Figure 8). The Limulus motherboards are single-socket boards typically running lower power processors. You can pick from Intel or AMD processors, depending on your needs.

Figure 8: Limulus workgroup workstation (used with permission of Limulus Computing).

Limulus is very power efficient and plugs into a standard wall socket (120V). Depending on the configuration, the system can draw as little as a few hundred watts or up to 1,500W for systems with eight nodes and two GPUs.

A true personal workstation with four motherboards, 24 cores, 64GB of memory, and 48TB of main storage costs just under $5,000. The largest deskside system has 64 cores, 1TB of memory, 64TB of SSD storage, and 140TB of hard drive space and goes for less than $20,000. Compare these two prices to the prices of the three previous desktop supercomputers discussed.

Future

Pontificating on the future is always a difficult task. Most of the time, I would be willing to bet that the prediction will be wrong, and I’d probably win many of those bets (over half). Nonetheless, I’ll not let that prevent me from having fun.

Supercomputing Desktop Demand Will Continue to Grow

One safe prediction is that the need for desktop supercomputers will continue to grow. Regardless of the technology, supercomputing systems grow and end up becoming centralized, shared resources. The argument for this is that it theoretically costs less when everything is consolidated (economy of scale). Plus, it does allow the handful of applications that truly need the massive scale of a TOP500 system to access it.

However, centralized systems take compute resources out of the user’s hands, and the user must operate asynchronously, submitting their job to a centralized resource manager and waiting for the results. All the jobs are queued according to some sort of priority, and the queue ordering is subject to competing forces that are not under the control of the user. A simple, single-node job that takes less than five minutes to run could get pushed down the queue to the point that it takes days to complete, and interactive applications are essentially impossible to run.

Although some applications definitely need very large systems, surprisingly, applications that need only one to four nodes dominate the workload. The ability to run an application when needed is valuable for users, and this is what desktop supercomputers provide: They allow a large group of users to run jobs when they need to (interactively), rather than being stuck in a queue. Depending on the size of user jobs, desktop systems can even serve as the only resource specific users need.

For those applications and users that need extremely large systems, desktop supercomputers are also advantageous because they take over the small node count jobs, moving them off the centralized system. The sum of the resources freed up for the larger jobs could be quite extensive (as always, it depends).

Cloud-Based Personal Clusters

The cloud offers possibilities for desktop supercomputers, albeit in someone else’s data center. The process of configuring a cluster in the cloud is basically the same for all major cloud providers. The first step is to create a head node image. It need only happen once:

  1. Start up a head node Linux virtual machine (VM), probably with a desktop such as Xfce. If needed, this VM can include a GPU for building graphics applications. If so, the GPU tools can be installed into the VM.
  2. The head node should then have cluster tools installed. A good example is OpenFlightHPC, which has environment modules, many HPC tools, a job scheduler, and a way to build and install new applications. You could even install Open OnDemand for a GUI experience.
  3. The head node can also host data and share it with the compute nodes over something like NFS. If the cloud provider has a managed NFS service, then this need not be included with the head node. However, if more performance than NFS is needed, a parallel filesystem such as Lustre or BeeGFS can be configured outside of the head node.
  4. At this point, you can save the head node image for later use.

A user can then start up a head node when needed from this saved image with a simple command, either run at the command line or called from a script. The user then logs in to the head node and submits a job to the scheduler. Tools such as OpenFlightHPC will start up compute nodes as they are needed. For example, if the job needs four compute nodes and none are running, then four compute nodes are started. Once the compute nodes are ready to run applications, the scheduler runs the job(s). If the compute nodes are idle for a period of time, the head node can spin down the compute node instances, saving money.
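As a concrete illustration, the following minimal sketch shows what such a command might look like on AWS with the boto3 library; the image ID, instance type, and key pair are placeholders, and the other major cloud providers offer equivalent APIs.

# Sketch: launch a previously saved head node image on AWS with boto3.
# The image ID, instance type, and key pair below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # saved head node image (placeholder)
    InstanceType="c5.2xlarge",        # placeholder instance type
    KeyName="my-hpc-key",             # placeholder SSH key pair
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "cluster-head-node"}],
    }],
)
print("Head node instance:", response["Instances"][0]["InstanceId"])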

Many of the large cloud providers have the concept of preemptible instances (VMs) that can be taken from you, interrupting your use of that node; this happens when another customer is willing to pay more for the instance. In the world of AWS, these are called Spot Instances. Although the use of Spot Instances for compute nodes can save a great deal of money compared with on-demand instances (the most expensive option), the user must be willing to have their applications interrupted and possibly restarted.
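Continuing the hedged AWS sketch above, requesting a compute node as a Spot Instance is a small variation on the same boto3 call; the image ID and instance type are again placeholders, and the scheduler must be prepared for the node to disappear.

# Sketch: request a compute node as a Spot Instance with boto3. If AWS
# reclaims the capacity, the instance is terminated, so the scheduler
# must be able to requeue any interrupted jobs.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # compute node image (placeholder)
    InstanceType="c5.4xlarge",        # placeholder instance type
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            "SpotInstanceType": "one-time",
            "InstanceInterruptionBehavior": "terminate",
        },
    },
)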

After the job is finished, if no jobs are waiting to be run, the compute nodes are shut down. If the head node is not needed either, it too can be shut down; before doing so, any new data should be saved to the cloud or copied back to the user’s desktop system, or the head node image can be updated.

Of course, additional infrastructure and tools are needed to make this process easy and to ensure that limits are placed on how much money a user can spend on cloud resources.

Summary

Today’s supercomputers are following the same path that past supercomputers followed: toward massive, centralized, shared resources. Users must create job scripts, submit them to a resource manager, and wait for their applications to run. Although some applications certainly can use an entire TOP500 system (or more) for a single run, many applications and a great deal of research do not need that level of compute power and can get by with just a few nodes or with interactive use. Interactivity is becoming extremely important as AI application usage grows. Rather than have these applications sit in a job queue, why not run them on a desktop supercomputer that is controlled by the user?

Desktop supercomputers achieve two things. First, they give individual users more compute power to run applications locally, under their control, whenever they want. They can be used for application development, testing, pre- and postprocessing of data, applications and problems that do not require large core counts or memory capacity, interactive applications that are extremely popular because of Jupyter notebooks, and a myriad of other cases. More power to the user! Second, users can move small node count jobs off the large centralized systems and onto desktop supercomputers, which collectively can free up more time for the very large scale applications that need an entire system or a significant part of it.

Although past attempts at desktop/deskside supercomputers have been made, a combination of factors and events kept them from succeeding. A key reason was the prohibitive cost of these systems, which limited the market. Current desktop supercomputers such as Limulus are less than a quarter the cost of the least expensive past desktop supercomputer. Moreover, Limulus allows you to turn nodes on and off as needed, making it very energy friendly. If a user does not need absolute performance, then an SBC cluster is very affordable and requires very little power. A simple Cluster HAT system can be carried in your bag and uses less than 10W.

Given that the desire for more compute power in the hands of the user is growing exponentially, driven by Jupyter-style notebooks and AI, it is easy to see why desktop supercomputers are becoming more popular.