A Brief History of Supercomputers

This first article of a series looks at the forces that have driven desktop supercomputing, beginning with the history of PC and supercomputing processors through the 1990s into the early 2000s.

Computing originally comprised centralized systems to which you submitted your deck of punched cards constituting your code and data; then, you waited for your output, which was likely printed by a huge dot matrix printer on wide green and beige paper. Everyone took turns submitting their applications and data, which were run one after the other – first in, first out (FIFO). When your code and data were read in, you did not need to be there; however, you often had to stick around to submit your card deck when the system was available.

In the latter 1980s and into the 1990s, you likely had a terminal at which you could enter your programs, which were saved on mass storage devices. Often these were dedicated front-end systems that accommodated the users. Anyone who wanted to use the system logged into this front-end system, created their code and data, and then submitted a “job” to run the application with the data. The job was really a script with information about the resources needed: how many CPUs, how much memory, how to run your application (the command), and so on. If the hardware resources could be found, your program was executed, and the results were returned to your account. Notice that you did not have to be logged in to the system for the job to be executed. The job scheduler did everything for you.

The job scheduler could launch multiple jobs to use as many of the resources as possible. It also kept a list of the next jobs to run. How this list was created was defined by policies, which could be very sophisticated, so that the best use was made of the system resources.

Early high-performance computing (HPC) systems were large centralized resources shared by everyone who wanted to use them. You had to wait in line for the resources you needed. Moreover, the resources were not interactive, making it more difficult to write code. Your applications were run, and the results were returned to you. In other words, your work was at the mercy of what everyone else was running and the resources available. Furthermore, preprocessing or postprocessing your data on these systems was impossible. For researchers, engineers, scientists, and other HPC users, this workflow was a mess: you had spurts of activity and lots of downtime. During college, we used to make basketball goals from punch cards and small basketballs from old printouts and tape and shoot baskets while waiting for our code to execute.

Before and during grad school, I used the centralized university supercomputers, starting in the 1980s with the university’s CDC 6500. Then I moved to the Cyber 205 and ETA 10. Initially, I got almost unlimited processing time because people had not really “discovered” them yet. I learned vector processing on the Cyber 205, which was my first exposure to parallelism (vector parallelism). While I was working on my research, though, the HPC systems were finally discovered by everyone else at the university, so they became heavily used, resulting in the dreaded centralized and tightly controlled resources. Very quickly I lost almost all my time on the systems. What I did to finish my graduate research is a story for another day. Let me just say, thank you Sun workstations.

The point to this long-winded introduction is that researchers increasingly needed HPC resources, especially as applications needed more compute cycles, memory, storage, and interactivity. Although the HPC systems of the time had a great deal of horsepower, they could not meet all of the demands. Moreover, they were very expensive, so they became tightly controlled, centralized, shared resources. This inhibited the growth in computational research – the proverbial 10 pounds of flour in a five-pound bag. However, in this case, it was more like 50 pounds of flour. The large centralized HPC resources were needed for larger scale applications, but what was really needed were smaller HPC systems that were controlled by the researchers themselves. Desktop supercomputing, if you will.

My interest in desktop supercomputers (also called desktop clusters) accelerated with Beowulf clusters, where anything seemed possible. My first desktop cluster was at Lockheed Martin, where I gathered up desktop PCs that were tagged for recycling and made a cluster in my cubicle. It was not exactly on my desktop, but it was close enough. Having a system physically on your desktop, or right beside it, where you can write code, create, and test machine learning (ML) and deep learning (DL) models, do visual pre- and postprocessing, and have all the computational power at your fingertips without having to share it with hundreds or thousands of others is very appealing. Of course, it is not a centralized supercomputer where you can scale to huge numbers of processors or extreme amounts of memory, but you can do a massive amount of computing right there. Plus, you can get your applications ready for the centralized supercomputer on your desktop supercomputer.

From my perspective, four things are driving, and have driven, the case for desktop supercomputers:

  • Commodity processors and networking
  • Open source software
  • Linux
  • Beowulf clusters

This article focuses on a bit of the history of supercomputer processors and PC processors and commodity networking. I think that the rise of commodity processors and networking is a huge contributor to desktop supercomputing. Therefore, it is worth revisiting the history of processors and networking through the 1990s.

Early Supercomputers

Supercomputers in the mid-1980s to early 1990s were dominated by Cray. In 1988, Cray Research introduced the Cray Y-MP, which had up to eight 32-bit vector processors running at 167MHz. It had options for 128, 256, or 512MB of SRAM main memory and was the first supercomputer to sustain greater than 1GFLOPS (10^9 floating point operations per second).

That supercomputer was expensive. A predecessor to the Y-MP, the X-MP, sold to a nuclear research center in West Germany for $11.4 million in 1981, or $32.6 million in 2020 dollars (see The Supermen by Charles J. Murray, Wiley, 1997, p. 174).

Although Cray may have dominated the supercomputing industry coming into the 1990s, they were not alone. NEC had a line of vector supercomputers named SX. The first two NEC models, the SX-1 and SX-2, were launched in 1985. Both systems had up to 256MB of main memory. The SX-2 was reportedly the first supercomputer to exceed 1GFLOPS. It had four sets of high-performance vector operation pipelines with a maximum of 16 arithmetic units capable of multiple/parallel operation. The NEC SX-1 had about half the performance of the SX-2 and was presumably less expensive.

Around this time, a number of massively parallel computers came out, including systems from Thinking Machines (think Jurassic Park), nCUBE, Meiko Scientific, Kendall Square Research (KSR), and MasPar. Some of these companies continued to sell systems into the early 1990s. These systems used a range of ideas and technologies to achieve high performance. Thinking Machines used 65,536 one-bit processors in a hypercube, later adding Weitek 3132 floating-point units (FPUs) and even RAID storage. The final Thinking Machines system, the CM-5E, used Sun SuperSPARC processors. Meiko Scientific Ltd. used transputers and focused on parallel computing. Its systems started with the 32-bit INMOS T414 transputer in 1986; later, the company switched to Sun SuperSPARC and hyperSPARC processors.

Both Thinking Machines and Meiko survived into the 1990s, and other companies such as KSR and MasPar sold systems into the early 1990s. These companies were especially important to the future of HPC because they showed that using large numbers of processors in a distributed architecture could achieve great performance. They also illustrated that getting to this performance level required a great deal of coding, so software took on a much more important role than before.

No longer did you have to rely on faster processors to get better performance by recompiling your code or making a few minor tweaks. Now, you could use lots of simple computing elements combined with lots of software development to achieve great performance. There was more than one path to efficient HPC performance.
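To make the software point concrete, the sketch below shows, in a minimal and purely illustrative form, what this kind of coding looks like on a distributed-memory machine. It uses MPI, which only became a standard in the mid-1990s (the massively parallel machines of the late 1980s used their own message-passing libraries), but the idea is the same: each processing element computes on its own slice of the data, and explicit messages combine the partial results. The MPI calls are standard; the program itself is my own example, not code from any of these systems.

    /* Each PE sums its own slice of 0..999,999; a reduction combines the results. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        long long local = 0, total = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which PE am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many PEs are there? */

        for (long long i = rank; i < 1000000; i += size)
            local += i;                         /* partial sum on this PE */

        /* Combine the partial sums on PE 0. */
        MPI_Reduce(&local, &total, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %lld\n", total);

        MPI_Finalize();
        return 0;
    }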

The Intel-based PC processors of those early years were not as advanced as supercomputer processors. However, supercomputers were being challenged by massively parallel systems, and many times their manufacturers created their own processors, although almost all eventually switched to workstation processors. This switch in processors was driven by cost: developing new, very parallel systems, as well as new processors, is very expensive. As you will see, this had some effect on HPC system trends in the 1990s. For now, PC CPUs were not in the HPC class, and any possible systems built from them were not scalable because PC networking really did not exist.

Supercomputer Processors in the Early 1990s

Cray entered the 1990s with the Cray Y-MP, which had up to eight vector processors running at 167MHz, a much higher clock speed than PC CPUs. However, like the i486, it used 32-bit processors, limiting the addressable memory to 4GB. In the early 1990s, Cray launched some new systems. In 1991, the Cray C90, a development of the Cray Y-MP, was launched. It had a dual-vector pipeline, whereas the Y-MP had only a single pipeline. The clock speed was also increased to 244MHz, resulting in three times the performance of the Y-MP. The maximum number of processors increased from eight to 16. Note that these processors were still designed and used only by Cray.

Cray’s last major new vector processing system, the T90, first shipped in 1995. The processors were an evolution of those in the C90 but with a much higher clock speed of 450MHz. The number of processors also doubled to 32. The system was not inexpensive, with a 32-processor T932 costing $35 million ($59.76 million in 2020 dollars).

Cray launched the T3D in 1994. This was an important system for Cray because it was their first massively parallel supercomputer. It used a 3D Torus network to connect all the processing elements, hence the name T3D. It integrated from 32 to 2,048 processing elements (PEs), where each PE was a 64-bit DEC Alpha 21064 RISC chip. The chip had its own memory area, memory controller, and prefetch queue. The PEs were grouped in pairs or nodes of six chips.

The T3D had distributed memory, each PE with its own memory, but it was all globally addressable with a maximum of 8GB of memory in total. The T3D used a “front-end” system to provide things such as I/O functionality. Examples of front-end systems were the Cray C90 or Y-MP.

The T3D was something of a sea change for Cray. First, it moved away from vector CPUs and focused more on massively parallel systems (i.e., lots of processing elements). Second, it marked a shift from Cray-designed processors to those from another company – in this case, DEC. Whether this was a move to reduce costs or improve performance is known only to Cray, but it appears to be an effort to reduce costs. Third, it moved to a 3D Torus network.

The Cray T3E was a follow-on to the T3D, keeping the massively parallel architecture. Launched in late 1996, it continued to use the 3D Torus network of the T3D but switched to the DEC Alpha 21164 processor. The initial processor speed was 300MHz; later versions used 450, 600, and even 675MHz processors. Similar to the T3D, the T3E could scale from 8 to 2,176 PEs, and each PE had between 64MB and 2GB of memory. The T3D and the T3E were arguably highly successful systems: a 1,480-processor T3E was the first system on the TOP500 to top 1TFLOPS (10^12 FLOPS) running a scientific application.

Cray did not just develop multiprocessor systems with DEC Alpha processors. It continued the development of vector systems based on the Cray Y-MP. Recall that the Y-MP was launched in 1988 with up to eight 32-bit vector processors running at 167MHz. The Cray J90 was developed from the Cray Y-MP EL (entry level) model, the aim being a less expensive, air-cooled version of the Y-MP. The Y-MP EL model supported up to four processors and 32MB of DRAM memory; the J90 supported up to 32 vector processors at 100MHz and up to 4GB of main memory. Each processor comprised two chips: one for the scalar portion of the architecture and one for the vector portion.

NEC was also launching new SX systems in the early 1990s. In 1990, it launched the SX-3, which allowed parallel computing, permitting both SIMD (single instruction, multiple data) and MIMD (multiple instruction, multiple data) operations. It had up to four arithmetic processors, all sharing the same main memory. The NEC SX-4 system was announced in 1994 and first shipped in 1995. It arranged several CPUs into a parallel vector processing node; then, these nodes were installed into a regular symmetric multiprocessing (SMP) arrangement.

Processor Trajectory

Both the Intel 486 and Pentium PC processors had about the same performance as the Sun-3 and SPARCstation-1 workstations that were proving to be so popular. Comparatively, supercomputer processors were still definitely much faster than those in a PC, but supercomputer manufacturers knew that making their own processors was becoming a burden, forcing their prices to stay high.

On the other hand, Intel was improving their processors quickly with faster versions almost every year, and they fit into the same socket, saving customers money. In the span of five years, they came out with three new processors with increasing speeds and sophistication – all of them affordable by millions of people.

In the early part of the 1990s, the pace of PC CPU development was quick, and the quantities of PC processors sold were much larger than those of supercomputer processors and growing rapidly. However, PC processors were still behind in terms of performance. Even as Cray systems moved from custom processors to workstation processors, those workstation processors did not have nearly the sales volume of PC processors.

The latter half of the 1990s was a very hectic time for PC CPUs. Intel launched the Pentium Pro in November 1995, with 5.5 million transistors, a big jump from the 3.3 million transistors in the Pentium. The clock speed started at 150MHz, but subsequent versions rose to 166 and 200MHz. The front-side bus (FSB) speeds ranged from 60 to 66MHz. The Pentium Pro had a large on-package L2 cache for a PC CPU, with the first versions starting at 256KB, then increasing to 512KB and ultimately 1MB.

The Pentium Pro was a superscalar processor with out-of-order execution, which made it more efficient; MMX support came later with the Pentium MMX and Pentium II. The Pentium Pro had a 36-bit address bus that was usable with physical address extension (PAE), allowing access to 64GB of memory. The Pentium Pro could be used in dual- and even quad-socket configurations, creating SMP solutions, with all processors on the board using the same socket type.

Perhaps one of the most important points about the Pentium Pro is that it was the first PC CPU to be used in a supercomputer, ASCI Red. ASCI Red was the fastest supercomputer until late 2000 and was the first supercomputer to reach 1TFLOPS, achieving 1.06TFLOPS in December 1996.

PC processors continued to improve throughout the 1990s, with rapid escalation of processing power and the emergence of AMD as a competitor to Intel. At the close of the decade, the pace of innovation and release of new PC processors was staggering. A processor was released almost every year or 18 months, and sometimes two processors were released in a year. PC processors were steadily increasing in clock speed and L2 cache size. CPUs became superscalar and added SIMD instructions, providing more performance for applications that could use it. However, despite all these developments, PC processors were still 32-bit.
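As a rough, hedged illustration of what those SIMD instructions meant for application code, the short C sketch below uses SSE intrinsics, which compilers expose for exactly this purpose; the function name and the assumptions about array size and alignment are mine, not taken from any particular product of the era. Each instruction operates on four single-precision values at once instead of one.

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Add two float arrays four elements at a time. Assumes n is a multiple of 4
       and the arrays are 16-byte aligned. */
    void add_sse(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(&a[i]);           /* load 4 floats */
            __m128 vb = _mm_load_ps(&b[i]);           /* load 4 floats */
            _mm_store_ps(&c[i], _mm_add_ps(va, vb));  /* 4 additions at once */
        }
    }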

Supercomputer Processors in the Latter 1990s

Recall that Cray launched the T3E in 1996. In February 1996, Cray was acquired by SGI. While SGI owned Cray, only one new Cray model line was released, the Cray SV1 in 1998. It went back to the Cray vector systems using processors made by Cray and was backward compatible with J90 and Y-MP software. Unlike previous Cray processors, though, those in the SV1 included a vector cache. The SV1 ran at 300MHz, but later variants ran at 500MHz. The SV1 node design was something like an SMP system in the fashion of the J90. These nodes could be connected to create a clustered SMP vector system.

NEC launched its SX-5 system in 1998. It could reach 4TFLOPS of performance. Each node of the system used 16 CPUs with up to 128GB of main memory. It could connect up to 32 nodes for a total of 4,096GB of memory.

SGI announced a large-scale HPC system, the SGI Origin 2000, in 1996. It could accommodate from 2 to 128 MIPS CPUs. These were the MIPS R10000 processors that ran at 180MHz initially but later increased to 300 and 400MHz. One advantage SGI had was that the processors weren't just used for supercomputers like the Origin 2000; they were also used for SGI workstations. The SGI O2 workstation was introduced in 1996 and used a single MIPS R10000 processor that ran from 150 to 400MHz.

The classic line of supercomputers – Cray, NEC, and now SGI – made good progress in the 1990s. In some cases, they used more commodity processors such as the DEC Alpha in Cray systems or even the MIPS in the SGI Origin.

Overall, supercomputer processors also had some good clock speed gains during this time. In terms of architecture, Cray and NEC had turned toward distributed parallel systems connected with a high-performance network. The SGI systems were SMP, and both Cray and NEC had parallel SMP systems.

Although these systems were dominant in HPC, they were still awfully expensive, running into the millions. As such, they were very centralized and shared across a wide range of users. Companies and research institutions could only afford one system that was centralized and tightly controlled. Moreover, because the systems were quite costly, only certain applications could run on them. If you will, they were still very much the model of a priesthood, where the lowly user had to ask permission to run an application.

The 1990s were critical to both “classic” supercomputers and PC systems. Both had advanced very quickly, setting a trajectory into the early 2000s.

PC Processors in the Early 2000s

In early 2001, the Intel Pentium III passed 1GHz, eventually hitting a maximum of 1.4GHz with a large 512KB L2 cache. However, it was still a 32-bit processor.

In September 2003, AMD released their all-new Athlon 64 processor, which was the first 64-bit PC CPU. Even better, it was backward compatible with the 32-bit x86 instructions and got away from the old FSB architecture, introducing a point-to-point link that was low latency and high bandwidth: HyperTransport. The processor started with 1GHz clock speed and 800MHz HyperTransport speed.

In addition to the HyperTransport bus architecture change, the Athlon 64 incorporated an on-die memory controller, which meant the memory controller was part of the processor itself. The Athlon 64 also supported several instruction sets, including MMX, SSE, SSE2, SSE3, x86-64, and 3DNow! Just about any application from previous generations that used a specialized instruction set could now run. Importantly, PCs now had a 64-bit processor, just like the supercomputing processors.

The next-generation Intel processor, the Pentium 4, launched in late 2000, started off as a 32-bit processor, but in 2004 Intel released a version of the Pentium 4 (code-named Prescott) that was 64-bit.

In 2005, AMD introduced the Athlon 64 X2 with two separate and complete cores in a single die. It also improved the clock speed up to 3.2GHz and increased the L2 cache up to 1MB per core.

Intel released its first dual-core processor, the Pentium D, in 2005. However, it was still using the FSB architecture. Intel would not have a point-to-point link in its processor line until the Nehalem Clarkdale in 2010. Remember that all these processors could easily be bought, and were bought, by everyday people.

Supercomputer Processors in the Early 2000s

In the early 2000s, SGI released the Origin 3000 with a new version of the MIPS R10000 processor. The R12000 was an improved R10000 and ran up to 360MHz. The R14000 was then introduced into the Origin 3000 as an improved R12000, with up to 500MHz clock speed.

Cray introduced the Cray X1 in 2003. The processor shared the streaming processors and vector caches of the Cray SV1. Each processor ran at up to 800MHz. The system could be configured with up to 4,096 processors in 1,024 shared-memory nodes. Later, in 2005, Cray released the X1E upgrade, which used dual-core processors running at 1,150MHz.

NEC launched the SX-6 supercomputer in 2001. A single node had up to eight vector processors and up to 64GB of memory. You could connect up to 128 nodes in a single system. However, NEC created a special version called the Earth Simulator that had 640 nodes. The Earth Simulator was the fastest supercomputer for a considerable time.

Trajectory

Of course, clock speeds are not the best basis for comparison, but in the absence of any benchmark that lasted for more than 15 years during that period, they will have to serve as a guide. Although absolute clock speed numbers are not important, what is relevant is the growth in clock speeds, as well as the relative values.

From the 1990s through the early 2000s, you can see the trajectory of CPUs. The PC CPUs were very quickly gaining in performance (e.g., adding SIMD instructions). Finally, they became 64-bit in 2003 with the AMD Athlon 64. Clock speeds were also quickly increasing to well over 1GHz and on to 3GHz. Then in 2005, the Athlon 64 X2 introduced multiple cores in a single die, starting at 1.9GHz.

At the same time, supercomputer processors, which were made in much, much smaller quantities, still had lower clock speeds. Cray was using vector processors that could run vectorizable code extremely fast, but even then, the clock speeds barely reached 1GHz in 2005, when PC CPUs had passed 2GHz. The SGI MIPS processors, which were also made for the workstation market, were still under 1GHz when the Origin 3000 launched in 2000.

During the 1990s, the pace of PC CPU development was quickening, with good increases in clock speed and increasing parallelism. The L2 cache was also increasing in capacity over time. Then, in 2003, PC CPUs reached 64-bit with high clock speeds, quickly followed by two cores on a die with clock speeds of 2GHz and greater.

Supercomputers enjoyed a great period of growth in the early 1990s, with better clock speeds, great vectorization, and even additional parallelism across nodes. The early experiments of the late 1980s and early 1990s showed that parallelism from large numbers of processors was possible, although software had challenges trying to take advantage of all that processing.

At the same time, Cray was only making a small number of processors compared with the PC market. Large investments were spread across the development of a small number of processors. However, Cray also used workstation processors, specifically DEC Alpha processors, to reduce costs while still maintaining great performance, as reflected in the popularity of the Cray T3D and T3E systems.

SGI also tried using their MIPS processors in both their workstations and Origin supercomputers to help keep system prices down, making them competitive with Cray.

Overall system performance for these supercomputers was increasingly driven by parallel processing across multiple nodes. The PC processors were very quickly catching up and surpassing supercomputer processors. Tables 1 and 2 show a brief glimpse of the trajectory of PC CPUs and supercomputer processors from the late 1980s into the early 2000s.

Table 1: PC Processor Progression 

Date Processor Highlights
Apr 1989 486DX On-die L1 cache, much better performance than 386; L2 on motherboard
Mar 1992 i486DX2 2:1 clock multiplier, 40/20, 50/25, 66/33 speeds; L2 on motherboard
Mar 1994 i486DX4 3:1 clock multiplier, 75/25, 100/33 speeds; 16KB L1 cache on-die, L2 on motherboard
Mar 1993 Pentium Data bus width doubled to 64 bits, superscalar, FSB of 60-66MHz, clock multiplier of 1; 16–32KiB L1, still external L2 cache
Nov 1995 Pentium Pro 150–200MHz; on-package L2 cache (256KB to 1MB); decoupled, superscalar, 14-stage super-pipelined, out-of-order execution, two integer units
Jan 1997 Pentium MMX SIMD (MMX), 166–200MHz
Apr 1997 AMD K6 Supports MMX, 166–300MHz; L1 cache 32+32KB, L2 on motherboard
May 1997 Pentium II Improved Pentium Pro, first Xeon naming, 233–450MHz
May 1998 AMD K6-2 MMX and 3DNOW! SIMD, 200–570MHz; 64KiB L1 cache
Jun 1998 Pentium II Xeon SIMD; L2 cache from 512KB to 2MB
Feb 1999 Pentium III 9.5 million transistors, 450 and 500MHz clock speeds (600MHz in 1999); new SIMD, SSE, introduced; achieved 1GHz in early 2001; max. clock speed of 1.3GHz
Feb 1999 AMD K6-III 400 and 450MHz initial clock speed, ending at 500MHz; L2 cache of 256KB; Socket 7; MMX and 3DNOW! SIMD instructions
Jun 1999 AMD Athlon 500–700MHz
Nov 2000 Pentium 4 NetBurst architecture (not successful); introduced SSE2 (still used today); code could be fast but needed new code optimizations; eventually reached 3.8GHz
Early 2001 Pentium III ≥1.0GHz
May 2001 Xeon 32-bit; 1.4, 1.5, 1.7GHz
Sep 2001 Xeon 2.0–3.6GHz
Sep 2003 Athlon-64 1.0–3.2GHz
Feb 2005 Pentium 4F 64-bit, 2.8–3.8GHz
May 2005 Pentium D, Smithfield Dual-core, 2.66–3.2GHz
May 2005 Athlon 64 X2 Dual-core, 1.9–3.2GHz
Dec 2006 Xeon Clovertown Quad-core, 1.86–2.66GHz
Jan 2010 Nehalem Dual-core; 32+32KB L1, 256KB L2, 3MB L3; 2.8GHz, two threads per core

Table 2: Supercomputer Processor Progression 

Date Processor Highlights
1985 NEC SX-1, SX-2 SX-2: four sets of high-performance vector operation pipelines with up to a maximum of 16 arithmetic units, capable of multiple/parallel operation
1988 Cray Y-MP Eight 32-bit vector processors, 167MHz, SRAM main memory, single-vector pipeline
1990 NEC SX-3 SIMD, MIMD, four arithmetic processors, up to four sharing the same main memory
1991 Cray C90 Dual-vector pipeline, 244MHz, three times Y-MP performance
1994 Cray T3D DEC Alpha 21064 processors, 3D Torus, 64-bit
1994 Cray J90 Up to 32 vector processors, 100MHz, 4GB of memory
1994 NEC SX-4 First shipped in 1995, several CPUs arranged into a parallel vector processing node; then, those nodes were installed into a regular SMP arrangement
1995 Cray T90 Evolution of C90, 450MHz processors; 32-processor T932 cost $35 million ($59.76 million in 2020 dollars)
1996 Cray T3E DEC Alpha 21164 processor, 300MHz, future processors: 450, 600, and even 675MHz; can scale from 8 to 2,176 PEs, each PE with 64MB to 2GB of memory
1996 SGI Origin 2000 R10000 MIPS processor, 180 to 300 and 400MHz
1998 Cray SV1 Vector cache, 300MHz, later ran at 500MHz
1998 NEC SX-5 4TFLOPS, each node used 16 CPUs, up to 128GB memory
2001 NEC SX-6 Single node, up to eight vector processors, up to 64GB of memory, connect up to 128 nodes in a single system; became Earth Simulator
2003 Cray X1 NUMA, vector, 800MHz, eight-wide vector; air-cooled, up to 64 processors; liquid-cooled, 4,096 processors; 1,024 SMP nodes in 2D Torus; coded with Parallel Virtual Machine (PVM) and Message Passing Interface (MPI)
2004 SGI Origin 3000 R12000 MIPS processor, up to 360MHz; later R14000 up to 500MHz
2005 Cray X1E Dual-core, 1,150MHz

Commodity Networking

A critical aspect of making distributed computers work together is networking. When PCs were still in their infancy, specialized networks were awfully expensive and sometimes a little fragile. They were used for critical information transmission in industries such as telecom, finance, and government. Supercomputers through the 1990s used some of this specialized networking to achieve what was then high bandwidth and low latency.

For PCs, networking had to match PC pricing. You could not have a $500 to $2,000 PC with a $10,000 networking interface. The specialized networks did not match the low-cost expectation. PCs had to wait for cheaper networking to be developed. This came from Ethernet.

Ethernet was developed around 1973 and 1974 at Xerox PARC, as were so many innovative technologies. Initially, Ethernet ran at 2.94Mbps and was used in several server applications, but not with PCs. In 1980, the Ethernet specification was upgraded to a 10Mbps protocol. Version 2 of the specification, known as Ethernet II, was published in November 1982. By the end of the 1980s, Ethernet had become the dominant network technology overall.

In the early 1980s, Ethernet used 10BASE5 thick coaxial cable (recall the “vampire taps”?), which later changed to the thinner 10BASE2 coaxial cabling many should remember. Then the world moved on to 10BASE-T, which used twisted-pair cables, the type still used today for common networking.

With 10BASE2 coaxial cabling, the use of Ethernet started to grow beyond supercomputers and specialized networks, bringing down the prices of Ethernet hardware, including switches and routers, which drove even more usage, and so on.

Around 1995, the next generation of Ethernet, Fast Ethernet, was introduced. This was probably the start of true commodity networking, with a performance of 100Mbps, 10 times faster than the previous generation. It was a quantum leap in performance, with prices dropping rapidly to the point where it became ubiquitous. The low prices allowed Fast Ethernet network interface cards (NICs), Ethernet switches, and Ethernet routers to be put into homes.

The first cluster I helped bring into Lockheed Martin used Fast Ethernet as the cluster interconnect. For our computational fluid dynamics applications, Fast Ethernet allowed code to scale very well, to the point of running a single application across the entire cluster. Granted, it was only 64 nodes with dual processors at that time, but the price and performance were revelations to us.

Gigabit Ethernet, commonly referred to as “GigE,” runs at 1,000Mbps and was introduced in 1999. It could still use the twisted-pair cabling of 10BASE-T networks, keeping prices low, and delivered another 10-times jump in performance for commodity networking. GigE is still going strong for small HPC systems and in homes.

Commodity networking, starting with Fast Ethernet, came about around the same time as commodity processors (PC CPUs). In a definite sense, they fed off each other. As networking got less expensive, it became cost effective to buy more PCs and add more capability, pushing PC prices down. As processors got less expensive, more systems were purchased, which in turn required more networking, driving networking costs down further.

Summary

Supercomputer processors entering the 1990s were the king of the hill. Vector processors were primarily used, and they were far faster than PC processors. During the 1990s two “branches” of supercomputing processors developed. One branch stayed on vector processors and the other used workstation processors. Vector processor instruction sets operate on one-dimensional arrays, or vectors. They were widely used in supercomputers because they proved very fast on code that could be vectorized. Coming into the 1990s, each supercomputer company made their own vector processors and compilers.
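As a minimal sketch of what “code that could be vectorized” looks like, consider the classic SAXPY loop below, written here in C (the function and variable names are purely illustrative). Because each iteration is independent, a vectorizing compiler can map the loop onto vector load, multiply-add, and store instructions that process whole chunks of the arrays at once.

    /* SAXPY: y = a*x + y over one-dimensional arrays. No loop-carried
       dependence, so the compiler can issue the work as vector instructions. */
    void saxpy(long n, float a, const float *x, float *y)
    {
        for (long i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }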

During the 1990s these vector processors got faster and gained more performance through the addition of larger and more complex vector pipelines. Moreover, their clock speeds were higher than those of PCs by a large margin. They also went from 32-bit to 64-bit during the 1990s, culminating in the vector processor of the Cray X1 in 2003. Although it was a vector processor, it also had non-uniform memory access (NUMA) capability. In 2005, the X1E upgraded it to a dual-core processor running at 1,150MHz. Again, all these vector processors were built by each supercomputer company for their specific systems. Therefore, the extreme costs of designing, testing, and manufacturing these processors were spread over only the systems sold by that company. As a result, the processor cost was remarkably high, especially compared with x86 processors.

The second branch of supercomputer processors came primarily from workstation processors. A perfect example is the Cray T3E, which used DEC Alpha 21164 processors running at 300MHz and up. The goal in using these processors was to reduce cost: because there were many workstations, processor development costs were spread across more systems. Additionally, workstation processors had better performance at that time than PC processors.

The use of workstation processors helped reduce the costs of supercomputers, although they were still costly enough that they remained centralized resources. Contrast this with PC processors, which sold in the millions, allowing development costs, which were not too different from those of supercomputer processors, to be distributed across possibly hundreds of millions of processors. Supercomputers could, at best, spread the same costs across hundreds or thousands of processors.

Commodity networking, specifically Ethernet, was not initially driven by PCs. The initial push came from connecting research centers, government, and military sites. Connecting universities came next, followed by financial institutions and telcos, both primarily at the corporate level, which helped push down the costs to something a company could afford.

This growth in networking kept reducing the price to the point that it became cost effective for high-end PCs and PC-based workstations. At that point, hundreds of millions of PCs started driving down Ethernet costs extremely quickly. More PCs were bought because they were cost effective, especially in the corporate world. These PCs needed networking, which helped drive down the cost of networking; as networking costs came down, people could afford to network, and buy, more PCs, which drove down the cost of PC processors.

To summarize:

  • PC processor development costs could be spread across many more customers than supercomputer processors (think hundreds of millions of machines versus a few thousand). PC processors were very inexpensive compared with supercomputer processors.
  • The commodity-priced PC CPUs started adding new features, faster clock speeds, and more parallelism through the 1990s, primarily because development costs could be spread across such a huge number of systems. Supercomputer processors did not have this economy of scale, so prices remained very high.
  • Eventually, in the early 2000s, PC CPUs had features roughly equivalent to supercomputer processors; in some cases, they were faster.
  • Commodity networking grew very quickly in the 1990s, driving down prices so that individuals could use Ethernet to connect their PCs. Into the early 2000s, this meant relatively fast and low-latency networking was available for PCs.

In the next part of this series, I will continue to explore the factors that led to the development of modern-day HPC systems.