SC25 Highlights
I want to begin by saying that I have read some summaries of Supercomputing 2025 (SC25) that said it was much smaller than SC24, both in terms of the number of people and the number of exhibitors. Because I write articles about high-performance computing (HPC), I got to use the Press Room, so I saw the daily stats. This year the attendance was about 16,000 – smaller than SC24 in Atlanta, which had almost 18,000 people; however, SC24 had around 500 exhibitors, whereas SC25 had 560.
I’m not sure why about 10 percent fewer attendees and 10 percent more exhibitors adds up to “much smaller.” The show floor was massive, spilling over into the St. Louis indoor football arena (I’m sure it’s used for other things). I admit I got lost a couple of times and had to pull out my map to orient myself. Was it noticeably smaller than SC24? I don’t think so.
Shoutout to Cristin Merritt, the Communications Chair. The show was very well run, and she brought in goodies from the UK. I lived in the UK a long time ago, and these really brought back fond memories of being in school and going to the nearby village to get sugar-based snacks after lunch. I have no idea how she got all of that to St. Louis in her luggage, but it was amazing. Thank you, Cristin.
People often look for a single theme at conferences, and I’m no exception. However, the other writers I spoke with all agreed that SC25 really had no one theme. For the purpose of reviewing SC25, I settled on two themes: the split of GPUs into AI and HPC processors and the use of artificial intelligence (AI) in HPC. Two weaker semi-themes at SC25 were quantum computing going mainstream and power and cooling. As always, though, you want to see what’s new from the various vendors and what surprises are announced or shown.
TOP500 – Meh or Interesting?
Many people get excited about the TOP500 announcement, but the 2025 announcement seemed a bit anticlimactic. The European exascale machine was exciting to some people, including me, but overall, the list was kind of “meh” to many, who wanted to talk about who isn’t on it (not including China, of course). That conversation turned into a discussion of why the giant AI systems around the world aren’t present.
One person postulated the reason is that they are too busy to stop for a TOP500 run. Because a proper High Performance Linpack (HPL) run for the TOP500 takes some time, this suggestion makes sense. The run involves testing subsets of nodes to find lower-performing nodes to exclude or fix, as well as finding a “sweet spot” in the input parameters. In general, the bigger the system, the longer this process takes. Why stop training a model for a week or more to run the benchmark? These models make money for companies.
Moreover, AI systems are focused on AI and not on the TOP500. AMD announced that the next generation of their processors/accelerators will have two versions: the AMD MI455X GPU for AI and the AMD MI430X processor for double-precision floating point (float64 or FP64)-oriented code. Data centers with the MI430X would more likely run HPL for the TOP500, which doesn’t mean the MI455X can’t run HPL. The MI455X data centers might choose not to run it, either because the chip is designed for something else, so it might not perform as well as the MI430X, or because the users of the MI455X have no interest in the HPL benchmark.
Note that you are now seeing the same GPU product line bifurcate, with one GPU focused on AI and another on HPC. The AI market has gotten big enough that it can demand its own accelerator, and the GPU manufacturers can make money from them. Equally interesting is that the HPC market remains large enough for targeted GPUs – and the manufacturers can still make money from them.
AI in HPC
Someone pointed out that instead of ignoring or criticizing AI, HPC is looking at AI as a new technology that can be used to help solve problems. (Apologies to the person who mentioned this – I can’t remember to whom I should give attribution, but thank you.) One way AI can help HPC is the use of multiprecision (i.e., a mixture of precisions used in the same code to solve a problem as efficiently as possible).
Multiprecision
Multiprecision can reap great benefits: using lower precision reduces the amount of data to be stored, which in turn can improve performance. Who wouldn’t like this one-two punch of less data to store and faster computation? However, you must be sure your code produces the “correct” answer with these changes. A key aspect is that you don’t have to get the same answer as higher precision, but you must get an answer that satisfies the original problem and has an acceptable error.
For example, if the answer to a problem is supposed to be 10.84591045, is 10.84510439 acceptable? They are different answers, but is the second one correct? That decision needs to be made by the user or developer on the basis of defined criteria.
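To make the acceptance decision concrete, here is a tiny Python check of the two numbers above against a relative error tolerance. The tolerance value is a made-up example of the kind of criterion a user or developer would set, not anything specified at the show:

reference = 10.84591045  # answer from the higher precision run
candidate = 10.84510439  # answer from the mixed-precision run

# Relative error is the usual yardstick for "close enough"
rel_err = abs(candidate - reference) / abs(reference)

tol = 1e-4  # hypothetical tolerance chosen by the user or developer
print(f"relative error: {rel_err:.2e}")  # about 7.43e-05
print("acceptable" if rel_err <= tol else "not acceptable")

With this tolerance the mixed-precision answer passes; tighten tol to 1e-5 and it fails. The point is that the threshold is a modeling decision, not a universal constant.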
In a general sense, AI-oriented GPUs use less FP64 and more of other precisions (e.g., single-precision floating point (float32 or FP32), half-precision floating point (float16 or FP16), 16-bit brain floating point (bfloat16 or BF16), etc.) to compute their answers. AI doesn’t necessarily need lots of FP64 computing power, but it can use lots of lower precision for computations. The Ozaki scheme allows you to use these lower precision compute units to achieve higher precision results, including FP64. You can even use 8-bit integer (INT8) tensor cores for double-precision general matrix multiplication (DGEMM) computations that surpass the performance of native FP64 units, and there are discussions about more emulations of DGEMM functions.
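To make the emulation idea less abstract, here is a minimal NumPy sketch of the splitting trick that schemes like Ozaki’s build on. It is a toy illustration of my own, not the actual algorithm: each FP64 matrix is split into high and low FP32 slices, each slice product is computed by a stand-in for a low-precision unit with a wide accumulator (the way tensor cores accumulate FP16 or INT8 products in wider formats), and the partial results are summed in FP64. The real scheme uses more slices and error-free transformations:

import numpy as np

rng = np.random.default_rng(42)
n = 256
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
C_ref = A @ B  # full FP64 reference result

# Naive approach: cast everything down to FP32 and multiply
C_fp32 = (A.astype(np.float32) @ B.astype(np.float32)).astype(np.float64)

# Split each FP64 matrix into high + low FP32 slices
A_hi = A.astype(np.float32)
A_lo = (A - A_hi).astype(np.float32)
B_hi = B.astype(np.float32)
B_lo = (B - B_hi).astype(np.float32)

# Stand-in for a low-precision GEMM unit with a wide accumulator:
# FP32 inputs, with products and sums carried in FP64
def gemm_lp(X, Y):
    return X.astype(np.float64) @ Y.astype(np.float64)

C_split = (gemm_lp(A_hi, B_hi) + gemm_lp(A_hi, B_lo)
           + gemm_lp(A_lo, B_hi) + gemm_lp(A_lo, B_lo))

def rel_err(C):
    return np.linalg.norm(C - C_ref) / np.linalg.norm(C_ref)

print(f"plain FP32 error: {rel_err(C_fp32):.1e}")  # on the order of 1e-6
print(f"split FP32 error: {rel_err(C_split):.1e}")  # orders of magnitude smaller

On real hardware, the slice products run on the fast low-precision tensor units, which is where the speedup over native FP64 comes from.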
A great example of an algorithm that has been adapted to use lower precision units is HPL-MxP (a mixed-precision HPL benchmark), and similar work is underway on other common algorithms, such as fast Fourier transforms (FFTs) and eigenvalue computations.
AI for Surrogate Modeling
The use of AI, particularly large language models (LLMs), for surrogate models is a growing area of research. SC25 had several papers on the topic, along with posters and lots of discussion. The use of surrogates is not a new concept. In my field of aerospace multidisciplinary design optimization (MDO), surrogate models were a hot topic in the 1990s. The concept is: Rather than run a great number of full-fidelity simulations requiring enormous amounts of computational power and time, you run a select subset of the simulations and then create some sort of model that fits these points. You then use the model to find solutions for specific problems.
For example, if you are interested in MDO solutions for aircraft, you would run a set of high-fidelity simulations at points chosen with design-of-experiments techniques such as Latin hypercube sampling. Next, you create a response surface from these data points – with polynomial fits or Kriging methods, for example – and use it as a substitute for a real set of simulations, but with a huge speedup.
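A minimal sketch of that classic workflow in Python with SciPy might look like the following. The “simulation” here is a made-up stand-in function, and radial basis function (RBF) interpolation stands in for the Kriging or polynomial response surface; the shape of the process is the same either way:

import numpy as np
from scipy.stats import qmc
from scipy.interpolate import RBFInterpolator

# Hypothetical stand-in for an expensive high-fidelity simulation
# over two normalized design variables
def expensive_sim(x):
    return np.sin(3 * x[:, 0]) * np.cos(2 * x[:, 1]) + 0.1 * x[:, 0]**2

# Latin hypercube sample of the design space [0, 1]^2
sampler = qmc.LatinHypercube(d=2, seed=0)
X_train = sampler.random(n=40)
y_train = expensive_sim(X_train)  # the only "full simulations" you pay for

# Fit a cheap surrogate to the sampled points
surrogate = RBFInterpolator(X_train, y_train)

# Query the surrogate instead of rerunning the solver
X_new = np.array([[0.25, 0.60], [0.80, 0.10]])
print("surrogate:", surrogate(X_new))
print("truth:    ", expensive_sim(X_new))

The expensive solver runs only at the sample points; everything after that is cheap.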
With LLMs, the approach is the same, but the response surface is replaced with an LLM. Once trained, you run inference to get the simulation result expected at that data point. More recently, such a model has come to be called a digital twin. Examples of this approach were presented at SC25 by NVIDIA.
An interesting prospect for a developer is to push the model to produce results outside the range of parameters it was trained on. I recently read a short blurb on LinkedIn where someone claimed you couldn’t go past the defined ranges. (I didn’t bookmark the article – my apologies.) You certainly can go past these limits, but your risk is likely to be greater. You might get answers that aren’t correct or are nonsensical, but then again, you can get those types of answers even when your variable values are inside the parameter range. In fact, you might find something new, interesting, and innovative by going outside the parameter ranges. However, be sure to check these models to make sure they still satisfy reality, including by running full simulations.
Quantum Computing, Power, and Cooling
Before SC25, quantum computing seemed more like research than a product, despite the claims of vendors. At one SC in particular, quantum computing vendors were collected in one area of the exhibit floor. SC25 seemed to have more quantum vendors, and they were integrated throughout the exhibit floor just like any other technology.
I attended a press briefing where experts projected that in three to five years, quantum computing will be mainstream, and users will be able to schedule jobs on quantum computers just like any other node in the system. If this prediction holds, or comes close, quantum computing becoming just another tool for solving problems would be significant.
Power and Cooling
I think a semi-theme around power and cooling started the previous year at SC24, with a big jump in power and cooling vendors on the exhibit floor. This year the number of vendors in this area seemed to jump again. Two large booths on the show floor belonged to Legrand, a French company that produces energy-saving technologies, and Mitsubishi Heavy Industries.
Like it or not, power and cooling are now a fundamental part of system design. Before, you had CPUs, memory, networking, and storage. Then, accelerators, primarily GPUs, were added to the mix. Now, as CPUs and GPUs draw more power and as the number of GPUs in a system increases, power and cooling have been added to the design variables. Planning how you will power a system, or how you will build a system within a power limit, is just as important as the network design.
Power sources are increasingly becoming a big question. Data centers are putting up wind turbines and solar arrays as fast as they can be built, but it is still not enough. Data center builders are turning to nuclear energy, even spending the money to reopen previously closed nuclear power plants. Small modular reactors (SMRs) are being developed that take much less time to install and are mostly self-contained, and there is even talk of repurposing decommissioned reactors from navy submarines and ships for power generation.
Other sources of power, such as large fossil fuel-powered generators, are also being used for data centers. Some companies are developing, and now offering, jet engines to provide power; aircraft manufacturer Boom Supersonic, for example, is developing natural gas turbines for power generation. Jet engines spend a great deal of their service life running at full power, so they are designed for exactly these operating conditions.
Equal in importance to power is cooling. Do you have enough air-cooling capacity for your system design, or do you need to add more? Does cooling become a limiting factor for system design? Many system manufacturers are turning to liquid cooling for their nodes. You must plan for liquid cooling just as carefully as for air cooling, but with the addition of monitoring the system for leaks.
Companies are looking at underwater data centers for cooling and even data centers in space. Both show promise as well as challenges. Perhaps more conventionally, companies are also looking to use waste heat from data centers for various tasks, including power generation.
What Didn’t Go So Well
I won’t discuss the choice of St. Louis for SC25. Enough people have done this, and I can’t really add anything to the conversation. However, I will mention that I tried for three days to get ribs from the Sugarfire Smoke House without any success. I tried going as early as 10:00am before it got busy – and no ribs. I really liked their other food, but I read that the ribs were amazing. Will the rib hoarders please step forward?
Moving away from food, one thing I did want to bring up is the birds of a feather (BoF) sessions that were open to all attendees. I love BoFs. I learn so much more in these sessions, and in talking to the speakers or organizers, than I do going to paper presentations. However, in my humble opinion, the SC organizing committee does not provide enough time slots for BoFs, so they all heavily overlap. I’m forced to pick between three or four topics going on at the same time. Perhaps the committee could add a few more time slots, even overlapping BoFs with paper sessions, or record the BoFs and post the slides and recordings.
My last topic is a pet peeve. I talked to a good friend on the show floor, and they reminded me of a characteristic of SCs. A great many people who go to SC love to play the “Stump the Chump” game. That is, they just love to argue with someone to prove they are right about something. Typically, this “something” is a minor point that no one cares about, something that will be obsolete in a year or two, or a personal preference. (What color do you like? You had better say yellow, or you are wrong for so many reasons.) I’ve been going to SC conferences for well over 20 years, and I have always seen these people wandering the exhibit floor, or I’ve overheard their arguments in the hallways. This happens in other disciplines as well, but I fail to understand any reason to play Stump the Chump. Why not just talk to people, find out their positions on certain topics, and ask them to explain their thinking? If you don’t agree, perhaps offer to continue the conversation after the conference. You can also work together on common topics, even if you don’t agree. This need to prove that you know more than someone else, or that you are correct and they are wrong, is baffling and emotionally draining, to the point where I wonder about attending future SC conferences.
Next year’s SC26 is in Chicago. Looks like we’re moving back to cold climates. I’m not a fan of cold locations. I have fallen three times at the Salt Lake City shows. Plus, I now have to pack several sweaters, along with coats and boots. I look like a Kardashian going on an overnight with a couple of checked bags and a carry-on (although most of them have their own jets). On the other hand, I really like Chicago, so I’m looking forward to next year. I just hope enough people bring back aerosols so we can get a noticeable warm-up over the Chicago area.