The Cloud’s Role in HPC
I’m not a big Bob Dylan fan, but I think some of his lyrics are interesting. The one that applies to HPC is “The Times They Are a-Changin’.” Some of this change is a move to the cloud. I don’t think the shift to cloud computing for some HPC applications is happening because a CIO or director of research computing watched a cloud computing commercial and thought it sounded really cool. Rather, I think HPC has existing workloads that fit well into cloud computing and can actually save money over traditional solutions. Perhaps more importantly, I think HPC has evolved to include non-traditional workloads and is adapting to meet those workloads, in many cases using cloud computing to do so. Let me explain by giving two examples.
1. Massively Concurrent Runs
I know an HPC center with users that periodically submit 25,000 to 30,000 jobs as part of a parameter sweep. That is, they run the same application with 25,000 to 30,000 different data sets. Many times the application is a Matlab script, a Python script, an R script, a Perl script, or something that is very serial (i.e., it runs well on a single core). The same script is run but with thousands of different input files, resulting in the need to run thousands of jobs at the same time. Many times these applications run fairly quickly – perhaps a couple of minutes – and many times they do not produce a great deal of data.
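As a rough illustration of this pattern, the following minimal Python sketch runs the same serial function over many independent inputs. The `run_case` function and its inputs are hypothetical stand-ins for the real script and data sets, and the thread pool merely stands in for a cluster scheduler; on a real system, each case would typically be a separate single-core job.

```python
from concurrent.futures import ThreadPoolExecutor


def run_case(params):
    """Hypothetical serial application: one short, independent run per input."""
    a, b = params
    return {"a": a, "b": b, "score": a * a + b}


def parameter_sweep(cases, max_workers=8):
    """Run the same function over every input; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_case, cases))


# A small grid standing in for the 25,000 to 30,000 input data sets.
cases = [(a, b) for a in range(5) for b in range(4)]
results = parameter_sweep(cases)

# Post-processing happens only after every result is available.
best = max(results, key=lambda r: r["score"])
print(len(results), best)
```

The point of the sketch is that no case depends on any other, so the only thing limiting how many run at once is the number of available cores.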
A very closely related set of researchers are doing OS and security research for which they run different simulations with different inputs. For example, they might run 20,000 instances of an OS, primarily a kernel, and explore exploits against that OS. As with the previous set of researchers, the goal is to run a huge number of simulations as quickly as possible to find new ideas about how to protect an OS and kernel. The run times are not very long but they must run the tests against a single OS. Consequently, they run thousands of jobs at the same time and then look through the results and continue with their research.
What is important to both sets of researchers is that all of the jobs run at nearly the same time, so they can examine the results and then either focus on a small subset of the data sets with more granular inputs or try yet more data sets (broadening the search space). Whether they want more detail or more options, the result is the need for more cores. This same HPC center has users asking for 50,000 and 100,000 cores for running their applications. The coin of the realm for these researchers is core count and not per-core performance.
Another interesting aspect of these researchers is that they don’t run these massive job sets all of the time. They create the input data sets and create the job arrays, and then run the job array. Once the jobs are done, however, it takes time to process the output to understand it and to determine the next step. What is important to these researchers is to have all of the results before doing this post-processing. If this doesn’t happen, the researcher has to wait days for the jobs to finish before post-processing the data, so you can see why it’s important to have all of the jobs run at the same time. Getting more efficiency from the hardware is not the problem because faster hardware will only improve the research time a little bit. Reducing the run time from 120 seconds to 100 seconds wouldn’t really improve their research productivity. What improves their productivity is to have all of the jobs run at the same time.
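A back-of-the-envelope model makes the argument concrete. The sketch below assumes every job takes the same time, uses one core, and runs in "waves" with no queue wait between them; the 1,000-core allocation is a hypothetical figure chosen only for comparison.

```python
import math


def sweep_wall_time(n_jobs, cores, seconds_per_job):
    """Wall-clock seconds for n_jobs single-core jobs run in waves on `cores` cores."""
    waves = math.ceil(n_jobs / cores)
    return waves * seconds_per_job


n_jobs = 25_000

# Hypothetical 1,000-core allocation, before and after faster hardware.
slow_hw = sweep_wall_time(n_jobs, 1_000, 120)  # 25 waves of 120 s
fast_hw = sweep_wall_time(n_jobs, 1_000, 100)  # 25 waves of 100 s

# One core per job: a single wave, even at the slower per-job run time.
all_at_once = sweep_wall_time(n_jobs, 25_000, 120)

print(slow_hw, fast_hw, all_at_once)  # 3000 2500 120
```

Under these assumptions, faster cores trim the sweep from 3,000 to 2,500 seconds, whereas running everything at once brings it down to 120 seconds. Real queue waits would widen the gap further.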
I originally thought this scenario was confined to my experience with a particular HPC center, but I was wrong. I’ve spoken to several other people, and they all have similar workload characteristics with varying sizes (several hundred to 50,000 cores). Although this might not describe your particular workload, a number of centers fit this scenario, and this number is growing rapidly.
2. Web Services
Another popular scenario in HPC centers that I’ve seen and heard about is the increasing need for hosting servers for classes or training, for websites (internal and external), and for other general research-related computing in which the applications are not parallel or might not even be “scientific.” I heard one person refer to this as “Ash and Trash computing,” probably because it refers to running non-traditional HPC workloads; however, it’s becoming fairly common.
Consider an HPC center with training courses or classes that need access to a number of systems. A simple example is a class in parallel computing with 30 students. These students might need many cores per person for their course, and although they won’t be pushing the performance of the systems, the data center will still need a number of systems for the class. If they need 20 cores per student, that’s 600 cores just for this single course.
The need for dedicated web servers for research is also increasing. The websites they host go beyond the classic personal websites. Researchers want, and need, to put their research on a website that allows them to share results, interact with other researchers, and show off their research. An increasing number of web-based research tools are available, such as nanoHUB and Galaxy. I know of one HPC center that has close to 20 Galaxy servers, each tuned to a specific research project.
HPC centers are discovering that it makes much more sense to handle these non-traditional workloads themselves. The reasons are varied, but in general, HPC centers understand research better than the departments that worry about mail servers, databases, and ERP applications. These enterprise computing functions are critical to the overall center, but research and HPC require a different kind of service. Moreover, HPC centers can react much more rapidly to requests than the enterprise IT department.
Time for a Change
HPC is being asked to adapt to new roles based on research needs. These needs include applications that need a tremendous number of cores but don’t need a great deal of performance, as well as applications such as web servers, classroom and training support, and web-based applications and tools that are not traditional HPC applications. These workloads fit into the HPC world much better than they fit into the enterprise world.
These changes are everywhere. They may not be a large force, and they might not be as pervasive in your particular HPC center, but they are happening and they are growing more rapidly than traditional workloads. Consequently, I’ve started to refer to this new generation of computing as Research Computing. If you like, research computing is a superset of traditional HPC, or traditional HPC is a subset of research computing. I also like to think of research computing as adding components, techniques, and technology for solving problems that traditional HPC cannot or might not solve. One of these technologies is cloud computing.
Cloud Computing for Research Computing
I freely admit that I scoffed at the use of cloud computing to solve traditional HPC problems when it started coming into vogue. The idea of taking perfectly good hardware and layering virtualization on top of it with a new set of tools and APIs – in a data center that I didn’t control or have access to – seemed abhorrent. As I saw HPC morphing into research computing, though, I began to realize that cloud computing is a tool for solving problems that I could not easily solve before.
The first example – in which the users need a massive number of cores and need them to run at the same time – can be solved by classic HPC systems with a large number of cores, a reasonably fast network for data traffic (10GigE?), and the associated clustering software. Job arrays can be written to start and schedule 25,000 jobs. With a general-purpose cluster, however, jobs that have to run at the same time will likely have to wait for a large number of nodes to finish their current jobs before enough cores are free to launch the queued jobs. For potentially long periods of time, nodes that have already been freed will sit idle, wasting CPU time while the scheduler waits for the remaining cores to become available. Couple this with the short period of time users actually need the cores, and you have even more wasted CPU cycles. Many HPC centers have struggled with this inefficient scenario.
Perhaps a more efficient way of providing resources for these researchers and their workload is to take a standard server, virtualize it, and oversubscribe the server, providing more virtual machines (VMs) than the server has physical cores. For example, a four-socket server that has 16 cores per socket has a total of 64 cores, and you can run perhaps 128 or 256 VMs on the system as long as each VM has enough memory to run the application. Remember that for this type of workload, performance is not the most important metric. If, by virtualizing the server, you lose 10% to 30% in performance, you are not really negatively affecting the research. In fact, you might be enhancing the research because you can easily provide enough resources in a short period of time without wasting CPU cycles waiting for the physical cores to be available. Moreover, virtualizing the server allows a much smaller number of physical servers to be purchased to meet the needs of these workloads.
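To see how oversubscription shrinks the server count, here is a small sizing sketch. It assumes the 64-core server described above and a target of 25,000 simultaneous VMs (the low end of the parameter sweep in the first example), and it ignores memory limits; the function names are illustrative only.

```python
import math


def vms_per_server(sockets, cores_per_socket, oversubscription):
    """VMs one server can host at a given VM-to-physical-core ratio."""
    return sockets * cores_per_socket * oversubscription


def servers_needed(target_vms, sockets, cores_per_socket, oversubscription):
    """Whole servers required to host target_vms single-core VMs."""
    per_server = vms_per_server(sockets, cores_per_socket, oversubscription)
    return math.ceil(target_vms / per_server)


# Ratios 1:1 (no oversubscription), 2:1 (128 VMs), and 4:1 (256 VMs).
for ratio in (1, 2, 4):
    print(ratio, vms_per_server(4, 16, ratio), servers_needed(25_000, 4, 16, ratio))
```

With these inputs, the server count drops from 391 without oversubscription to 196 at 2:1 and 98 at 4:1.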
Another possibility is to push these workloads into the cloud. Because performance is not the most important concern, cloud computing resources could be very appropriate. The researchers don’t need a large number of cores all of the time in this scenario, so buying dedicated hardware, even if it is virtualized, might not be the most efficient use of resources. Also, don’t forget that many of these workloads don’t do a great deal of I/O, so data movement to and from the cloud could have very little effect on performance. Running these applications in the cloud (e.g., Amazon or Google) might be a much more cost-effective approach than providing local resources, even if they are virtualized.
I’ll look at a simple example using Cycle Computing. Recently, Cycle Computing announced that they started up 10,600 VM instances inside Amazon EC2. It took two hours to configure the instances and nine hours to run (a total of 11 hours) and cost US$ 4,362. This is $0.4115 per instance for 11 hours, or $0.037 per instance per hour.
For the researcher that needs to run 25,000 instances at the same time, on the basis of Cycle’s experience, I’ll assume it takes two hours to start up these instances. I’ll also assume it takes 15 minutes to run the application if all jobs start at the same time (0.25 hours). The total time is 2.25 hours for 25,000 instances. At a price of $0.037 per instance per hour, the resulting total cost is US$ 2,081.25. Now assume the researchers do this three times a week for an entire year (a total of 156 runs). The total for the year is then US$ 324,675. At first blush, this seems like enough money to buy your own on-premises system using over-subscribed virtualized machines. Or is it?
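The arithmetic in the last two paragraphs can be reproduced with a few lines of Python. This sketch uses only the figures quoted above and, like the text, rounds the hourly rate to US$ 0.037 before scaling up; the variable names are my own.

```python
# Figures reported for the Cycle Computing run.
instances = 10_600
hours = 2 + 9          # two hours to configure, nine hours to run
total_cost = 4_362.0   # US$

per_instance = total_cost / instances      # about US$ 0.4115
per_instance_hour = per_instance / hours   # about US$ 0.037

# Researcher scenario, using the rounded rate from the text.
rate = 0.037                  # US$ per instance per hour
run_hours = 2 + 0.25          # startup plus a 15-minute run
cost_per_run = 25_000 * run_hours * rate
runs_per_year = 3 * 52        # three times a week
cost_per_year = cost_per_run * runs_per_year

print(round(per_instance, 4), round(per_instance_hour, 3))
print(round(cost_per_run, 2), round(cost_per_year, 2))
```

Note that two of the 2.25 billed hours per run are startup time, so a faster way to bring up instances would cut the bill far more than a faster application would.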
For comparison purposes, assume the building block is a four-socket AMD node with 16 cores per socket (64 physical cores). Also assume that you oversubscribe the physical cores 3:1, producing 192 VMs per physical server. Furthermore, assume that each VM needs at least 2GB of memory, which means at least 384GB per node, or about 512GB in a practical configuration. Using Dell’s handy on-line configuration tool, I configured a 2U server that meets the specifications and has a price of about $14,500. The power usage for such a node under load is about 992W (almost 1kW), and the idle load is 434W. Assuming power is $0.14/kWh, the yearly power cost for a single system is about US$ 535 (8,721 hours at idle, 39 hours at peak load). Therefore, to buy and operate a single node over one year is roughly US$ 15,000. Using the yearly cost from the Cycle Computing example, you can afford to buy roughly 21 systems. Using 192 VMs per server, you only end up with 4,032 VMs, whereas with Cycle Computing, you get 25,000 instances, even including the two hours to configure all of them when they are needed.
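The per-node numbers can be checked with a short calculation. The sketch below assumes the electricity price is per kilowatt-hour, that the node is at peak load only during the 156 fifteen-minute runs (39 hours), and that it idles for the rest of an 8,760-hour year.

```python
HOURS_PER_YEAR = 8_760


def yearly_power_cost(idle_w, peak_w, peak_hours, price_per_kwh):
    """Yearly electricity cost (US$) for one node that is idle except at peak_hours."""
    idle_hours = HOURS_PER_YEAR - peak_hours
    kwh = (idle_w * idle_hours + peak_w * peak_hours) / 1000
    return kwh * price_per_kwh


peak_hours = 156 * 0.25                        # 39 hours at peak load
power = yearly_power_cost(434, 992, peak_hours, 0.14)
node_first_year = 14_500 + power               # purchase plus one year of power

cloud_per_year = 324_675                       # yearly cloud cost from the text
affordable_nodes = int(cloud_per_year // node_first_year)
vms = affordable_nodes * 64 * 3                # 3:1 oversubscription of 64 cores

print(round(power, 2), affordable_nodes, vms)  # 535.3 21 4032
```

Almost all of the power bill comes from idle time, which is exactly the inefficiency of dedicating hardware to a bursty workload.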
To match the number of VMs needed (25,000), you need about 131 servers. The purchase cost for these is US$ 1,899,500. At about US$ 535 per server, the yearly power bill is about US$ 70,100. Over one year, this works out to a total of about US$ 1,969,600. Over three years, the total is about US$ 2,109,800. On the other hand, using cloud computing via Cycle Computing, the price for one year is US$ 324,675; over three years, the price is about US$ 974,025. Over three years, cloud computing works out to a bit less than half the cost of a dedicated system for these workloads.
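For readers who want to vary the assumptions, the comparison can be written as two small cost functions. This is only a sketch: it uses a per-server power cost of US$ 535 per year, leaves totals unrounded, and ignores cooling, floor space, staff, and hardware refresh, none of which appear in the figures above.

```python
import math


def on_prem_cost(years, servers, price_per_server, power_per_server_year):
    """Purchase cost plus electricity for a dedicated cluster over `years`."""
    return servers * (price_per_server + years * power_per_server_year)


def cloud_cost(years, cost_per_year):
    """Pay-per-use cloud cost over `years`."""
    return years * cost_per_year


servers = math.ceil(25_000 / 192)   # 192 VMs per server at 3:1 oversubscription

for years in (1, 3):
    local = on_prem_cost(years, servers, 14_500, 535)
    cloud = cloud_cost(years, 324_675)
    print(years, local, cloud, round(cloud / local, 2))
```

Because the on-premises cost is dominated by the purchase price, the cloud advantage narrows as the period lengthens; extending `years` in the loop shows roughly where the two lines would cross under these assumptions.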
This is a very simplified analysis because one could argue that the idle systems could run someone else’s jobs, but the point of the comparison was to determine whether it was better to buy dedicated systems to run 25,000 jobs at the same time, 156 times a year, using virtualized systems or to use the cloud. I think this rudimentary comparison still shows that this particular workload is more efficient in the cloud than using on-premises resources, even with oversubscribed virtual machines.
Although the title of this article is about HPC in the cloud, it’s really about two things: the evolution of HPC into research computing, and how cloud computing can be used to solve research computing problems.
At first, it was fairly easy to dismiss cloud computing for traditional HPC workloads. The “HP,” after all, stands for “high performance,” and doing anything to reduce performance is counterproductive. You are paying more and getting less. However, new workloads are being added to HPC all of the time that might be very different from the classic MPI applications in HPC and have different characteristics. The amount of computation in these new workloads is increasing at an alarming rate – so much so, that I think HPC is giving way to RC (research computing).
In this article, I gave two examples of new workloads that are helping to morph HPC into RC. The first example is an application class that needs to run on thousands of cores serially, doesn’t run very long, and doesn’t do a great deal of I/O, but all instances of the application need to run at the same time. The applications are varied, but they share these common aspects, particularly the need to run all the applications at about the same time.
Until a few years ago, I didn’t hear too much about these applications, but in recent years, they’ve become more and more common at HPC centers. Improving the per-core performance will not help the overall productivity because the applications run so quickly. What really improves productivity is to run all instances of the application at the same time. This makes the researcher much more productive than having just a few applications run at a time.
In the second example, applications run on the web are being used for data post-processing, as well as data creation. A need for this group of applications is web servers on which to share and investigate data and research results. In the past, these applications had to be run on IT department web servers, even though they are really RC applications, and the IT departments don’t really know how to handle these requests because their mission is a bit different. Consequently, these applications are now increasingly run by the research computing team.
The second theme of this article is that many of the workloads in RC can be tackled by cloud computing that is not necessarily on-premises (i.e., in the public cloud). The characteristics of some of the workloads are such that putting them in the cloud can save money relative to running them on traditional HPC hardware, and in many cases, it can save time because you can spin up a very large set of resources very quickly that is larger than anything you have in the HPC center (with possibly a few exceptions). Moreover, moving these workloads to the cloud can also make your traditional HPC systems more efficient because you do not have large applications blocking the queues.
I consider cloud computing a tool or technique for solving research computing problems. Nothing more or less. It’s not a panacea, nor should it be ignored. Issues that must be addressed include data movement and security, but it also can save you money and make your traditional HPC resources stretch further. If you examine your workloads and their characteristics carefully, I think you will be surprised by how many can be run easily in the cloud.