Three more pitfalls are presented, along with insights on how to avoid some common mistakes

Five HPC Pitfalls – Part 2

In Part 1, the role of benchmarks and assumptions about commodity hardware were presented as possible pitfalls for those entering the HPC space. In this second part, pitfalls associated with open software, system integration, and storage aspects will be addressed. Other aspects, including costs, consultants, and relationship intangibles, will round out the discussion and help make potential and current HPC practitioners all the wiser.

Pitfall Three: Free Software Has No Cost

Another misconception that extends far beyond HPC clusters is the notion that openly available software is free and therefore adds no cost to a cluster. Although the initial acquisition cost of open software might be nonexistent, software support and integration most certainly have associated costs. That time and effort must come from either the user or a vendor and does not vanish simply because the software was freely available. In the case of HPC clusters, these costs can be quite substantial and are often the responsibility of the customer. If the customer takes the “learn as you go” approach to managing an open software stack, additional time and cost should definitely be expected.

It is possible to purchase a complete Linux-based cluster distribution or download one of several freely available options. In general, the software is very similar and often based on a commercially available Linux distribution (e.g., RHEL from Red Hat, Red Hat rebuilds such as CentOS, or SLES from Novell), but the support options can vary from vendor to vendor. A small number of hardware vendors will support an entire software stack on their hardware – that is, they have the internal expertise to resolve a deep issue as it pertains to their hardware – but most vendors merely sell third-party cluster distributions or leave the choice to the user.

A professionally supported cluster software distribution has a definite advantage. Having an expert manage software upgrades, security updates, and bug reports is important to production-level sites, but as with vendor-supplied hardware, there are limits to what a vendor will provide. If a user requires a new version of a package (or a package that did not exist in the original distribution), they will be left to install and support that package on their own. This situation is similar to that faced by administrators who prefer to “roll their own” cluster software on top of standard Linux distributions.

One of the biggest issues facing cluster administrators is upgrading software. Commonly, cluster users simply load a standard Linux release on each node and add some message-passing middleware (e.g., MPI) and a batch scheduler. This arrangement offers a quick victory for the administrator, but it can cause serious upgrade issues and downtime in the future. For instance, upgrading to a new Linux distribution might require rebuilding MPI libraries and other middleware. User applications might also need to be rebuilt with a third-party optimizing compiler that does not yet support the new distribution. Administrators and users are then forced to find workarounds or fixes before the new software can be used. Other packages can suffer a similar fate, resulting in frustration and lost productivity. In summary, free software does not imply free support or easy integration. The open nature of Linux-based software allows optimal flexibility and choice within a local user environment, but it can also place extra responsibility on the administrator or user.
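One routine piece of that responsibility is verifying, after an upgrade, that existing binaries still resolve their MPI libraries. The following minimal sketch illustrates the idea; it assumes a Linux node with the ldd utility, and the application path is purely a hypothetical example, not something from this article.

#!/usr/bin/env python3
# Sketch: list the shared MPI libraries an application binary is linked
# against and flag any that no longer resolve after an upgrade.
# Assumes a Linux node with "ldd"; the binary path below is hypothetical.
import subprocess
import sys

def mpi_dependencies(binary):
    """Return (library, resolved path or None) for each MPI-related entry."""
    out = subprocess.run(["ldd", binary], capture_output=True, text=True, check=True)
    deps = []
    for line in out.stdout.splitlines():
        line = line.strip()
        if "mpi" not in line.lower():
            continue
        name = line.split()[0]
        path = None
        if "=>" in line and "not found" not in line:
            path = line.split("=>", 1)[1].split()[0]
        deps.append((name, path))
    return deps

if __name__ == "__main__":
    binary = sys.argv[1] if len(sys.argv) > 1 else "/opt/apps/bin/solver"  # hypothetical path
    for lib, path in mpi_dependencies(binary):
        status = path if path else "NOT FOUND -- rebuild against the new MPI stack"
        print(f"{lib:30s} {status}")

Running a check like this across user applications before and after an upgrade gives an early warning of which codes will need to be rebuilt.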

Pitfall Four: Integration Is Simply Racking and Stacking

Installing cluster hardware is an important job, often requiring someone with the experience and skills to integrate the hardware into a complete system. This process includes component placement, network wiring, and testing. Most large clusters are built on-site, and such builds often run into unforeseen issues that can mean delays or even additional expense. Preferably, your vendor will stage the cluster, perform acceptance testing, and then install it at your site. Although pre-staged clusters are usually more expensive than site-built systems, you are assured that the cluster will be available for use within a day or two of delivery. When the cluster is extremely large, it might not be possible to pre-stage the hardware because of space or power constraints at the vendor facility. These systems require a professional installation team as well.

The difference between connecting hardware and delivering a workable cluster is notable. Customers should always have an acceptance testing plan under which their actual applications must be demonstrated to run optimally on a cluster. An acceptance testing plan is a far cry from installing a Linux distribution on each node and testing network connectivity. A vendor’s ability to deliver a usable cluster is called the “stand-up rate” and is, in effect, how long it takes to provide a fully functioning cluster from the day of delivery. A good stand-up rate should be measured in days, not weeks.
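For illustration, an acceptance test can be as simple as a script that runs an agreed-upon set of application cases and compares wall-clock times against baselines negotiated with the vendor. The commands, baseline times, and 10% tolerance in this sketch are hypothetical placeholders, not values from the article.

#!/usr/bin/env python3
# Sketch of a minimal acceptance test: run agreed-upon application cases
# and compare wall-clock times to negotiated baselines. The commands,
# baselines, and tolerance below are hypothetical placeholders.
import subprocess
import time

TOLERANCE = 1.10  # accept runs up to 10% slower than baseline

# (label, command to run, baseline wall-clock seconds) -- all hypothetical
CASES = [
    ("cfd_small",   ["mpirun", "-np", "64",  "./cfd_solver", "small.in"],   420.0),
    ("chem_medium", ["mpirun", "-np", "128", "./chem_model", "medium.in"], 1800.0),
]

def run_case(label, cmd, baseline):
    start = time.monotonic()
    result = subprocess.run(cmd)
    elapsed = time.monotonic() - start
    ok = result.returncode == 0 and elapsed <= baseline * TOLERANCE
    print(f"{label:15s} {elapsed:8.1f}s (baseline {baseline:.1f}s) "
          f"{'PASS' if ok else 'FAIL'}")
    return ok

if __name__ == "__main__":
    results = [run_case(*case) for case in CASES]
    raise SystemExit(0 if all(results) else 1)

The point is not the script itself but the agreement behind it: both you and the vendor know, in writing, what “working properly” means before the hardware arrives.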

One shaky assumption users often make is that, once the cluster is operational, the integration is finished. In some respects, the installation has been successful, but in almost all cluster deployments, a several-month period usually ensues during which local integration and system “shake-out” take place.

Local integration can vary by site, but it often involves user rights management, storage issues, resource scheduling policies, system tuning, user/administrator education, and basic administration issues. This task is perhaps one of the most difficult aspects of cluster acquisition and often the most overlooked. Indeed, it is when support from the vendor can be most critical. In many cases, however, the large tier-one vendors are not organized to provide the “high-touch” level of support required at this juncture, and many are glad to part company once the hardware is installed. Smaller vendors and integrators have a distinct advantage because they typically provide direct and knowledgeable support during this process (i.e., you can talk to the person who actually built or configured your system).

Pitfall Five: NFS Is Enough

Storage is often the forgotten aspect of initial HPC designs. During the specification stage, customers often assume that storage is simply the number of hard disks to be placed in one of the administrative nodes. The successful use of Network File System (NFS) as a cluster-wide filesystem invites the assumption that all storage needs can be addressed in this fashion. In reality, the amount and type of cluster-accessible storage you need depends largely on application requirements rather than on the total size of a storage system.

NFS works fine for many clusters, but issues tend to develop in clusters with more than 100 nodes, and in fact, most people are surprised to learn that NFS was not originally designed for a cluster environment. The upcoming parallel NFS (pNFS) is intended to help solve this problem, but most clusters just run NFS “out of the box” without any optimization. Additionally, the multicore nature of cluster nodes usually means more I/O traffic to and from individual nodes. Poor file I/O performance can lead to poor utilization of your compute nodes and diminish the expected performance of your cluster.
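A quick way to see whether NFS is keeping up is to measure per-node write throughput while an increasing number of nodes run the same test at once. The sketch below assumes a shared NFS-mounted directory (the mount point is a placeholder) and reports the throughput seen from the node on which it runs.

#!/usr/bin/env python3
# Sketch: measure write throughput from one node to a shared NFS directory.
# Run it concurrently from more and more nodes to see when the server
# saturates. The mount point below is a hypothetical example.
import os
import socket
import time

MOUNT = "/shared/nfs_scratch"          # hypothetical NFS mount point
SIZE_MB = 512                          # amount of data each node writes
CHUNK = b"\0" * (1 << 20)              # 1 MiB buffer

def write_test():
    path = os.path.join(MOUNT, f"iotest-{socket.gethostname()}-{os.getpid()}")
    start = time.monotonic()
    with open(path, "wb") as f:
        for _ in range(SIZE_MB):
            f.write(CHUNK)
        f.flush()
        os.fsync(f.fileno())           # make sure data actually reaches the server
    elapsed = time.monotonic() - start
    os.remove(path)
    return SIZE_MB / elapsed

if __name__ == "__main__":
    print(f"{socket.gethostname()}: {write_test():.1f} MB/s to {MOUNT}")

If aggregate throughput flattens out or per-node numbers collapse as more nodes join in, that is a sign your applications may need more than stock NFS.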

Parallel filesystems might be needed if you have large and fast I/O requirements. Some of the more common parallel filesystem solutions include Lustre, Panasas, Gluster, and IBM GPFS. The choice of filesystem should be based solely on your needs because no one parallel filesystem solution fits all. As mentioned previously, benchmarking might be the only way to make this determination. New storage network technologies such as Fibre Channel over Ethernet (FCoE), Fibre Channel over InfiniBand (FCoIB), or NFS over Remote Direct Memory Access (NFS/RDMA) might need to be considered as part of a filesystem solution. Evaluating these options should be done carefully because cost and performance can vary widely.

Understanding Costs

A common theme running through all of these pitfalls is the need to fully understand the additional (or hidden) costs associated with clusters. Failure to account for these costs will result in lost productivity, additional expenses, and, in the worst case, a non-functioning cluster.

In general, the pitfalls mentioned in this two-part article can be placed in the following categories: validation/optimization/specification costs, software integration/maintenance/upgrade costs, and infrastructure costs. This list is not meant to be all-inclusive; other costs could certainly arise, and a careful review of your requirements should help focus your plans.

One way to help minimize these unexpected costs is to prepare a detailed specification and Request for Proposal (RFP) to be used when contacting vendors. You can find a detailed article, “How to Write a Technical Cluster RFP,” at ClusterMonkey.net. It is imperative that you include some form of acceptance testing for the cluster to ensure that it is working properly. Qualified vendors should be able to answer pointed HPC questions. If they cannot, a pitfall could lie ahead.

Consider Enrolling an Ally

The realities of cluster acquisition can be quite sobering. As mentioned here, customers are now required to make many decisions that did not previously exist when purchasing a fully integrated supercomputer from a single vendor. The multivendor nature of commodity clusters has provided the double-edged sword of choice and responsibility for most HPC users.

One approach that has proven very successful for many organizations has been to employ a third-party consultant or integrator to assist with the specification, acquisition, and integration of the cluster. At first glance, some might consider this an additional and unnecessary expense, but in light of the possible pitfalls I’ve mentioned, enlisting an experienced ally can actually save money in the long run. An integrator or consultant can lower your overall project cost because they will help you purchase only what you need and will make sure it functions properly. Indeed, savvy customers sometimes choose to keep a knowledgeable integrator/consultant under an ongoing support contract should future issues arise with the cluster. Even if you have existing technology personnel, using a consulting contractor or integrator during the acquisition process can help keep you on schedule through the extra work that will be required for a successful project. It is important to include the integrator/consultant at the very beginning of the process, before you start talking with hardware vendors. Some integrators also sell hardware, but it is best to find a “hardware-neutral” partner that can recommend what is best for your requirements.

The value of an integrator/consultant lies in the knowledge and relationships they already have within the HPC market. Often a good integrator can tell you what works well and what doesn’t before you commit to any specific hardware, as well as which vendors fully understand and support the HPC market and which are recent additions still learning the ropes themselves. In general, a good integrator/consultant can help you navigate the potential pitfalls of the cluster acquisition and integration process.

Important Intangibles

If you choose to bring an integrator/contractor in as a partner for your project, be sure to check out the relationship intangibles. These are the aspects of your partnership that do not appear on the statement of work or contract. For instance, some questions to consider are: How well does the integrator/consultant listen? Are phone and email messages returned promptly? How knowledgeable are the integrator/consultant team members? How well does the integrator’s team work with your team? These and other issues in your relationship with the integrator/consultant can be every bit as important as the work they will perform. Good and honest communication is essential. Problems will inevitably arise; how they are handled is what separates a good partner from a bad partner.

Perhaps the best way to evaluate the intangibles is to get references from other customers. Any experienced integrator/contractor will have no problem providing references from past and present customers. Before you make a decision, you might even want to interview the members of the integrator/contractor team. These people are going to be the “boots on the ground” when it comes time to make things work.

Finally, trust your gut instincts. If you don’t get a good feel for one integrator/consultant, look for another. Although good HPC integrators are somewhat rare, they are available. Your styles should complement each other and support your HPC mission.

Conclusion

Specifying, procuring, and managing HPC resources can be a challenging task. As discussed, a few common pitfalls can cause major problems and incur extra costs. In particular, understanding the nuances of public benchmarks, commodity hardware, free and open software, integration, and storage will allow you to make better decisions.

Beyond the initial hardware purchase you have many costs to consider, and understanding these will help create better expectations and minimize problems during both your purchase and your installation. Using an HPC integrator/contractor can lower project costs and help you navigate the pitfalls as you travel the path to successful and productive HPC in your organization.