Confessions of a Patchaholic

Managing patches, service packs and updates in a heterogeneous environment is one of the leading causes of sleep deprivation among system administrators. The big question is, “How do you manage patches in a complex environment?”

Scan. Patch. Reboot. Scan. Repatch. Reboot. Lather. Rinse. Repeat. Patching is one of the many thankless and joyless duties of a system administrator. Everyone expects it to be done but no one in the Monday morning staff meeting stops to say, “I’d like to take a moment and say ‘Thanks’ to our SAs for their efforts in the scheduled patch session that they performed last night in the Test and Development environment.” It just doesn’t happen that way.

The more likely scenario is that the designated team leader sends out an email blast at 4:00am to notify his email distribution list that the latest patch session has gone well – and any remaining unpatched systems will receive their patches manually tonight.

Patching is a necessary evil, and if you have a good system for applying patches, patch night goes well most of the time. There are always a few stubborn systems that don’t come back up correctly or completely after a reboot, and there are those that won’t accept a particular patch. Some systems have to be coddled through patch night like a newborn baby during its periodic feedings, and some systems are needier than others. You get used to it.

It’s a Numbers Game

If you have a few systems, you can patch manually or use the operating system’s built-in tools for patch management. On Linux systems, you can use Debian’s apt-get, Red Hat’s yum, or SUSE’s YaST. Windows systems have Windows Update, and Apple has its own Mac OS software update service. Commercial Unix systems generally receive a quarterly patch bundle or individual security updates by manual download and install.

But, for more than two dozen or so systems, patching becomes a daunting task and certainly one for which you need a good plan. A good plan includes a patch schedule, a manual patch option, and a pre-patch test environment. The pre-patch test environment allows you to apply patches to systems that are similar to the production system but are used for test or development purposes. If these systems fail, you don’t take a production system out, and you can decide to postpone or remove an errant patch.
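The test-first part of that plan can be sketched in a few lines of Python. This is only an illustration: the `phased_rollout` helper, the host lists, and the `apply_patch` callable are hypothetical stand-ins for whatever tooling you actually use.

```python
from typing import Callable, Dict, List


def phased_rollout(
    test_hosts: List[str],
    prod_hosts: List[str],
    apply_patch: Callable[[str], bool],
) -> Dict[str, List[str]]:
    """Patch the test group first; touch production only if every test host succeeds.

    apply_patch is a hypothetical stand-in for whatever actually patches a host.
    It returns True on success and False on failure.
    """
    result: Dict[str, List[str]] = {"patched": [], "failed": [], "postponed": []}

    for host in test_hosts:
        if apply_patch(host):
            result["patched"].append(host)
        else:
            result["failed"].append(host)

    # An errant patch in test means production waits for another night.
    if result["failed"]:
        result["postponed"] = list(prod_hosts)
        return result

    for host in prod_hosts:
        if apply_patch(host):
            result["patched"].append(host)
        else:
            result["failed"].append(host)
    return result
```

The one design choice worth noting is the early return: a single failure in the test group postpones every production host, which is the conservative reading of "you can decide to postpone or remove an errant patch."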

A great plan also has some third-party management software and automation built into it. Automation does not mean allowing your systems to update themselves automatically via their vendor sites or repositories. Automation means applying patches to groups of systems using a patch management suite. Automated patching usually nets a success rate above 90 percent and often approaches 100 percent. Chances are very good, however, that you will never achieve 100 percent patch compliance with an automated system.
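To put a number on a patch night, you might tally the results along these lines. The `compliance_report` function and the host names are made up for illustration; a real patch management suite reports this for you.

```python
from typing import Dict, List, Tuple


def compliance_report(results: Dict[str, bool]) -> Tuple[float, List[str]]:
    """Return (percent of hosts patched, sorted list of hosts left for manual patching)."""
    if not results:
        return 0.0, []
    stragglers = sorted(host for host, ok in results.items() if not ok)
    patched = len(results) - len(stragglers)
    return 100.0 * patched / len(results), stragglers


# Hypothetical results from one patch night: 19 of 20 hosts succeeded.
night = {f"web{n:02d}": True for n in range(1, 20)}
night["db01"] = False
percent, manual_queue = compliance_report(night)
```

The second return value is the useful one: it is the list of systems that get their patches by hand the following night.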

Automated but Not Automatic

Automatic patching sounds good to some executives, but the loss of control and lack of tracking make it a very bad idea. You can’t allow your systems to update on their own because you’ll have no change control, and you won’t have time to research any unstable patches. Nor will you be able to schedule outages for the inevitable reboots as part of the process. Automatic updating is OK until someone makes you pay a penalty for missing a service-level agreement (SLA).

Automatic updates take away control on two levels: change control and information control. Change control is the formal procedure associated with making changes to any computing environment. These changes comply with SLAs, federal compliance acts, and best practices for maintaining a stable computing infrastructure. Change control generally consists of a change proposal (description of activities) that includes a backout plan, a request for comments, a peer review, and approvals from all involved parties (SAs, network, management, project management).
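One way to picture a change request is as a small record that is not ready until every box is ticked. The sketch below is a simplification under my own assumptions: the `ChangeRequest` class, its field names, and the approver identifiers are hypothetical, and the request-for-comments step is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Set

# The approving parties named in the text; the identifiers are illustrative.
REQUIRED_APPROVERS = frozenset({"sa", "network", "management", "project_management"})


@dataclass
class ChangeRequest:
    description: str            # the change proposal: what will be done
    backout_plan: str           # how to undo it if patch night goes badly
    peer_reviewed: bool = False
    approvals: Set[str] = field(default_factory=set)

    def missing_approvals(self) -> Set[str]:
        """Return the required parties that have not signed off yet."""
        return set(REQUIRED_APPROVERS - self.approvals)

    def ready(self) -> bool:
        """True only with a backout plan, a peer review, and every sign-off."""
        return (
            bool(self.backout_plan.strip())
            and self.peer_reviewed
            and not self.missing_approvals()
        )
```

A request with an empty backout plan is never ready, no matter how many approvals it collects.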

Yes, it’s a painful and lengthy, but necessary, process. It provides a documented history of changes to systems, and it gives you a veil of protection against backlash from clients over unauthorized changes.

As a system administrator, you need to know what’s going on in your support area. Change control helps you maintain some sanity in the environment. That documentation gives you the information you need to track when changes occurred relative to any problems that you encounter. For example, if your logs tell you your web service began experiencing core dumps two weeks ago, you can go back into the change logs and check which patches were applied to the errant system just before the core dumps started.
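That lookup amounts to filtering the change log by host and date. Here is a minimal sketch, assuming the change log can be loaded as a list of records; the `Change` tuple, the `suspects` function, and the seven-day default window are my own illustrative choices.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Change(NamedTuple):
    host: str
    applied_at: datetime
    patch: str


def suspects(changes: List[Change], host: str, first_failure: datetime,
             lookback_days: int = 7) -> List[Change]:
    """Return changes applied to host in the lookback window before the first failure, newest first."""
    window_start = first_failure - timedelta(days=lookback_days)
    hits = [c for c in changes
            if c.host == host and window_start <= c.applied_at <= first_failure]
    return sorted(hits, key=lambda c: c.applied_at, reverse=True)
```

Sorting newest first puts the most recent change at the top of the list, which is usually the first place to look.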

Without that documentation, you’ll spend hours troubleshooting and floundering about for an answer.

Exception Management

As noted earlier, some systems won’t allow themselves to be patched with an automated tool because of security, legacy tools, service exceptions, or unknown reasons. Short of re-imaging these troubled systems, you’ll have to patch them manually for the entirety of their life cycles. You’ll probably spend as much time performing manual patches as you do your automated ones. Plan for it.

To call patch management difficult in a complex, multilocation, multiplatform, security-enhanced network with diverse operating systems is an understatement. And it isn’t just security patches you have to worry about. Firmware, BIOS, and drivers are part of the mix, adding a special level of complexity to your job because you’ll need physical access or Integrated Lights-Out connectivity to apply them.

In the Toolbox

It’s time to look into the patch management system possibilities. Not every available software program is listed here but these three are worth further exploration. I personally have experience with all three of these systems and have used them over the past 12 years in large complex environments.

VMware Update Manager

For VMware virtual data centers and VMware shops, you might never find a better tool for patch management than VMware’s own Update Manager (VUM). With it, you can patch and update your ESX hosts and your Windows and Linux virtual machines (VMs). It has a simple three-step process for patch management:

  1. Baseline creation
  2. Compliance scan
  3. Remediation

Using profiles, you can establish compliance baselines that vCenter Server uses to update data centers, hosts, clusters, VMs, or groups. Once you’ve created a baseline for your systems, you scan them to compare that baseline to the compliance database for patches and updates that need to be installed. The remediation phase prepares the systems, applies the patches, and reboots the systems if necessary.
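The baseline, scan, and remediate cycle is easy to model with plain sets. To be clear, the sketch below is not the VUM API; the `scan` and `remediate` functions and the patch names are hypothetical, and remediation is simulated by updating an in-memory inventory.

```python
from typing import Dict, Set


def scan(baseline: Set[str], installed: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Compliance scan: for each system, list baseline patches that are not installed."""
    return {system: baseline - have for system, have in installed.items()}


def remediate(missing: Dict[str, Set[str]], installed: Dict[str, Set[str]]) -> Set[str]:
    """Apply missing patches (simulated by updating the inventory); return systems that were changed."""
    changed = set()
    for system, patches in missing.items():
        if patches:
            installed[system] |= patches
            changed.add(system)
    return changed


# Step 1: the baseline. Step 2: scan. Step 3: remediate, then rescan to confirm.
baseline = {"p1", "p2", "p3"}
inventory = {"vm-a": {"p1", "p2", "p3"}, "vm-b": {"p1"}}
gaps = scan(baseline, inventory)
touched = remediate(gaps, inventory)
```

Running the scan again after remediation is the cheap way to confirm that every system now matches the baseline.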

Scanning and remediating systems can take a very long time, so it is best to separate systems into logical groups for patching. Remember that part of the host remediation process requires migration (VMotion) of VMs from the target host to other hosts. Fitting the work into a narrow maintenance window is often difficult and may require a phased or multi-night schedule.
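If you have rough time estimates per host, splitting the work across nights is simple arithmetic. The following sketch assumes hosts are remediated one at a time and that you supply the estimates yourself; the `plan_nights` function, the host names, and the minute values are all hypothetical.

```python
from typing import Dict, List


def plan_nights(estimates: Dict[str, int], window_minutes: int) -> List[List[str]]:
    """Pack hosts into nights, in the order given, so each night's estimated total fits the window.

    A host whose own estimate exceeds the window is still scheduled, on a night by itself.
    """
    nights: List[List[str]] = []
    current: List[str] = []
    used = 0
    for host, minutes in estimates.items():
        # Start a new night when adding this host would overrun the window.
        if current and used + minutes > window_minutes:
            nights.append(current)
            current, used = [], 0
        current.append(host)
        used += minutes
    if current:
        nights.append(current)
    return nights


# Four hosts and a four-hour window (240 minutes).
schedule = plan_nights({"esx01": 90, "esx02": 90, "esx03": 90, "esx04": 60}, 240)
```

This keeps hosts in the order you list them rather than searching for the tightest packing, on the theory that a predictable schedule is easier to put through change control.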

But, what if your data center contains systems of both the virtual and physical types? IT managers who want to use a single tool for all systems won’t be able to use VUM. Whatever tool or tools you decide to use for your virtual or mixed data center, you should allow ESX updates to occur via VUM.

The primary upside to VUM is that it’s a VMware product, and VMware knows its own products better than anyone else. The significant downside to VUM is that it requires the VUM server to have Internet access.

HP Server Automation

HP’s Server Automation (HPSA) software (f.k.a. Opsware) takes care of patch and update management for any system for which there is a client or agent, physical or virtual. Because it is agent-based, HPSA requires that firewall access for its TCP communications ports be open between the HPSA server and the agents. HPSA integrates with Microsoft Patch Network and Red Hat Network.

However, HPSA is much more than a patch management solution: It is a software suite that can manage everything from operating system provisioning to database automation to integration with network automation, storage, and applications.

HPSA’s upsides as an enterprise system management suite are its customizability and extensibility. A potential downside for large networks is the requirement for satellite HPSA servers to improve performance for systems remotely located from the primary HPSA servers.

IBM Tivoli Endpoint Manager

Like HPSA, Tivoli Endpoint Manager (TEM) is more than a patch management solution. It is agent based and supports a variety of operating systems and platforms. TEM is far less complex than its predecessor, IBM Tivoli Framework, which required a large number of physical and human resources to operate and maintain. TEM requires very little of either.

TEM supports both physical and virtual systems as endpoints. An endpoint is any system onto which the TEM agent is installed. The agent requires two-way network communications with the TEM server.

The upside to TEM is its ability to manage and monitor thousands of endpoints. TEM’s downside is its expense (~US$ 40 per device). And, for greater efficiency, you might need satellite or relay TEM servers.

Summary

If you don’t have automated tools for patch management, you should seriously consider buying into a solution. You can justify the cost by explaining the efficiencies built into these tools. With the proper tools, you can patch more systems with fewer staff, decrease outages due to human error, reduce maintenance window lengths by applying patches to multiple systems at one time, and have independent logs to track your success.

Patching is one of the least appealing aspects of your job as a system administrator, and it will continue to be an unpleasant task to perform. The hours are bad for sleep, it always requires manual intervention, and you rarely get as much as a half-hearted “Thanks” for your efforts. There isn’t much you or I can do about the human aspect of your job, but fortunately, you have tools to take away some of the pain of Patch Night.