Efficiently planning and expanding the capacities of a cloud

Cloud Nine

CPU and RAM

Capacity planning for CPU and RAM resources in the cloud puts the admin under fire from a completely different angle. First, the issue of overcommitting these two assets is extremely sensitive. In the case of RAM, overcommitting is off the table from the outset: KVM offers the option, but under no circumstances do you want the Linux kernel's out-of-memory killer to take down virtual machines (VMs) in the cloud because the host runs out of RAM. Overcommitting the available CPUs works better, but don't exaggerate – selling each physical core twice (a 2:1 ratio) is usually as far as you should go.
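In an OpenStack cloud, Nova exposes exactly these policies as its cpu_allocation_ratio and ram_allocation_ratio settings. As a purely illustrative Python sketch (the host figures anticipate the example below, and the helper function is invented for this article), the effect of such a policy on sellable capacity looks like this:

# Illustrative sketch: what the overcommit policy described above
# means for the capacity one host can sell.
CPU_RATIO = 2.0   # each physical core is sold at most twice
RAM_RATIO = 1.0   # RAM is never overcommitted

def sellable_capacity(phys_cores, ram_gb):
    """Return (virtual CPUs, GB of RAM) that may be handed to VMs."""
    return int(phys_cores * CPU_RATIO), int(ram_gb * RAM_RATIO)

print(sellable_capacity(28, 256))  # (56, 256)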

The relationship between CPU and RAM is of much greater importance for capacity planning in the cloud. If you look at the offerings of most clouds, you will notice that a fixed number of CPUs is often tied to a certain amount of RAM. For example, if you book a single virtual CPU, you might have to take at least 4GB of RAM with it, because that is the smallest available option.

Cloud providers have not switched to this system because they want to annoy their customers. Rather, the aim is to keep waste on the supplier side as low as possible. For example, a state-of-the-art server with two Xeon E5-2690 v4 processors offers 14 CPU cores and 28 threads per socket – 28 cores and 56 threads in total. Systems like this are usually equipped with 256 or 512GB of RAM. Overcommitting the CPU resources at the 2:1 ratio mentioned earlier yields a total of 56 virtual CPU cores, of which the provider normally passes 50 on to its customers.

If you assume the ratio of CPU to RAM provided through corresponding hardware profiles is 1:4, a customer booking 12 virtual CPU cores must add at least 48GB of memory. If four of these VMs are running on the system, a small remainder of CPU and RAM is left over, which can be divided between smaller profiles.
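To make the arithmetic explicit, here is the same example as a small Python calculation; the 50 customer-facing virtual CPUs and 256GB of RAM are the figures from the server above:

# Worked example for the 1:4 profile described in the text.
host_vcpus, host_ram_gb = 50, 256      # customer-facing capacity
flavor_vcpus, flavor_ram_gb = 12, 48   # a 1:4 hardware profile

n = min(host_vcpus // flavor_vcpus, host_ram_gb // flavor_ram_gb)
print(n, "VMs fit")                                  # 4 VMs fit
print(host_vcpus - n * flavor_vcpus, "vCPUs and",
      host_ram_gb - n * flavor_ram_gb, "GB remain")  # 2 vCPUs and 64 GB remain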

If you do not follow the 1:4 rule of thumb and instead allow arbitrary ratios, a customer with especially CPU-heavy tasks could book a VM that needs 32 virtual CPU cores but only 16GB of RAM. Even if another customer needed a large amount of RAM but only a few CPUs, most of the system's RAM would then simply sit idle, because the lack of available virtual CPUs would make it virtually impossible for the cloud to start additional VMs.

The cloud provider has to keep this waste as low as possible; otherwise, data center capital is tied up. The ratio of 1:4 for CPU and RAM is appropriate in most cases, but it is not carved in stone. If you can foresee for your own cloud that other workloads will occur regularly, you can of course adjust the values accordingly. Ultimately, you should order hardware to suit the foreseeable workload.

Collecting Data

If the cloud is well planned and built, you can move on to the trickiest part of capacity planning in the cloud – gazing into the crystal ball to identify the need for new hardware at an early stage. Realistically, it is impossible to predict every resource requirement in good time: A customer who signs up overnight and needs 20,000 virtual CPUs at short notice would make most cloud providers on the planet break a sweat, at least briefly.

However, such events are the exception rather than the rule, and the growth of a cloud can at least be roughly predicted if the admin collects the appropriate metrics and evaluates them adequately. You should ask: Which tools are available? Which metrics should responsible planners collect? How do you evaluate the data?

Collecting metrics works differently in clouds than in conventional setups. For example, given a cloud of 2,000 physical nodes, with 200 values per system collected every 15 seconds, you net 1,600,000 measured values per minute – a significant volume.
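A quick back-of-the-envelope check in Python, using the numbers from this example, shows where that leads over a full day:

# Sample volume for 2,000 nodes, 200 metrics each, every 15 seconds.
nodes, metrics, interval_s = 2_000, 200, 15

per_minute = nodes * metrics * (60 // interval_s)
print(per_minute)            # 1,600,000 samples per minute
print(per_minute * 60 * 24)  # ~2.3 billion samples per day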

Metering solutions bolted onto classical monitoring systems, such as PNP4Nagios, can hardly cope with such amounts of data, especially if a traditional database such as MySQL is responsible for storing the metric data in the background.

Solutions like Prometheus or InfluxDB, whose designs are based on the time series database (TSDB) concept, work far better. Monitoring can also be treated as a byproduct of data collection: If the number of Apache processes on a host drops to zero, for example, the metric system can be configured to trigger an alarm. Both Prometheus and InfluxDB offer this function.
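As a sketch of this alerting-as-a-byproduct idea, the following Python snippet asks a Prometheus server whether the Apache process count has dropped to zero anywhere. It assumes a Prometheus instance on localhost:9090 and a running process-exporter, whose namedprocess_namegroup_num_procs metric provides the process counts – both are assumptions for this example, not requirements:

# Minimal sketch: find hosts on which no Apache process is running,
# by querying the Prometheus HTTP API.
import requests

PROM = "http://localhost:9090/api/v1/query"
QUERY = 'namedprocess_namegroup_num_procs{groupname="apache2"} == 0'

result = requests.get(PROM, params={"query": QUERY}, timeout=5).json()
for series in result["data"]["result"]:
    print("Apache down on", series["metric"].get("instance", "unknown"))

In a production setup, you would express the same condition as an alerting rule inside Prometheus itself rather than polling from outside.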

The Right Stuff

If you rely on Prometheus or InfluxDB for monitoring and trending, the next challenge is just around the corner: What data should you collect? First the good news: Both Prometheus and InfluxDB offer a monitoring agent (Node Exporter and Telegraf, respectively) that regularly reads the basic vital values of a system, even in the standard configuration. These tools allow you to collect values such as CPU load, RAM usage, and network traffic without additional overhead.
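To see what the agents deliver out of the box, it is enough to read the exporter's plain text endpoint. The following sketch assumes a Node Exporter running on its default port, 9100:

# Print two of the stock metrics a Node Exporter serves by default.
import urllib.request

raw = urllib.request.urlopen("http://localhost:9100/metrics").read().decode()
for line in raw.splitlines():
    if line.startswith(("node_load1 ", "node_memory_MemAvailable_bytes ")):
        print(line)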

In most cases, you can find additional agents for the cloud: An OpenStack exporter for Prometheus [2], for example, docks onto the various OpenStack API interfaces and collects known OpenStack data (e.g., the total number of virtual CPUs in use), which also finds its way into the central data vault.

Therefore, if you rely on TSDB-based monitoring, you can collect the values required for metering with both Prometheus and InfluxDB without much additional overhead (Figure 1). Cloud-specific metrics are read by corresponding agents; one of the most important parameters is the time it takes the cloud APIs to respond.

Figure 1: Prometheus is based on the time series database concept and can handle many millions of entries in a short time.
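Measuring that response time requires nothing more than a timed HTTP request against the cloud's API. A minimal probe could look like the following sketch, in which the Keystone URL is merely a placeholder for whatever endpoint your cloud exposes:

# Hypothetical probe: how long does the identity API take to answer?
import time
import urllib.request

URL = "http://keystone.example.com:5000/v3"  # placeholder endpoint

start = time.monotonic()
urllib.request.urlopen(URL, timeout=10)
print(f"API answered in {(time.monotonic() - start) * 1000:.0f}ms")

Fed into the TSDB at regular intervals, such values are exactly the raw material the evaluation below depends on.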

The prettiest data treasure trove is of no use if the information it contains is not evaluated meaningfully. Out of habit, many admins reach for the average: If API requests are processed in 50ms on average, the situation looks quite satisfactory.

However, this assumption is based on the mistaken belief that use of the cloud is distributed evenly over 24 hours. Experience has shown that this is not the case: Activity on a platform is usually much higher during the day than at night. If API requests are processed very quickly at night but take forever during the day, the end result can still be a perfectly satisfactory average – one that is completely useless for describing the experience of daytime users.
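A toy calculation with invented numbers makes the trap visible: many fast requests at night, fewer but much slower ones during the day. The average looks fine, whereas the 95th percentile – which Prometheus and InfluxDB can both compute in place of an average – tells the real story:

import statistics

night = [20] * 500   # ms: fast requests at night
day = [400] * 100    # ms: the requests daytime users actually see
samples = night + day

print(f"mean = {statistics.mean(samples):.0f}ms")                  # ~83ms
print(f"p95  = {statistics.quantiles(samples, n=100)[94]:.0f}ms")  # 400ms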

Similarly, if a particularly large number of CPUs (or a large amount of RAM or network capacity) is free at night but the platform runs at full capacity during the day, customers who want to start VMs during the day do not benefit. If enough bandwidth is available at night but the network is completely saturated during the day, frustration on the customer side increases.
