A self-healing VM system

Server, Heal Thyself

Configuration Management Suite

In choosing the configuration manager (CM), a few specific requirements must be met. The CM must be robust enough to use the API of the hypervisor to power on (and off) the instance directly, without using the guest operating system. It must also be able to create a new instance with a local image or template file. A remote template file can also be used, although it introduces a single point of failure to the system. If all nodes have their own image file, redundancy is preserved.

To replace the faulty nodes, the CM must have an API to destroy instances. Some hypervisors require instances to be registered into a central database to be considered active. The CM must take care not to corrupt this database by dutifully entering and removing instances from it each time. If a new node is to be tested or monitored properly, its network address must be detectable by the monitor if it queries the hypervisor. This functionality can come from the CM, the node monitor itself, or a package or program inside the operating system.

When the system performs a replacement operation, it must first verify that the node has failed. A node that is being reconstructed already exists but does not yet respond, so the system must not attempt to delete a node while it is being built. To protect against this, a flag or note is attached to the instance once it is ready for use. The system is then set up to delete only instances that both fail and carry the flag or notation.
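The flag check described above might be sketched as the following Ansible tasks. This is only an illustration under stated assumptions: the marker text SELF-HEALING-CERTIFIED and the cert_marker variable are hypothetical, the Vv_ variables follow the naming used in Listing 2, and the example relies on the annotation field that vmware_guest_facts returns for an instance.

```yaml
# Sketch only: cert_marker and its value are hypothetical names,
# not taken from the test environment.
- name: Get facts for the suspect node
  vmware_guest_facts:
    hostname: "{{ Vv_hostname }}"
    username: "{{ Vv_username }}"
    password: "{{ Vv_password }}"
    datacenter: "{{ Vv_datacent }}"
    name: "{{ vm_name }}"
  register: vm_facts
  ignore_errors: yes        # a missing VM leaves vm_facts.instance undefined

- name: Destroy faulted VM only if it carries the certification flag
  vmware_guest:
    hostname: "{{ Vv_hostname }}"
    username: "{{ Vv_username }}"
    password: "{{ Vv_password }}"
    datacenter: "{{ Vv_datacent }}"
    folder: "{{ Vv_folder_1 }}"
    name: "{{ vm_name }}"
    state: absent
  vars:
    cert_marker: "SELF-HEALING-CERTIFIED"
  when:                     # list entries are combined with a logical AND
    - vm_facts.instance is defined
    - cert_marker in (vm_facts.instance.annotation | default('', true))
```

An instance that is still being built has no marker in its annotation, so the second condition is false and the delete is skipped.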

The CM is also responsible for loading the configuration files for the new instance's node monitor. This load is progressive, meaning the new instance will be set up to monitor all of the nodes currently seen by the node that is building it. If any additional configuration files are needed, the CM will transfer them and then restart any services that use the files. Services are then started on the new instance and reloaded on the constructing node.
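A minimal sketch of this step, assuming the new instance has been placed in a hypothetical inventory group called new_node and that the monitor is Monit with its per-node files under /etc/monit.d/, could look like this:

```yaml
# Sketch only: the new_node group is a hypothetical name.
- hosts: new_node
  become: yes
  tasks:
  - name: Copy the builder's node monitor checks to the new instance
    copy:
      src: /etc/monit.d/      # trailing slash copies the directory contents
      dest: /etc/monit.d/
    notify: Restart node monitor

  handlers:
  - name: Restart node monitor
    service:
      name: monit
      state: restarted        # also starts the service if it is not running

- hosts: localhost
  tasks:
  - name: Reload the node monitor on the constructing node
    command: monit reload
```

The handler runs only when the copy task reports a change, so the monitor on the new instance is restarted only if files were actually transferred.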

One of the last things a CM needs to do during the rebuild process is to make sure the new node can build nodes of its own. The image files are copied over with whatever protocol is available. If secure key files need to be generated, this is done as well. Some CM suites require registration into a central database, which occurs at the end stages. However, if a CM client is needed, it should be embedded within an image or template.
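As a rough illustration, the two tasks below copy the template directory and create an SSH key pair on the new node. The new_node group and the choice of the root account are assumptions for the example; Vv_homedir is the variable that holds the OVF location in Listing 2.

```yaml
# Sketch only: new_node and the root account are assumptions.
- hosts: new_node
  become: yes
  tasks:
  - name: Copy the template files so the new node can build nodes itself
    copy:
      src: "{{ Vv_homedir }}/"
      dest: "{{ Vv_homedir }}/"

  - name: Generate an SSH key pair if the account does not have one yet
    user:
      name: root
      generate_ssh_key: yes
      ssh_key_bits: 4096
```

The copy module is the simplest choice but can be slow for large disk images; any other available transfer protocol would serve the same purpose.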

A typical setup in the test environment is shown with the code snippet from an Ansible playbook in Listing 2. Each of the three sections separated by a blank line is called a "play" here (in Ansible's own terminology, each is a task within a play). The first play, Shut down faulted VM, attempts to power off the faulty VM in vCenter through the vmware_guest module. Required arguments are the faulted VM's name; the username, password, folder, and virtual data center where the VM resides; and the hostname and fully qualified domain name (FQDN) of vCenter. Lines 10 and 21 instruct the CM to attempt the power-off and delete actions only if vm_facts.instance is defined, which only happens when the instance exists in vCenter. This play reports success even if the VM is already powered off.

Listing 2

Ansible Playbook Snippet

01 - name: Shut down faulted VM
02   vmware_guest:
03     hostname: "{{ Vv_hostname }}"
04     username: "{{ Vv_username }}"
05     password: "{{ Vv_password }}"
06     name: "{{ vm_name }}"
07     datacenter: "{{ Vv_datacent }}"
08     folder: "{{ Vv_folder }}"
09     state: poweredoff
10   when: vm_facts.instance is defined
11
12 - name: Destroy faulted VM
13   vmware_guest:
14     hostname: "{{ Vv_hostname }}"
15     username: "{{ Vv_username }}"
16     password: "{{ Vv_password }}"
17     name: "{{ vm_name }}"
18     datacenter: "{{ Vv_datacent }}"
19     folder: "{{ Vv_folder_1 }}"
20     state: absent
21   when: vm_facts.instance is defined
22
23 - name: Deploy VM from Template file
24   vmware_deploy_ovf:
25     hostname: "{{ Vv_hostname }}"
26     username: "{{ Vv_username }}"
27     password: "{{ Vv_password }}"
28     wait_for_ip_address: yes
29     validate_certs: no
30     datacenter: "{{ Vv_datacent }}"
31     name: "{{ vm_name }}"
32     networks: { name: DYNAMIC_NET }
33     folder: "{{ Vv_folder_1 }}"
34     ovf: "{{ Vv_homedir }}/coreimage.ovf"
35     cluster: "{{ Vv_cluster }}"
36     datastore: "{{ Vv_datastore }}"
37   register: INSTANCE

In the second play, Destroy faulted VM, vCenter attempts to delete the VM, again through the vmware_guest module. The arguments and conditions are the same as in the previous play, except that state is set to absent. This play reports success even if the VM has already been removed.

The third and final play, Deploy VM from Template file, attempts to recreate the instance, this time with the vmware_deploy_ovf module. The connection arguments are the same as in the previous two plays, but no when condition is applied, and some additional lines appear:

  • wait_for_ip_address tells the CM not to continue with the next play until it can either detect that a valid IP address has been given to the instance (in this case via DHCP) or the timeout for waiting has expired.
  • validate_certs is set to no to allow a connection to vCenter when its SSL certificates are not valid.
  • cluster and datastore tell vCenter which host cluster will service the instance and on which storage device the virtual machine disk (VMDK) file will reside.
  • networks names a valid network in vCenter that is mapped to the OVF network name.
  • register saves this Ansible module's return values (if any) to a variable called INSTANCE for later use throughout the rest of the playbook.
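To illustrate how the registered variable might be used later in the playbook, the tasks below read the address of the new instance from INSTANCE. This sketch assumes the module's return value contains an instance dictionary with an ipv4 entry, which should be populated when wait_for_ip_address is set; the new_node group name is hypothetical.

```yaml
# Sketch only: assumes INSTANCE.instance.ipv4 is populated after deployment.
- name: Show the address vCenter reported for the new instance
  debug:
    msg: "New node {{ vm_name }} came up at {{ INSTANCE.instance.ipv4 }}"

- name: Add the new instance to an in-memory inventory group
  add_host:
    name: "{{ INSTANCE.instance.ipv4 }}"
    groups: new_node
```

Later plays can then target the new_node group without knowing the DHCP-assigned address in advance.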

This last play will not succeed if an instance already exists with the same name in the same virtual data center. vCenter prevents any other node from deploying an instance of that name once one of the nodes has started the deployment process. If this play fails, Ansible does not run the rest of the playbook.

The remainder of the playbook copies the configuration files the new node will need for its node monitor and configuration manager, as well as the OVF template file used to make future nodes. Also included in the playbook is a line to place a special text string in the new instance's annotation field that "certifies" the new node as part of the system, which means it can now be deleted when it becomes faulty or unreachable.
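The certification step could be written along the following lines with the annotation argument of the vmware_guest module. The marker text is a hypothetical placeholder, because the actual string used in the test environment is not shown.

```yaml
# Sketch only: the annotation text is a hypothetical placeholder.
- name: Certify the new node as part of the system
  vmware_guest:
    hostname: "{{ Vv_hostname }}"
    username: "{{ Vv_username }}"
    password: "{{ Vv_password }}"
    validate_certs: no
    datacenter: "{{ Vv_datacent }}"
    folder: "{{ Vv_folder_1 }}"
    name: "{{ vm_name }}"
    annotation: "SELF-HEALING-CERTIFIED"
    state: present
```

Placing this task near the end of the playbook means the annotation appears only after the configuration files and template have been copied.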

While other configuration management suites are available (Puppet, Chef, Salt, etc.), in this test environment, I used Ansible version 2.7 because it is the first version that includes support for vmware_deploy_ovf [3]. Ansible also works without the need to install a client on the new node.

Further optimization can be achieved with the use of PID files to restrict each node so that it can only perform one rebuild at a time. This step will save bandwidth and reduce the chances that a node's resources diminish to the point of triggering its own node monitor.
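One possible way to express this restriction inside the playbook itself is a simple lock file used in the spirit of a PID file. Everything named here is an assumption for illustration: the file path, the rebuild_tasks.yml file, and the choice to record the node name rather than a process ID. The check-then-create sequence is not atomic, and a crashed run can leave a stale file behind that needs manual cleanup.

```yaml
# Sketch only: rebuild_pidfile and rebuild_tasks.yml are hypothetical names.
- hosts: localhost
  gather_facts: no
  vars:
    rebuild_pidfile: /var/run/rebuild_node.pid
  tasks:
  - name: Look for a rebuild that is already in progress
    stat:
      path: "{{ rebuild_pidfile }}"
    register: pidfile

  - name: Skip this run if another rebuild holds the file
    meta: end_play
    when: pidfile.stat.exists

  - block:
    - name: Claim the file for this rebuild
      copy:
        content: "{{ vm_name }}\n"   # records which node is being rebuilt
        dest: "{{ rebuild_pidfile }}"

    - name: Run the shutdown, destroy, and deploy tasks
      include_tasks: rebuild_tasks.yml

    always:
    - name: Release the file whether the rebuild succeeded or failed
      file:
        path: "{{ rebuild_pidfile }}"
        state: absent
```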

Watching Your Nodes

The node monitor tracks the health and connectivity of all nodes in the system, including the one running the monitor itself. If any new nodes are created, destroyed, or recreated, it must adjust its records to match the new system state and propagate these records to other nodes.

Tracking is done either by checks inherent in the monitoring program or by external programs invoked to perform the task(s). Ideally, you should be able to configure the number of successes, number of failures, and interval of time between checks separately. To allow for latency and complexity, you need separately configurable timeouts for each check.

On critical failure of any node, the node monitor will invoke the configuration manager to destroy and rebuild the node. Non-critical failures may be fixed if the configuration manager is run to restore the node's previous settings. Once a fault or failure condition is triggered, any corrective action must only occur on state transition, meaning the corrective action occurs once and does not repeat unless the monitor detects that the check, which continues to be run, has succeeded at least one time after the fault condition is set.

Listing 3 shows a code snippet from a monitor (Monit) configuration file. One of these files exists for, as well as on, each node in the system.

Listing 3

A Monit Configuration File

01 check program ping_test_node_5 with path "/usr/bin/ansible-playbook ping_monitor.yml -e 'vmname=webbox5'"
02         with timeout 15 seconds
03         if status != 0 for 3 cycles then exec "/usr/bin/ansible-playbook rebuild_node.yml"
04         if status != 0 for 3 cycles then exec "/usr/bin/mail -s 'Node 5 ping failed' sysadmin@giantco.cxm"
05
06 check program sync_all_node_5 with path "/usr/bin/ansible-playbook sync_files.yml -e 'vmname=webbox5'"
07         every 15 cycles
08         if status != 0 for 30 cycles then exec "/usr/bin/mail -s 'Node 5 sync failed' sysadmin@giantco.cxm"

Line 1 declares a new check program called ping_test_node_5 and defines the command to be run for the test. The command is ansible-playbook, the argument is the playbook file ping_monitor.yml, and the extra argument (vmname=webbox5) is the name of the node.

Line 2 declares the maximum amount of time to wait for the check program to finish (15 seconds). If the check program is still running after this time, it is forcibly terminated and its execution is considered to have failed.

Lines 3 and 4 check the exit status of the command. If the status is anything but zero three times in a row, another ansible-playbook command is run on rebuild_node.yml to reconstruct the faulty node. At the same time, email is sent to sysadmin@giantco.cxm under the same exit status conditions.

Line 6 declares another check program, sync_all_node_5, that runs an Ansible playbook called sync_files.yml with webbox5 as the target. This program is run every 15 cycles and synchronizes the data files used by all monitors on each node.

Line 8 states that if the synchronization fails twice in a row (15 x 2 = 30 cycles), email is to be sent to sysadmin@giantco.cxm.

Although many programs and packages (HAProxy, Nagios, Keepalived, etc.) have the ability to perform this function, in this test environment, the monitoring program is Monit version 5.25, which triggers only on state transition and requires one success to reset a fault state. Each cycle was equal to 20 seconds.

Because DHCP is used, a node's IP address can change between checks, so an Ansible playbook queries the hypervisor for the node's current address each time the ping check (or synchronization task) is executed. Non-nodal checks were configured in the main configuration file so that they would not be propagated by the synchronization task. In this test environment, a local HTTP server index page, the root filesystem capacity, and system health (CPU, memory usage) were all monitored separately in the main /etc/monitrc configuration file.
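The ping_monitor.yml playbook called in Listing 3 is not reproduced in this article, but a playbook of that kind might look roughly like the following sketch. It assumes the vCenter connection variables from Listing 2 are available to the playbook and uses the vmname extra variable passed on the Monit command line.

```yaml
# Sketch only: a guess at the general shape of such a check,
# not the playbook used in the test environment.
- hosts: localhost
  gather_facts: no
  tasks:
  - name: Ask vCenter for the node's current facts
    vmware_guest_facts:
      hostname: "{{ Vv_hostname }}"
      username: "{{ Vv_username }}"
      password: "{{ Vv_password }}"
      validate_certs: no
      datacenter: "{{ Vv_datacent }}"
      name: "{{ vmname }}"
    register: vm_facts

  - name: Ping the address vCenter reports right now
    command: ping -c 1 -W 2 {{ vm_facts.instance.ipv4 }}
    changed_when: false      # a ping never changes the system
```

If either task fails, ansible-playbook exits with a non-zero status, which is the condition the Monit check in Listing 3 tests for.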

System health checks were set to send a simple alert when triggered, and the HTTP index page was set to reboot the server when triggered so that it would be rebuilt by the other nodes. If needed, you can set different corrective actions. Listing 4 shows an example playbook for synchronizing nodes.

Listing 4


- hosts: localhost
  tasks:
  # Ask vCenter for the node's current details, including its DHCP address
  - name: Get facts for node
    vmware_guest_facts:
      hostname: "virtualhome.giantco.cxm"
      username: "admin"
      password: "pass123"
      datacenter: "losangelesctr"
      name: "{{ vmname }}"
    register: vm_facts

  # Copy only the files that the remote node does not already have
  - name: Synchronize unique files missing from remote directory
    command: rsync -avz --ignore-existing /etc/monit.d/ {{ vm_facts.instance.ipv4 }}:/etc/monit.d

As stated previously, the operating system can be configured to respawn the node monitor for greater resiliency.

Putting It All Together

Figure 3 summarizes the system's operation. Each node is represented by a light orange box on the left side of each panel. Panels A through C detail the removal of the bottom node, which the remaining two nodes each attempt to destroy. The two nodes then race to create a replacement, and a new instance is created by the victorious (topmost) node in panel D.

Figure 3: This table details each step of a repair action.

Panels E through H show the re-creation of the bottom node. Only the topmost node and the DHCP server(s) participate in this process. The instance is finally considered a working node in panel H.
