SMS Blog
High Availability in an Illumio Environment
The cyber war never stops. While we measure the uptime of our mitigation tools in nines (e.g., 99.9% vs. 99.999%), you can bet that your organization’s vulnerabilities are being picked apart 100% of the time. In fact, if you are not counting on that, you are likely in the dark as to your enterprise security posture.
As a result, zero trust architecture (ZTA) is becoming widely adopted. There is plenty of content covering it all over the blogosphere. Tools addressing data protection, continuous monitoring and validation, least privilege, segmentation, and a host of others are now common arrows in the quivers of administrators looking to keep the organization’s critical assets safe from harm.
Illumio Fundamentals
SMS partner Illumio offers a ZTA strategy that focuses on many of these principles. We have covered aspects of Illumio in past blogs. Our Director of Engineering, Miles Simpson, gives an overview of the product in his post, Zero Trust Microsegmentation with Illumio Core. Senior Engineer Ryan DeBerry offers an implementation approach for Illumio’s Policy Compute Engine (PCE) in a STIGed environment entitled Illumio PCE Automation.
Both articles provide an in-depth analysis of the product. As they note, Illumio’s PCE continuously monitors the agents, or virtual enforcement nodes (VENs), installed on the Windows, Mac, and Linux hosts it supports. When one or more VENs report a change, the PCE calculates a response and deploys the most appropriate configuration to the affected clients. The process is illustrated in the figure below…

As you may imagine, maximum availability for the PCE is key to the continuous monitoring and validation of the protected resources. While administrators cannot be expected to manage any mitigation tool with 100% uptime, Illumio includes several features and recommendations to aid administrators in this effort. This post seeks to explore those details for system administrators and engineers responsible for maintaining uptime in Illumio-managed environments.
PCE Services
When fully up and running, the PCE depends on a host of services. A complete list of those services can be displayed when checking the status of the PCE using the following command…
sudo -u ilo-pce illumio-pce-ctl status -sv
You may review a sample of that output below…

The image above shows the PCE in its fully operational state, but Illumio offers several maintenance environments, or “runlevels,” depending on your objective. Not all of these services are necessary at every runlevel; however, the first three (syslog, consul-agent, and service-discovery) are key to maintaining the PCE in all modes.
Monitoring
An important feature supporting maximum uptime is service discovery. Service discovery is supported by the service-discovery agent which monitors the PCE for services that are failing. If a service is found to be failing, the service-discovery agent will attempt to restart the service without manual intervention. This is an important function to note when assessing the severity of a failed service.
Logging
Should a service fail that the service-discovery agent cannot rectify, Illumio offers plenty of logs to examine the issue, which can be found at…
/var/log/illumio-pce/
I have taken the liberty of listing those logs and displaying the output below…
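A similar listing can be produced on your own PCE, along with a quick pass for logs that mention a problem. The following is a minimal sketch, not an Illumio-provided tool: the function names and the default search pattern "error" are my own assumptions, and the directory is simply the one named above.

```shell
# Assumption: logs live in the directory named in this post.
# Override LOG_DIR if your installation differs.
LOG_DIR="${LOG_DIR:-/var/log/illumio-pce}"

# List the log files, most recently modified first.
list_pce_logs() {
  ls -lt -- "$LOG_DIR"
}

# Print the names of log files that contain the given pattern
# (case-insensitive). The default pattern is only a starting guess.
logs_matching() {
  pattern="${1:-error}"
  grep -ril -- "$pattern" "$LOG_DIR"
}
```

Depending on file permissions, you may need to run these as a user that is allowed to read the log directory. Calling logs_matching with the name of the failing service instead of "error" narrows the search.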

Clusters
The PCE comes as a cluster, and that cluster comes in several flavors. For small offices or lab purposes, single-node clusters (SNCs) are available. The SNC combines the core node (the front end) with the data node (the database) in a single physical or virtual machine.
Multi-node clusters (MNCs), found more commonly in enterprise environments, separate the core nodes from the data nodes. Illumio offers a variety of multi-node clusters which include 2 core nodes X 2 data nodes (small), 2X2 (standard), 4X2 (large), and a super cluster, which is used in environments supporting more than 25,000 VENs. Note that most implementations limit the database to two nodes.
While clusters allow for parallel computing, two or more nodes in the cluster may not act autonomously. That is to say, one member of the cluster must be elected as leader, which all other nodes must follow, so processing is carried out efficiently. To maintain leadership within the cluster, a quorum is required. To achieve the odd number necessary to establish a majority, only the primary data node and the core nodes can participate.
This becomes an important element when splitting the cluster for implementation across multiple locations, which is common for high availability applications. For instance, to mitigate against single node failures, or even just increase throughput, many organizations will spread resources across multiple environments. A PCE deployment in this vein might look something like this…

When splitting the cluster, it is important to mitigate against the failure of the quorum. A 2X2 cluster, for example, may have one core node and one data node isolated from their peers, as pictured above. However, only the pair with the primary data node would have the majority required of a quorum. If a disaster strikes the datacenter holding that majority, the whole cluster will fail, as the secondary data node cannot participate. Perhaps for this reason (likely among others), Illumio requires that latency between the sites of a split cluster stay at or below 20ms.
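To make the quorum arithmetic concrete, here is a minimal shell sketch. It assumes, as described above, that the voters are the core nodes plus the primary data node and that a simple majority is required. The function names are mine for illustration and are not part of any Illumio tool.

```shell
# Voting members: every core node plus the primary data node.
voters() {
  echo $(( $1 + 1 ))
}

# Smallest number of voters that forms a majority.
quorum_size() {
  echo $(( ($1 + 1) / 2 + 1 ))
}

# Succeeds if one site can form a quorum by itself.
#   $1 = total core nodes in the cluster
#   $2 = core nodes located at this site
#   $3 = 1 if this site holds the primary data node, else 0
site_has_quorum() {
  site_votes=$(( $2 + $3 ))
  [ "$site_votes" -ge "$(quorum_size "$1")" ]
}

# A 2X2 cluster split evenly across two sites:
site_has_quorum 2 1 1 && echo "site with primary data node keeps quorum"
site_has_quorum 2 1 0 || echo "site with secondary data node cannot form quorum"
```

For a 2X2 cluster there are three voters and a quorum of two, so only the site holding the primary data node can keep the cluster alive when the sites lose contact with each other.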
Warm Standby
To address these or other types of failures, Illumio offers a strategy for restoration called “warm standby.” Warm standby relies on two separate, fully installed PCEs running the same PCE version, one actively serving the environment, and one serving as a “standby” that can be used to restore the environment in case of a failure.
To prepare a cluster for the role of “standby PCE” to the active PCE, take these initial steps on the standby…
sudo -u ilo-pce illumio-pce-ctl reset
sudo -u ilo-pce illumio-pce-ctl start --runlevel 1
sudo -u ilo-pce illumio-pce-db-management setup
Once those commands have been run, edit the file /etc/illumio-pce/runtime_env.yml to include the following lines…
active_standby_replication:
  active_pce_fqdn: <proposed standby fqdn>
Note that “<proposed standby fqdn>” above is an entirely new fully qualified domain name (FQDN) that will be introduced when the standby is established. Be sure to account for that host name in your Domain Name System (DNS) and load-balancing strategies. Further, the same changes to /etc/illumio-pce/runtime_env.yml will need to be made on the active PCE.
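Because the same edit has to land on both PCEs, it can help to confirm that the setting is actually present on each before moving on. The sketch below is my own illustration rather than an Illumio utility: the function name is hypothetical, and it only prints the active_standby_replication key with any indented lines beneath it, without validating the value.

```shell
# Path from this post; override RUNTIME_ENV to check a copy instead.
RUNTIME_ENV="${RUNTIME_ENV:-/etc/illumio-pce/runtime_env.yml}"

# Print the active_standby_replication key and the indented lines under it.
# Returns non-zero if the key is not found in the file.
show_replication_block() {
  awk '
    /^active_standby_replication:/ { in_block = 1; found = 1; print; next }
    in_block && /^[[:space:]]+/    { print; next }
    { in_block = 0 }
    END { exit(found ? 0 : 1) }
  ' "$RUNTIME_ENV"
}
```

Running show_replication_block on the active PCE and on the standby, then comparing the two outputs by eye, is a quick way to catch a node that was missed.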
Once those changes are made, set the standby to the maintenance mode, or runlevel, necessary to configure it for standby operation using the following command…
sudo -u ilo-pce illumio-pce-ctl set-runlevel 2
Next, log into the web user interface of the active PCE and access the API credentials, as below…

Click on “Add” in the upper left corner, and create your keys…

Once you’ve entered a name and optional description, click “Create.” Note the following output…

Keys in hand, copy the authentication username and enter the following command on the proposed standby…
export ILO_ACTIVE_PCE_USER_NAME=api_13ad3ab2934f4278d
Likewise, copy the secret, and do the same…
export ILO_ACTIVE_PCE_USER_PASSWORD=21e6aafa11c8f7741bff23b16bc4f4c7bd8b7f0fc55941b0e004e4ad5ef7cf9e
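A stray space after the equals sign, or a partial paste, leaves one of these variables empty, and the setup step that reads them through sudo -E would then run without valid credentials. As a hedge against that, here is a small sanity check of my own (the function name is hypothetical, not an Illumio command) that only confirms both variables are set, non-empty, and free of spaces. It does not verify that the key is accepted by the active PCE.

```shell
# Succeeds only if both variables named in this post are set, non-empty,
# and contain no spaces. Prints the reason to stderr on failure.
check_standby_creds() {
  for name in ILO_ACTIVE_PCE_USER_NAME ILO_ACTIVE_PCE_USER_PASSWORD; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name-}"
    if [ -z "$value" ]; then
      echo "$name is not set" >&2
      return 1
    fi
    case "$value" in
      *" "*) echo "$name contains a space" >&2; return 1 ;;
    esac
  done
}
```

Run check_standby_creds in the same shell session where you exported the variables; if it reports a problem, re-export the value before continuing.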
The necessary credentials established, you may configure the proposed standby per the following…
sudo -u ilo-pce -E illumio-pce-ctl setup-standby-pce --active-pce <active pce fqdn>:<active pce fqdn port>
…which may look something like…
sudo -u ilo-pce -E illumio-pce-ctl setup-standby-pce --active-pce activepce.local:8443
When this command is complete, restart the cluster on the active PCE to complete the setup process…
sudo -u ilo-pce illumio-pce-ctl cluster-restart
Now, your environment is configured with a PCE in warm standby, ready to take over in case of a catastrophic failure. It is important to remember that promotion of the standby is a manual process, which must be completed with the command…
sudo -u ilo-pce illumio-pce-ctl promote-standby-pce
Thankfully, no changes should be required on the VENs to reestablish sessions with the new PCE. To confirm that the VENs are synchronizing appropriately, check the new PCE with the following…
sudo -u ilo-pce illumio-pce-ctl promote-standby-check
Conclusion
While no post can address the full scope of high availability concerns you may run into, I hope this has been a helpful overview of the basics of maintaining uptime for your Illumio environment.
For more information on this or any aspect of Illumio’s products, visit illumio.com.