As demand for robust, performant solutions grows, High Availability (HA) is becoming a critical requirement for companies that cannot afford system disruptions and downtime. Implementing HA for your infrastructure is a key strategy for mitigating the impact of planned and unplanned interruptions, such as system maintenance, power outages, component failures, and human intervention, so that your systems remain continuously operational.
In this blog post, we introduce the HA capabilities of the Micro Focus Operations Bridge 2017.11 container deployment. We also review how HA is implemented and the considerations you need to take into account.
Please note that the information in this blog post does not apply to the Operations Bridge classic deployment. For information on how HA is implemented in Operations Bridge classic, see the High Availability section of the Operations Bridge help center.
Recommendations for configuring Operations Bridge High Availability
In an Operations Bridge deployment that uses the Container Deployment Foundation (CDF), HA is achieved through component-level redundancy: you add extra operational capacity so that the system can absorb performance degradation or hardware failures. To make your systems highly available, we recommend that you:
- Build in highly available redundant storage – Use a redundant NFS server for your container deployment
- Add highly available database instances – Use a redundant external database
- Set up a CDF Kubernetes cluster with multiple master and multiple worker nodes – You must have at least 3 master nodes and at least 2 worker nodes that have enough resources to host OMi.
For a list of installation requirements, see the Prepare for the installation section of the Operations Bridge help center.
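The node-count recommendation above can be expressed as a quick sanity check. The following is a minimal sketch in Python. The function and constant names are hypothetical and not part of CDF, and the sketch only counts nodes; it does not verify that the worker nodes have enough resources to host OMi.

```python
from typing import List

# Hypothetical helper for illustration only; not part of CDF or Operations Bridge.
MIN_MASTERS = 3  # "at least 3 master nodes"
MIN_WORKERS = 2  # "at least 2 worker nodes"


def topology_problems(master_count: int, worker_count: int) -> List[str]:
    """Return a list of problems; an empty list means the node counts are sufficient."""
    problems = []
    if master_count < MIN_MASTERS:
        problems.append(
            f"need at least {MIN_MASTERS} master nodes, found {master_count}"
        )
    if worker_count < MIN_WORKERS:
        problems.append(
            f"need at least {MIN_WORKERS} worker nodes, found {worker_count}"
        )
    return problems


if __name__ == "__main__":
    # A planned cluster with one master and two workers fails the master check.
    for problem in topology_problems(master_count=1, worker_count=2):
        print(problem)
```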
High Availability of the Kubernetes Cluster
As stated above, multiple master nodes are mandatory for high availability. In a multi-master setup, connection redundancy is achieved by using a virtual IP (VIP) address that is shared by all members of the HA pool. If the master node that owns the VIP goes down, the VIP is automatically transferred to another master node in the same HA pool.
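To make the VIP handover concrete, here is a toy model in Python. It is only a sketch: the class name, the node names, the sample address, and the rule "pick the first healthy master" are assumptions made for illustration and do not describe how CDF actually chooses the next owner.

```python
from typing import Dict, List, Optional


class HaPool:
    """Toy model of master nodes that share one virtual IP (VIP)."""

    def __init__(self, vip: str, masters: List[str]):
        self.vip = vip
        # Every master starts out healthy; dict order is the order given.
        self.healthy: Dict[str, bool] = {name: True for name in masters}
        self.owner: Optional[str] = masters[0] if masters else None

    def mark_down(self, name: str) -> None:
        self.healthy[name] = False
        if self.owner == name:
            self._elect()  # the VIP moves only if its owner failed

    def mark_up(self, name: str) -> None:
        self.healthy[name] = True
        if self.owner is None:
            self._elect()  # nobody held the VIP, so a recovered node takes it

    def _elect(self) -> None:
        # Illustrative rule (an assumption): first healthy master wins.
        self.owner = next((n for n, ok in self.healthy.items() if ok), None)


if __name__ == "__main__":
    pool = HaPool("192.0.2.10", ["master1", "master2", "master3"])
    pool.mark_down("master1")
    print(pool.vip, "is now owned by", pool.owner)
```

Clients keep using the same address throughout; only the node that answers on it changes.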
You configure the VIP address during the CDF installation. One option is to set the HA_VIRTUAL_IP parameter in the <foundation_temp_dir>/install.properties file. (This parameter is mandatory if you want to install multiple master nodes.) Alternatively, you can set HA_VIRTUAL_IP by specifying the -ha-virtual-ip option of the install.sh CLI command.
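Before starting the installation, it can help to confirm that HA_VIRTUAL_IP is present and well formed. The sketch below assumes that install.properties uses plain key=value lines with # comments, which is typical for .properties files but is not confirmed here. The helper names and the sample address (from the 192.0.2.0/24 documentation range) are hypothetical.

```python
import ipaddress
from typing import Dict


def read_properties(text: str) -> Dict[str, str]:
    """Parse key=value lines, skipping blank lines and # comments (assumed format)."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_ha_virtual_ip(props: Dict[str, str], master_count: int) -> str:
    """Return 'ok', 'missing', or 'invalid' for the HA_VIRTUAL_IP setting."""
    vip = props.get("HA_VIRTUAL_IP", "")
    if not vip:
        # The parameter is mandatory only when installing multiple masters.
        return "missing" if master_count > 1 else "ok"
    try:
        ipaddress.ip_address(vip)
    except ValueError:
        return "invalid"
    return "ok"


if __name__ == "__main__":
    sample = "# sample content, not a real install.properties\nHA_VIRTUAL_IP=192.0.2.10\n"
    print(check_ha_virtual_ip(read_properties(sample), master_count=3))
```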
As a next step, in the Connection screen of the Operations Bridge installation wizard, set the External hostname to a fully qualified domain name that resolves to the value of the HA_VIRTUAL_IP parameter.
When these properties are specified, the load balancer (ingress) instance and keepalived are launched on each master node (keepalived is routing software that binds the VIP to a master node). If the node that owns the VIP goes down, another master takes over the VIP, thus providing High Availability.
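The requirement that the External hostname resolves to the VIP can be checked before you run the wizard. The following Python sketch uses the standard library resolver. The function name is hypothetical, the check covers IPv4 only, and the resolver is passed in as a parameter so that the logic can be exercised without real DNS.

```python
import socket
from typing import Callable


def hostname_matches_vip(
    hostname: str,
    vip: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    """Return True if hostname resolves (IPv4) to exactly the given VIP."""
    try:
        return resolve(hostname) == vip
    except OSError:
        # socket.gaierror is a subclass of OSError: the name did not resolve.
        return False


if __name__ == "__main__":
    # Stand-in resolver so the example runs without network access.
    fake_dns = {"opsbridge.example.com": "192.0.2.10"}
    print(hostname_matches_vip("opsbridge.example.com", "192.0.2.10", fake_dns.__getitem__))
```

With the default resolver, calling hostname_matches_vip with your own hostname and VIP performs a real lookup from the machine where the script runs.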
High Availability of Operations Bridge components
Operations Manager i (OMi)
The Operations Manager i (OMi) HA deployment is achieved by component-level redundancy. That is, for the HA setup, you must have at least three master nodes and at least two worker nodes that have enough resources to host OMi.
When running the Installer, select the option Enable high availability for Operations Management i.
Note: Currently, you cannot reconfigure the deployment to enable HA after the Operations Bridge installation has been completed.
When OMi HA is enabled, two OMi pods are created by the OMi StatefulSet controller. (This controller manages the deployment and scaling of the pods.) The OMi pods start in sequential order, each on a different worker node. This means that the second replica is not scheduled until the first one is fully functional.
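The sequential startup described above can be pictured with a few lines of Python. This is a simplified sketch, not Kubernetes code: the function name is hypothetical, and the readiness check is a stand-in for whatever the real controller observes.

```python
from typing import Callable, List


def start_in_order(pod_names: List[str], is_ready: Callable[[str], bool]) -> List[str]:
    """Start pods one at a time; a pod is started only after its predecessor is ready.

    Returns the names of the pods that were started.
    """
    started = []
    for name in pod_names:
        started.append(name)    # this pod is created and scheduled
        if not is_ready(name):  # later pods wait until this one is ready
            break
    return started


if __name__ == "__main__":
    # If omi-0 never becomes ready, omi-1 is never started.
    print(start_in_order(["omi-0", "omi-1"], lambda name: False))
```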
The created OMi pods are single-server deployments named omi-0 and omi-1. After the initial startup, omi-0 is configured as the primary (active) data processing server (DPS) and omi-1 as the backup (passive) DPS. If the active DPS fails at any point, the passive DPS automatically takes over after a short interval whose length depends on various factors. In the best case, the interval is minimal and an OMi operator sees no interruption at all. In worse cases, a connection error is displayed for a short period of time while the UI reconnects automatically. Any data that is sent during the downtime is buffered and is processed as soon as the backup DPS has taken over as the active DPS, so no data is lost.
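The active/passive behavior with buffering can be sketched as a small Python model. It is an illustration under stated assumptions: the class and method names are hypothetical, and the model does not say where OMi actually buffers data. It only shows why nothing is lost if events are held until the backup DPS becomes active.

```python
from collections import deque
from typing import List, Tuple


class DpsPair:
    """Toy model of an active/passive DPS pair with buffering during failover."""

    def __init__(self) -> None:
        self.active = "omi-0"
        self.passive = "omi-1"
        self.active_up = True
        self.buffer = deque()  # events received while no DPS is active
        self.processed: List[Tuple[str, str]] = []  # (server, event) pairs

    def send(self, event: str) -> None:
        if self.active_up:
            self.processed.append((self.active, event))
        else:
            self.buffer.append(event)  # held back, not dropped

    def fail_active(self) -> None:
        self.active_up = False

    def complete_failover(self) -> None:
        # The passive DPS becomes active, then drains the buffer in arrival order.
        self.active, self.passive = self.passive, self.active
        self.active_up = True
        while self.buffer:
            self.processed.append((self.active, self.buffer.popleft()))


if __name__ == "__main__":
    pair = DpsPair()
    pair.send("event-1")
    pair.fail_active()
    pair.send("event-2")  # arrives during the downtime
    pair.complete_failover()
    print(pair.processed)
```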
If a node that hosts one of the OMi pods becomes unreachable, the pod goes into an "Unknown" or "Terminating" state. The pod is not rescheduled to a different node in the Kubernetes cluster unless one of the following happens: the node object is deleted (manually or by the Node Controller), the pod is forcefully deleted, or the kubelet on the node becomes responsive again, shuts down the pod, and removes the pod's entry from the API server. This limitation exists because Kubernetes requires applications that run in a StatefulSet to have a stable network identity and stable storage. In general, Kubernetes avoids creating multiple instances of the same pod to prevent data corruption.
For a list of further considerations, see the Notes and Limitations section of the Configure scaling and high availability topic in the Operations Bridge help center. To learn more about Kubernetes, see the Kubernetes documentation.
Business Value Dashboard
Business Value Dashboard (BVD) pods are highly available by default, even if you have not scaled them out. Nonetheless, if one of the worker nodes fails, restarting the pods will cause a short interruption of your system. To ensure constant system availability in such situations, you can increase the number of BVD receiver (bvd-receiver-deployment) and web server (bvd-www-deployment) pod replicas. For a procedure on how to implement this, see the Scale a BVD deployment horizontally section of the Configure scaling and high availability topic in the Operations Bridge help center.
A short downtime will then only occur if the BVD Redis pod is affected by the crash of a worker node or the Redis process. In this case, Kubernetes restarts the BVD Redis pod, which usually takes less than a minute. Any data sent during the interruption of the BVD Redis pod is buffered by the BVD receivers, so no data loss will occur.
Operations Bridge Reporter and Performance Engine
Operations Bridge Reporter (OBR) and Performance Engine (PE) do not support running multiple replicas of a pod on different worker nodes. Therefore, when a worker node fails, all pods that were running on it have to be restarted. The length of the interruption depends on the restart time of these pods and may be several minutes. If all containers are affected by the worker node failure, you can expect PE to restart within 3 minutes and OBR within 5 minutes.
If you have feedback or suggestions, don’t hesitate to comment on this article.
Explore the full capabilities of Operations Bridge by taking a look at our Operations Manager i, Operations Bridge Analytics, Operations Bridge Reporter, Operations Connector, Business Value Dashboard, and Operations Orchestration documentation.
Read all our news at the Operations Bridge blog.