Failure Scenarios for the Tetrate Management Plane
We will consider the following failure scenarios:
- Workload Cluster
- Edge Control Plane
- Loss of connectivity from Management to Workload
- Central Control Plane
- Management Plane
- Management Cluster
... and evaluate the effect of the failure on the following operations:
- Running production workloads - the availability, security and correct operation of production workloads
- Local Cluster operations - this includes the direct modification of cluster configuration such as kubectl actions, and indirect modification (i.e. changes made by the local Edge Control Plane to apply TSB policies or update service discovery endpoints)
- Metrics Collection - the central collection and storage of metrics from remote workload clusters
- Management Operations - TSB configuration changes, performed by GitOps, API or Management UI
We will look at the typical recovery scenario when a failed component recovers or is restored.
Architecture and Terms
In this guide, we’ll use the following Architecture description:
- Workload Cluster: A Workload Cluster is a Kubernetes cluster that hosts production workloads
- Production Workload: A Production Workload is an app or service running in a Workload Cluster. In this guide, the term ‘Production Workload’ also covers non-production workloads running in a Workload Cluster
- Data Plane: The Data Plane is the local Istio instance, deployed in the Workload Cluster
- Edge Control Plane: The Edge Control Plane is the Tetrate software component installed in the istio-system and other namespaces (e.g. cert-manager, xcp-multicluster) in the Workload Cluster. It configures the local Istio Data Plane, and reports state to the Central Control Plane
- Management Cluster: The Management Cluster is the Kubernetes cluster that hosts the Tetrate management plane components (Management Plane, Central Control Plane)
- Central Control Plane: The Central Control Plane is the Tetrate software component that accepts configuration from the Management Plane and status information from Edge Control Planes. It evaluates the entire configuration, then distributes necessary configuration updates to each Edge Control Plane
- Management Plane: The Management Plane is the Tetrate software component that entities (GitOps, API clients, UI clients) interact with. It provides RBAC access control to control which entities can CRUD which configuration. Configuration is stored locally, and synced to the Central Control Plane
For more information, please refer to the Tetrate Architecture Documentation.
Terms
- "Failure" means loss of availability of the relevant component
- "Recovery" means regaining availability of the relevant component, likely with out-of-date configuration or status
- "Restoring" means reinstalling a failed component, where it’s not possible to recover the component
- ✅ A component or service is not affected
- ⚠️ A limited loss-of-service occurs.
- ❌ A total loss-of-service in the affected component occurs
Failure of Workload Cluster
Scenario: There is a catastrophic failure of a single Workload Cluster.
Impacts
Operations | Impact | Status |
---|---|---|
Running Workloads | Workloads in the failed cluster are unavailable. Workloads on other clusters are unaffected. Where Workload HA (Tier1 and east-west gateways) is configured, service continues without interruption. | ⚠️ |
Local Cluster Ops | Local cluster changes cannot be made. Other clusters are unaffected. | ⚠️ |
Metrics Collection | Metrics cannot be collected from the local cluster. Other cluster metric collection unaffected. | ⚠️ |
Management Ops | Changes to the affected Workload Cluster are queued, and applied when the cluster recovers. All other management operations are unaffected. | ✅ |
Recovery
When the Workload Cluster recovers, the changes that were queued during the outage are applied to it, and metrics collection resumes.
Restoration
If necessary, the Tetrate Edge Control Plane can be re-installed. When the cluster is re-introduced to the management plane, it will sync to the correct configuration.
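The queue-then-apply behaviour described in this section can be pictured with a small model. This is a minimal sketch under stated assumptions, not Tetrate's implementation: the class `ManagementSide`, its methods, and the use of a plain dictionary to stand in for the configuration held by an Edge Control Plane are all hypothetical names introduced for illustration.

```python
from collections import defaultdict


class ManagementSide:
    """Hypothetical model of the management side; not Tetrate's actual code."""

    def __init__(self):
        self.desired = defaultdict(dict)  # cluster -> {object name: spec}
        self.queued = defaultdict(list)   # cluster -> object names awaiting delivery
        self.reachable = {}               # cluster -> True/False

    def register(self, cluster):
        self.reachable[cluster] = True

    def fail(self, cluster):
        self.reachable[cluster] = False

    def change(self, cluster, name, spec, edge):
        # The desired state is always recorded, even if the cluster is down.
        self.desired[cluster][name] = spec
        if self.reachable.get(cluster):
            edge[name] = spec
        else:
            self.queued[cluster].append(name)

    def recover(self, cluster, edge):
        # Recovery: the existing edge catches up on everything it missed.
        self.reachable[cluster] = True
        for name in self.queued.pop(cluster, []):
            edge[name] = self.desired[cluster][name]

    def restore(self, cluster):
        # Restoration: a re-installed edge starts empty and receives the
        # full desired configuration for its cluster.
        self.reachable[cluster] = True
        self.queued.pop(cluster, None)
        return dict(self.desired[cluster])


mgmt = ManagementSide()
edge = {}
mgmt.register("cluster-a")
mgmt.change("cluster-a", "gateway", "v1", edge)
mgmt.fail("cluster-a")
mgmt.change("cluster-a", "gateway", "v2", edge)  # queued, not applied
print(edge)                                      # {'gateway': 'v1'}
mgmt.recover("cluster-a", edge)
print(edge)                                      # {'gateway': 'v2'}
```

The point of the sketch is that management operations keep succeeding while a cluster is down, because the desired state is stored centrally; the failed cluster only needs to catch up (recovery) or be handed the full state again (restoration).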
Failure of Edge Control Plane
Scenario: There is a catastrophic failure of the Edge Control Plane in a single Workload Cluster.
Impacts
Operations | Impact | Status |
---|---|---|
Running Workloads | Running Workloads in local or remote clusters are not affected. | ✅ |
Local Cluster Ops | Local cluster changes (kubectl) are unaffected. You can continue to push updates to the cluster. Depending on the nature of the failure: | ⚠️ |
Metrics Collection | Metrics are collected locally, reduced, then forwarded to Management Plane Elasticsearch. If collector services are unavailable, metrics might not be collected. | ⚠️ |
Management Ops | Changes to the affected Workload Cluster are queued and applied when Edge Control Plane recovers. All other management operations are unaffected. | ✅ |
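The collect, reduce, forward sequence for metrics mentioned in the table above can be sketched as follows. This is only an illustration: the guide does not say how metrics are reduced, so the per-service aggregation here is an assumption, the function names `reduce_samples` and `forward` are hypothetical, and a plain list stands in for the central metrics store.

```python
from collections import defaultdict
from statistics import mean


def reduce_samples(samples):
    """Collapse raw (service, latency_ms) samples into one summary per service."""
    by_service = defaultdict(list)
    for service, latency_ms in samples:
        by_service[service].append(latency_ms)
    return {
        service: {
            "count": len(values),
            "mean_ms": mean(values),
            "max_ms": max(values),
        }
        for service, values in by_service.items()
    }


def forward(summaries, sink):
    """Send one document per service to the sink (a list in this sketch)."""
    for service, summary in sorted(summaries.items()):
        sink.append({"service": service, **summary})
    return len(summaries)


raw = [("cart", 10), ("cart", 30), ("pay", 5)]  # collected locally
central_store = []                              # stand-in for the central store
sent = forward(reduce_samples(raw), central_store)
print(sent)  # 2
```

Because every step runs inside the Workload Cluster before anything leaves it, a failure of the local collector services interrupts the pipeline at its source, which is why metrics might not be collected in this scenario.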
Recovery
When the Edge Control Plane recovers, the changes that were queued during the outage are applied to the cluster, and metrics collection resumes.
Restoration
If necessary, the Tetrate Edge Control Plane can be re-installed. When the cluster is re-introduced to the management plane, it will sync to the correct configuration.
Loss of Connectivity - Workload to Management Cluster
Scenario: There is a loss of connectivity between the Workload Cluster and the central Management Cluster.
Impacts
Operations | Impact | Status |
---|---|---|
Running Workloads | Running Workloads in local or remote clusters are not affected. | ✅ |
Local Cluster Ops | Local cluster changes (kubectl) are unaffected. New workloads may run partially configured until connectivity is restored. They may acquire global cluster policies that are already in the cluster but will lack namespace-targeted and fine-grained policies. Local service discovery endpoints for remote services are not updated. GitOps operations may be interrupted. | ⚠️ |
Metrics Collection | Metrics are collected and queued in affected Workload Clusters. Long-term connectivity loss will result in some loss of metrics. | ⚠️ |
Management Ops | Changes to the affected Workload Cluster(s) are queued and applied when connectivity is restored. All other management operations are unaffected. | ✅ |
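The metrics behaviour in the table above (queued locally, with some loss after a long outage) is what you would expect from a bounded buffer. The sketch below assumes a drop-oldest policy and a fixed capacity; neither is stated in this guide, and `MetricsBuffer` is a hypothetical name used only for illustration.

```python
from collections import deque


class MetricsBuffer:
    """Hypothetical bounded queue of samples awaiting delivery."""

    def __init__(self, capacity):
        self._queue = deque(maxlen=capacity)
        self.dropped = 0

    def record(self, sample):
        # When the buffer is full, the deque evicts the oldest sample.
        if len(self._queue) == self._queue.maxlen:
            self.dropped += 1
        self._queue.append(sample)

    def flush(self, send):
        # Called once connectivity returns: deliver everything still buffered.
        sent = 0
        while self._queue:
            send(self._queue.popleft())
            sent += 1
        return sent


buffer = MetricsBuffer(capacity=3)
for sample in range(5):          # five samples arrive during the outage
    buffer.record(sample)

received = []
buffer.flush(received.append)    # connectivity restored
print(received)                  # [2, 3, 4]
print(buffer.dropped)            # 2
```

A short outage fits inside the buffer and loses nothing; a long one overflows it, and the oldest samples are the ones that never reach the central store.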