Before you begin - your options for Management Plane HA
Quick-reference - what are your options for Management Plane HA?
High-availability concerns for the Tetrate Management Plane can be addressed in a variety of ways. Before delving into the details of Management Plane HA, it is useful to have an idea of which approach you intend to take.
What is your HA requirement for the Management Plane?
The following considerations will help you determine your posture with respect to implementing HA for the Management Plane.
- The majority of Tetrate users operate a single Management Plane instance and maintain a highly-available external configuration database (Postgres). With appropriate automation, they can deploy a new Management Plane quickly should this ever be required.
- Some Tetrate users operate a pair of Management Plane instances in an 'active-standby' fashion. Configuration may be located in a shared external database, or in dedicated databases with a replication process. Failover may be achieved by updating the DNS FQDN for the management plane.
Consider which approach is most appropriate, given the level of automation you can achieve and your HA goals.
What are the failure modes for the Tetrate Management Plane?
The Tetrate Management Plane can become unavailable due to connectivity errors, infrastructure failures, or internal errors.
There are three potential failure modes for the Tetrate Management Plane:
- Connectivity Error: Some or all of your control planes cannot connect to the Management Plane, perhaps due to a firewall or routing error. The Management Plane continues to operate normally, and these control planes continue to function locally, but their configuration is not updated and metrics are not forwarded. Where possible, address this issue by restoring connectivity.
- Infrastructure Error: The Management Plane or its dependent components (e.g. external database) are lost due to a catastrophic infrastructure error. Control Plane clusters continue to function locally. Where possible, address this by restoring the Management Plane using existing pipelines. Alternatively, perform a failover if you are operating an active-standby cluster.
- Internal MP Error: The Tetrate Management Plane is composed of multiple, loosely-coupled services that can be restarted or upgraded as needed. Where possible, attempt to restore service by troubleshooting and restarting the affected services. If you cannot restore the Management Plane, consider a re-install using existing pipelines, or a failover if you are operating an active-standby cluster.
What are my options for high availability and disaster recovery?
A single Management Plane can be quickly restored if the database is reliable and automation makes installation quick and easy.
In all cases, you should maintain backups of the Management Plane configuration (the secrets and authentication tokens stored in the K8s cluster) and the service configuration (stored in the Postgres configuration database).
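For illustration, a backup along these lines could be scheduled as a regular job. This is a minimal sketch only: the namespace, database host, database name and user below are assumptions, and should be replaced with the values for your installation.

```python
#!/usr/bin/env python3
"""Minimal sketch: back up Management Plane secrets and the Postgres config DB.

The namespace, database host, database name and user are illustrative
assumptions; credentials are expected via PGPASSWORD or ~/.pgpass.
"""
import datetime
import pathlib
import subprocess

BACKUP_DIR = pathlib.Path("/backups") / datetime.date.today().isoformat()
MP_NAMESPACE = "tsb"                    # assumed Management Plane namespace
PG_HOST = "postgres.example.internal"   # assumed external Postgres host
PG_DB = "tsb"                           # assumed configuration database name
PG_USER = "tsb"                         # assumed database user

def main() -> None:
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)

    # Export secrets and certificates from the Management Plane namespace.
    secrets = subprocess.run(
        ["kubectl", "get", "secrets", "-n", MP_NAMESPACE, "-o", "yaml"],
        check=True, capture_output=True, text=True,
    ).stdout
    (BACKUP_DIR / "mp-secrets.yaml").write_text(secrets)

    # Dump the service configuration database with pg_dump.
    subprocess.run(
        ["pg_dump", "-h", PG_HOST, "-U", PG_USER, "-d", PG_DB,
         "-f", str(BACKUP_DIR / "tsb-config.sql")],
        check=True,
    )

if __name__ == "__main__":
    main()
```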
Tetrate strongly recommends that you host a highly-reliable Postgres database, external to the Management Plane's Kubernetes cluster; this makes it easy to reinstall the Management Plane, or to operate an active-standby pair.
You then have several options for high availability and disaster recovery:
- Orchestrate the installation of the Management Plane: If you stand up the Tetrate Management Plane using a CI/CD or other automation, you can quickly restore the management plane.
- Operate an active-standby pair: With a shared, external database, it is straightforward to operate an active-standby pair of Management Plane instances. To fail over, shut down the active MP instance and update the DNS FQDN to point to the standby.
- Perform database replication: Where it is not preferable to operate a shared, external database, you can operate a Postgres database for each MP instance. You will then need to implement a replication process to keep the standby MP instance's database synchronized with the active one.
How reliable is the Tetrate Management Plane?
By design, the loosely-coupled, microservice architecture with watchdog operators provides a high degree of self-healing and reliability.
The Tetrate Management Plane is composed of multiple, loosely-coupled services running in a Kubernetes cluster. These services are managed by a set of Kubernetes operators which perform watchdog operations (restarting crashed components) and orchestrate reconfigurations and upgrades. This architecture brings a form of 'self-healing' that provides a high degree of availability in the event of internal failures and operational changes such as upgrades.
The critical components of the Management Plane are:
- A small amount of internal configuration (certificates, secrets), stored in the Kubernetes namespace
- Desired service configuration, stored in a local or remote Postgres database
Additionally, metrics are stored in a local or remote Elasticsearch database, but these are generally not critical.
Provided that regular backups of the internal and service configuration are maintained, and the Management Plane can be deployed using a GitOps or CI/CD pipeline or equivalent automation, the Management Plane can be quickly restored in the event of an irrecoverable internal failure.
The most common failure scenarios arise from Kubernetes cluster errors or loss of connectivity to the Management Plane.
What is the effect of a Management Plane failure?
A failure does not affect dataplane traffic, but will potentially delay or prevent configuration changes and metrics collection.
You'll recall that the Tetrate architecture is loosely coupled, with Control Plane services on each workload cluster and a central Management Plane service.
The scenarios guide explains the various failure modes in full detail, but in summary, if the Management Plane fails, some or all of the following effects will be seen. The impact depends on the nature of the failure, ranging from an individual component to a total loss of connectivity to the Management Plane:
- Dataplane operations are not affected. Services will continue to run without any impact
- Metrics collection may stall for control planes that cannot forward metrics to the collectors in the Management Plane
- Configuration updates may be stalled if the Management Plane cannot push configuration to some or all of the control planes
- State changes are not communicated from control plane clusters to the management plane, so (for example) East-West failover cannot be orchestrated
- GitOps changes from the control plane clusters will be queued until the management plane is available
Control Plane clusters will forward any pending metrics and will obtain updated configuration once the Management Plane is restored.
What HA approach should I choose?
Given the summary information above, you may choose one of the following approaches:
- Run a single Management Plane instance with an external Postgres database: Management Plane failures are infrequent, and you can quickly install a new instance using existing, tested pipelines
- Run an active-standby Management Plane pair with an external Postgres database: If you cannot easily stand up a new cluster and re-install the Management Plane, an active-standby approach gives you a failover MP instance, ready for action
- Run an active-standby Management Plane pair with dedicated, external databases: If you don't want to rely on a single, shared database for an active-standby approach, you can operate one per MP instance and replicate changes from active to standby
- Run an active-standby Management Plane pair with dedicated, on-cluster databases: If you value the self-contained, appliance-like approach of installing all dependencies in the Management Plane cluster, you can operate an active-standby pair with on-cluster databases
In more detail:
Run a single Management Plane instance with an external Postgres database
This is the most common approach, taking into account that management plane failures are infrequent and do not affect dataplane traffic. With appropriate automation, a fresh Management Plane can be deployed in a matter of minutes.
This approach requires a reliable, external Postgres database, and users are very strongly recommended to maintain up-to-date backups in the event that the database fails and needs to be restored. Cloud providers offer scalable, reliable implementations of Postgres that can be suitable for use with low operational overhead.
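As an illustration of what this automation might look like, the sketch below re-applies Management Plane manifests from a Git checkout together with restored secrets. The repository URL, directory layout, namespace and backup path are assumptions for illustration; in practice this would normally run as a job in your existing CI/CD or GitOps tooling rather than as a standalone script.

```python
#!/usr/bin/env python3
"""Sketch: redeploy the Management Plane from versioned manifests plus a backup.

The repository URL, paths and namespace are illustrative assumptions.
"""
import subprocess
import tempfile

REPO = "https://git.example.com/platform/tsb-mp-install.git"  # assumed repo
MP_NAMESPACE = "tsb"                                          # assumed namespace
SECRETS_BACKUP = "/backups/latest/mp-secrets.yaml"            # from the backup job

def run(*cmd: str) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def main() -> None:
    with tempfile.TemporaryDirectory() as workdir:
        run("git", "clone", "--depth", "1", REPO, workdir)

        # Recreate the namespace and restore secrets/certificates from backup.
        run("kubectl", "create", "namespace", MP_NAMESPACE)
        run("kubectl", "apply", "-n", MP_NAMESPACE, "-f", SECRETS_BACKUP)

        # Apply the operator and Management Plane manifests held in Git
        # (assumed directory layout).
        run("kubectl", "apply", "-f", f"{workdir}/operator/")
        run("kubectl", "apply", "-f", f"{workdir}/managementplane/")

if __name__ == "__main__":
    main()
```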
Strengths
- Highly-reliable, lowest resource usage, typically easily enabled by existing automation approaches.
Limitations
- Requires the user to maintain a reliable Postgres database service, either using cloud-hosted or on-prem instances. Users often run a database cluster that is resilient against the failure of one database instance.
Run an active-standby Management Plane pair with an external Postgres database
This approach is suitable if you cannot easily deploy a new Management Plane instance or Management Plane cluster, or if you have DR requirements that mean you need to deploy redundant Management Planes in different locations.
Failover can be achieved by using a floating FQDN (DNS name) for the Management Plane, and switching it to the standby instance when required.
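As a sketch of the failover step itself, the example below repoints a floating FQDN using Amazon Route 53 via boto3. The hosted zone ID, record name and standby address are assumptions; the same UPSERT pattern applies to any DNS provider with an API, and a short TTL keeps the switchover time low.

```python
#!/usr/bin/env python3
"""Sketch: repoint the Management Plane FQDN at the standby instance.

Assumes Amazon Route 53; the zone ID, record name and address are illustrative.
"""
import boto3

HOSTED_ZONE_ID = "Z0123456789ABCDEFGHIJ"   # assumed Route 53 hosted zone
MP_FQDN = "tsb.example.com."               # floating Management Plane name
STANDBY_IP = "203.0.113.20"                # standby Management Plane endpoint
TTL = 60                                   # low TTL so failover takes effect quickly

def main() -> None:
    route53 = boto3.client("route53")
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": "Fail over TSB Management Plane to standby",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": MP_FQDN,
                    "Type": "A",
                    "TTL": TTL,
                    "ResourceRecords": [{"Value": STANDBY_IP}],
                },
            }],
        },
    )

if __name__ == "__main__":
    main()
```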
Strengths
- Allows for quick failover (limited by the TTL for the DNS name) with reduced complexity.
Limitations
- The shared Postgres database is a single point of failure; this is generally addressed by good operational practices to minimize risk of catastrophic failure, along with regular or real-time backups. For example, users often run a database cluster that is resilient against the failure of one database instance.
- Requires careful installation of the Management Plane, and an easily-modifiable DNS name.
- Additional cost of the idle, standby Management Plane installation.
Run an active-standby Management Plane pair with dedicated, external databases
This approach builds on the previous one, adding a dedicated database for each Management Plane instance.
You will need to implement some form of active replication between the database instances.
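One possible way to implement this is Postgres' built-in logical replication, sketched below with psycopg2. The connection strings and object names are assumptions; both instances must run with wal_level = logical, and the schema must already exist on the standby because logical replication copies rows, not schema.

```python
#!/usr/bin/env python3
"""Sketch: replicate the config database using Postgres logical replication.

Connection strings and object names are illustrative assumptions; both
instances must be configured with wal_level = logical.
"""
import psycopg2

ACTIVE_DSN = "host=pg-active.example.internal dbname=tsb user=postgres"
STANDBY_DSN = "host=pg-standby.example.internal dbname=tsb user=postgres"
# Connection string the standby uses to reach the active instance:
ACTIVE_CONNINFO = "host=pg-active.example.internal dbname=tsb user=replicator"

def main() -> None:
    # On the active instance: publish all tables in the config database.
    with psycopg2.connect(ACTIVE_DSN) as conn, conn.cursor() as cur:
        cur.execute("CREATE PUBLICATION tsb_pub FOR ALL TABLES;")

    # On the standby instance: subscribe to the active instance's publication.
    # CREATE SUBSCRIPTION cannot run inside a transaction block, so use autocommit.
    standby = psycopg2.connect(STANDBY_DSN)
    standby.autocommit = True
    with standby.cursor() as cur:
        cur.execute(
            "CREATE SUBSCRIPTION tsb_sub "
            f"CONNECTION '{ACTIVE_CONNINFO}' "
            "PUBLICATION tsb_pub;"
        )
    standby.close()

if __name__ == "__main__":
    main()
```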
Strengths
- By running two databases, you protect yourself from the impact of a catastrophic database failure in the active instance.
Limitations
- You need to implement and maintain replication from the active to standby database instances, for example by using Postgres' built-in streaming or logical replication.
- More resource intensive; you must account for the cost of the standby Management Plane instance, its database and the network traffic to replicate transactions.
- Requires careful installation of the Management Plane, and an easily-modifiable DNS name.
Run an active-standby Management Plane pair with dedicated, on-cluster databases
The TSB Management Plane optionally supports an embedded, on-cluster Postgres database, managed by the Kubegres operator. Kubegres provides local high availability, and you can further configure a 'standby' Management Plane instance that replicates from the active database.
This approach is commonly used for demonstration environments, but is not preferred for production environments because the complexity of hosting a database service on Kubernetes introduces additional failure modes and resource constraints. The approach may be suitable for users who require an appliance-like implementation of TSB, with no external dependencies.
Strengths
- Superficially simpler (no need to maintain an external database), although managing replication is an operational overhead.
Limitations
- On-cluster Postgres hosting has additional failure modes and resource constraints that may reduce the reliability of the database, particularly for large, active deployments.
- More resource intensive; you must account for the cost of the standby Management Plane instance, its database and the network traffic to replicate transactions.
- Requires careful installation of the Management Plane, and an easily-modifiable DNS name.
How to proceed
Before proceeding with this guide, determine which Management Plane HA strategy is most likely to be appropriate for you. Implementation can be complex, particularly when considering active-standby configurations and database replication.
You'll find elements throughout the guide that will be relevant for whichever approach you take, and which will help you customize the approach for your unique deployment environment.