# Failover from Active to Standby

How to fail over from the Active Management Plane instance to the Standby, when using the embedded Postgres database.
The failover process from Active to Standby involves several steps:
- Disable the `tsb` and `iam` services on the Active instance so that it cannot receive further configuration updates
- Promote the Standby instance to become the new Active instance
- Update the floating FQDN that identifies the Management Plane location so that it points to the new Management Plane instance
- Trigger the control plane clients to fail over to the newly-active Management Plane instance
Once you have performed a failover, you will likely want to prepare a new standby instance. If the failed, previously-active instance is undamaged, it can be reconfigured as the new standby.
Ensure that:

- You are able to update the floating FQDN DNS record that identifies the Management Plane location. You will change it to be a CNAME for the new Management Plane FQDN (see the example after this list).
- The IAM signing key (`iam-signing-key`) is present and matches the one in the active management plane cluster. If the active cluster is not available, you can check against the `kid` field of the JWT tokens used by the control plane cluster's components.

  Get the iam-signing-key material:

  ```bash
  kubectl get secrets iam-signing-key -o yaml | yq .data.kid | base64 -d
  ```

- The XCP Central key (`tsb-iam-jwks`) is present and matches the one in the active management plane cluster (if possible to check).

  Get the tsb-iam-jwks material:

  ```bash
  kubectl get secrets tsb-iam-jwks -o yaml | yq .data.kid | base64 -d
  ```
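For the floating FQDN prerequisite, a quick way to confirm the current mapping is a `dig` lookup. This is a minimal sketch; `tsb.example.com` and `mp1.example.com` are hypothetical placeholders for your floating and permanent FQDNs:

```bash
# Hypothetical names: tsb.example.com is the floating FQDN,
# mp1.example.com is the permanent FQDN of the currently-active instance.
dig +short CNAME tsb.example.com
# mp1.example.com.
```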
## Perform the failover
Perform the failover as follows:
### Shut down the active Management Plane
As a precaution, if the active Management Plane is running, shut it down so that it does not process configuration updates:
```bash
kubectl scale deploy -n tsb tsb iam --replicas 0
```
After this change, you will not be able to access the UI or API on this management plane.
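As a quick sanity check, you can confirm that both deployments have scaled to zero; the output below is a sketch of what to expect:

```bash
kubectl get deploy -n tsb tsb iam
# NAME   READY   UP-TO-DATE   AVAILABLE   AGE
# tsb    0/0     0            0           9d
# iam    0/0     0            0           9d
```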
### Activate the standby Management Plane
Reconfigure the standby Management Plane so that it can receive configuration updates:
On the standby Management Cluster:

```bash
kubectl patch -n tsb managementplanes.install.tetrate.io managementplane --type=merge \
  -p '{"spec": {"highAvailability": {"active": {"exposeEmbeddedPostgresInFrontEnvoy": true}, "standby": null}}}'
```

Wait for the TSB operator to reconfigure the management plane and start the TSB services:

```bash
kubectl -n tsb wait --for=condition=available --timeout=600s deployment/tsb
```
The standby Management Plane is now running and ready to function.
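Because the patch sets `exposeEmbeddedPostgresInFrontEnvoy: true`, you can also confirm that the front envoy service now exposes the embedded Postgres port (5432) alongside 443:

```bash
# The PORT(S) column should now include 5432 in addition to 443.
kubectl get svc -n tsb envoy
```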
### Verify that the standby Management Plane is ready to take control
Log in to the standby Management Plane UI using the permanent FQDN for that Management Plane.
Verify that your Tetrate configuration is present in the Postgres database. Look for the cluster configurations (clusters will not have synced at this point) and the organizational structure (organization, tenants, workspaces) that you expect to see.
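If you prefer the command line, a sketch like the following can confirm the same thing with `tctl`, assuming `tctl` is already configured to point at the standby's permanent FQDN. The resource names follow the `tctl get cluster` pattern used later in this page; `my-tenant` is a placeholder, and flag names may vary by TSB version:

```bash
# List onboarded cluster configurations and the organizational structure.
tctl get cluster
tctl get tenant
tctl get workspace --tenant my-tenant
```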
### Update the floating DNS record to point to the standby Management Plane
Update the floating FQDN (DNS record) that you use to identify the Management Plane location. Make it a CNAME for the permanent FQDN of the newly-active Management Plane instance.
:::note Also update the permanent FQDN for the newly-active Management Plane instance
Verify that the permanent FQDN for the newly-active Management Plane instance is correct. Depending on your Kubernetes environment, the EXTERNAL-IP (or DNS) for the envoy service may have changed when the Management Plane instance was promoted to be active.

```bash
kubectl get svc -n tsb envoy
# NAME    TYPE           CLUSTER-IP       EXTERNAL-IP                               PORT(S)                        AGE
# envoy   LoadBalancer   10.100.204.204   9d9a4612bf5.eu-west-2.elb.amazonaws.com   443:30483/TCP,5432:32641/TCP   3m
```
:::

Allow time for the DNS change to propagate before proceeding.
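If you want to script the wait, a simple poll such as this sketch works (`tsb.example.com` and `mp2.example.com` are hypothetical placeholders; replace them with your own FQDNs):

```bash
# Poll until the floating FQDN resolves to the newly-active permanent FQDN.
until dig +short CNAME tsb.example.com | grep -q 'mp2.example.com'; do
  sleep 30
done
```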
### Trigger each Edge deployment to reconnect to the new Management Plane
Once the DNS change has propagated, you can trigger all Edge deployments to reconnect by terminating the front envoy on the old Management Plane:
On the old Management Cluster:

```bash
kubectl scale deploy -n tsb envoy --replicas 0
```
This will disconnect the Control Plane Edges from the old Management Plane; they will attempt to re-resolve the DNS name and reconnect. At this point, they should connect to the newly-active Management Plane.
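Alternatively, you can force the reconnect from the workload-cluster side by restarting the XCP edge deployment in each cluster. This sketch assumes the edge runs as deployment `edge` in the `istio-system` namespace, which may differ in your installation:

```bash
# Run against each workload (Control Plane) cluster.
kubectl rollout restart deployment/edge -n istio-system
```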
### Validate Status in TSB UI
Using the permanent FQDN for the new Management Plane, go to the TSB UI and review the Clusters page. When each workload cluster connects to the new Management Plane, you will see its status and a last sync timestamp.
:::note Use the permanent FQDN when testing
When testing and debugging, always use the permanent FQDN to make sure you are connecting to the correct Management Plane!
:::
### Validate Status using tctl
If preferred, you can use `tctl` to validate the status of each cluster. First, reconfigure `tctl` to talk to the correct Management Plane cluster. You can list the workload clusters (`tctl get cluster`) and inspect the status of each:

```bash
tctl status cluster my-cluster-id
# NAME            STATUS   LAST EVENT     MESSAGE
# my-cluster-id   READY    XCP_ACCEPTED   Cluster onboarded
```
## Prepare the new Standby
If you initiated a failover because of a failure within the previously-active Tetrate Management Plane, you should delete that instance, on the assumption that it is not recoverable and cannot be reused.
You can follow the installation instructions to install a new, standby Management Plane instance.
### Reuse the previously-active Management Plane instance
If you initiated a failover because of a failure external to the Tetrate Management Plane, such as a network connectivity issue, you may be able to re-use the installation as a new standby instance once the failure is resolved.
```bash
kubectl patch -n tsb managementplanes.install.tetrate.io managementplane --type=merge \
  -p '{"spec": {"highAvailability": {"active": null, "standby": {"activeMpSynchronizationEndpoint": {"host": "MP2_DNS", "port": "443", "selfSigned": true}}}}}'
```
:::warning Ensure that the value of `host` points to the correct, active management plane
It's good practice to use the permanent FQDN for the correct management plane, rather than the floating active FQDN, so as to avoid complications with DNS propagation.
:::
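After applying the patch, you can confirm that the standby configuration has been accepted; a minimal check reads back the `highAvailability` section of the same resource:

```bash
# The output should show only the standby settings, with
# activeMpSynchronizationEndpoint pointing at the active Management Plane.
kubectl get managementplanes.install.tetrate.io -n tsb managementplane \
  -o jsonpath='{.spec.highAvailability}'
```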
## Troubleshooting
If the failover does not complete, refer to the troubleshooting documentation for next steps.