# Failover from Active to Standby

How to fail over from the Active Management Plane instance to the Standby, when using the embedded Postgres database.
The failover process from Active to Standby involves several steps:
- Disable the `tsb` and `iam` services on the Active instance so that it cannot receive further configuration updates
- Promote the Standby instance to become the new Active instance
- Update the floating FQDN that identifies the Management Plane location so that it points to the new Management Plane instance
- Trigger the control plane clients to fail over to the newly-active Management Plane instance
Once you have performed a failover, you will likely want to prepare a new standby instance. If the failed, previously-active instance is undamaged, it can be reconfigured as the new standby.
Ensure that:

- You are able to update the floating FQDN DNS record that identifies the Management Plane location. You will change it to be a CNAME for the new Management Plane FQDN (see the example after this list).
- The IAM signing key (`iam-signing-key`) is present and matches the one in the active management plane cluster. If the active cluster is not available, you can check against the `kid` field of the JWT tokens used by the control plane cluster's components.

  Get the iam-signing-key material:

  ```bash
  kubectl get secrets iam-signing-key -o yaml | yq .data.kid | base64 -d
  ```

- The XCP Central key (`tsb-iam-jwks`) is present and matches the one in the active management plane cluster (if possible to check).

  Get the tsb-iam-jwks material:

  ```bash
  kubectl get secrets tsb-iam-jwks -o yaml | yq .data.kid | base64 -d
  ```
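For the floating FQDN prerequisite, a quick way to confirm the current mapping is a `dig` lookup. This is a minimal sketch; `tsb.example.com` and `mp1.example.com` are hypothetical placeholders for your floating and permanent FQDNs:

```bash
# Hypothetical names: tsb.example.com is the floating FQDN,
# mp1.example.com is the permanent FQDN of the currently-active instance.
dig +short CNAME tsb.example.com
# mp1.example.com.
```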
## Perform the failover
Perform the failover as follows:
### Shut down the active Management Plane
As a precaution, if the active Management Plane is running, shut it down so that it does not process configuration updates:
```bash
kubectl scale deploy -n tsb tsb iam --replicas 0
```
After this change, you will not be able to access the UI or API on this management plane.
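As a quick sanity check, you can confirm that both deployments have scaled to zero; the output below is a sketch of what to expect:

```bash
kubectl get deploy -n tsb tsb iam
# NAME   READY   UP-TO-DATE   AVAILABLE   AGE
# tsb    0/0     0            0           9d
# iam    0/0     0            0           9d
```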
### Activate the standby Management Plane
Reconfigure the standby Management Plane so that it can receive configuration updates:
On the standby Management Cluster:

```bash
kubectl patch -n tsb managementplanes.install.tetrate.io managementplane --type=merge \
  -p '{"spec": {"highAvailability": {"active": {"exposeEmbeddedPostgresInFrontEnvoy": true}, "standby": null}}}'
```

Wait for the TSB operator to reconfigure the management plane and start the TSB services:

```bash
kubectl -n tsb wait --for=condition=available --timeout=600s deployment/tsb
```
The standby Management Plane is now running and ready to function.
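Because the patch sets `exposeEmbeddedPostgresInFrontEnvoy: true`, you can also confirm that the front envoy service now exposes the embedded Postgres port (5432) alongside 443:

```bash
# The PORT(S) column should now include 5432 in addition to 443.
kubectl get svc -n tsb envoy
```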
### Verify that the standby Management Plane is ready to take control
Log in to the standby Management Plane UI using the permanent FQDN for that Management Plane.
Verify that your Tetrate configuration is present in the Postgres database. Look for the cluster configurations (clusters will not have synced at this point) and the organizational structure (organization, tenants, workspaces) that you expect to see.
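If you prefer the command line, a sketch like the following can confirm the same thing with `tctl`, assuming `tctl` is already configured to point at the standby's permanent FQDN. The resource names follow the `tctl get cluster` pattern used later in this page; `my-tenant` is a placeholder, and flag names may vary by TSB version:

```bash
# List onboarded cluster configurations and the organizational structure.
tctl get cluster
tctl get tenant
tctl get workspace --tenant my-tenant
```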
### Update the floating DNS record to point to the standby Management Plane
Update the floating FQDN (DNS record) that you use to identify the Management Plane location. Make it a CNAME for the permanent FQDN of the newly-active Management Plane instance.
:::note Also update the permanent FQDN for the newly-active Management Plane instance
Verify that the permanent FQDN for the newly-active Management Plane instance is correct. Depending on your Kubernetes environment, the EXTERNAL-IP (or DNS) for the envoy service may have changed when the Management Plane instance was promoted to be active.

```bash
kubectl get svc -n tsb envoy
# NAME    TYPE           CLUSTER-IP       EXTERNAL-IP                               PORT(S)                        AGE
# envoy   LoadBalancer   10.100.204.204   9d9a4612bf5.eu-west-2.elb.amazonaws.com   443:30483/TCP,5432:32641/TCP   3m
```
:::

Allow time for the DNS change to propagate before proceeding.
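If you want to script the wait, a simple poll such as this sketch works (`tsb.example.com` and `mp2.example.com` are hypothetical placeholders; replace them with your own FQDNs):

```bash
# Poll until the floating FQDN resolves to the newly-active permanent FQDN.
until dig +short CNAME tsb.example.com | grep -q 'mp2.example.com'; do
  sleep 30
done
```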
### Trigger each Edge deployment to reconnect to the new Management Plane
Once the DNS change has propagated, you can trigger all Edge deployments to reconnect by terminating the front envoy on the old Management Plane:
On the old Management Cluster:

```bash
kubectl scale deploy -n tsb envoy --replicas 0
```
This will disconnect the Control Plane Edges from the old Management Plane; they will attempt to re-resolve the DNS name and reconnect. At this point, they should connect to the newly-active Management Plane.
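Alternatively, you can force the reconnect from the workload-cluster side by restarting the XCP edge deployment in each cluster. This sketch assumes the edge runs as deployment `edge` in the `istio-system` namespace, which may differ in your installation:

```bash
# Run against each workload (Control Plane) cluster.
kubectl rollout restart deployment/edge -n istio-system
```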
### Validate Status in TSB UI
Using the permanent FQDN for the new Management Plane, go to the TSB UI and review the Clusters page. When each workload cluster connects to the new Management Plane, you will see its status and a last sync timestamp.
:::note Use the permanent FQDN when testing
When testing and debugging, always use the permanent FQDN to make sure you are connecting to the correct Management Plane!
:::
### Validate Status using tctl
If preferred, you can use `tctl` to validate the status of each cluster. First, reconfigure `tctl` to talk to the correct Management Plane cluster. You can list the workload clusters (`tctl get cluster`) and inspect the status of each:

```bash
tctl status cluster my-cluster-id
# NAME            STATUS   LAST EVENT     MESSAGE
# my-cluster-id   READY    XCP_ACCEPTED   Cluster onboarded
```
## Prepare the new Standby
If you initiated a failover because of a failure within the previously-active Tetrate Management Plane, you should delete that instance, on the assumption that it is not recoverable and cannot be reused.
You can follow the installation instructions to install a new, standby Management Plane instance.
### Reuse the previously-active Management Plane instance
If you initiated a failover because of a failure external to the Tetrate Management Plane, such as a network connectivity issue, you may be able to re-use the installation as a new standby instance once the failure is resolved.
```bash
kubectl patch -n tsb managementplanes.install.tetrate.io managementplane --type=merge \
  -p '{"spec": {"highAvailability": {"active": null, "standby": {"activeMpSynchronizationEndpoint": {"host": "MP2_DNS", "port": "443", "selfSigned": true}}}}}'
```
:::warning Ensure that the value of `host` points to the correct, active management plane
It's good practice to use the permanent FQDN for the correct management plane, rather than the floating active FQDN, so as to avoid complications with DNS propagation.
:::
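After applying the patch, you can confirm that the standby configuration has been accepted; a minimal check reads back the `highAvailability` section of the same resource:

```bash
# The output should show only the standby settings, with
# activeMpSynchronizationEndpoint pointing at the active Management Plane.
kubectl get managementplanes.install.tetrate.io -n tsb managementplane \
  -o jsonpath='{.spec.highAvailability}'
```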
## Troubleshooting
If the failover does not complete, refer to the troubleshooting documentation for next steps.