Tetrate Service Bridge, version 1.12.x

Failover from Active to Standby

How to automate backup-and-restore from Active to Standby, using the embedded Postgres database.

The failover process from Active to Standby involves several steps:

  • Disable the tsb and iam services on the Active instance so that it cannot receive further configuration updates
  • Promote the standby instance to become the new active instance
  • Update the floating FQDN that identifies the Management Plane location so that it points to the new Management Plane instance
  • Trigger the control plane clients to fail over to the newly-active Management Plane instance

Once you have performed a failover, you will likely want to prepare a new standby instance. If the failed, previously-active instance is undamaged, it can be reconfigured as the new standby.

Ensure that:

  • You are able to update the floating FQDN DNS address that identifies the Management Plane location. You will change it to be a CNAME for the new Management Plane FQDN

  • The IAM signing key (iam-signing-key) is present and matches the one in the active management plane cluster. If the active cluster is not available, you can check it against the kid field of the JWT tokens used by the control plane clusters' components

    Get the iam-signing-key material
    kubectl get secrets iam-signing-key -o yaml | yq .data.kid | base64 -d
  • The XCP Central key (tsb-iam-jwks) is present, and matches the one in the active management plane cluster (if possible to check).

    Get the tsb-iam-jwks material
    kubectl get secrets tsb-iam-jwks -o yaml | yq .data.kid | base64 -d
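When both clusters are reachable, the key checks above can be scripted as a direct comparison. A minimal sketch, assuming kubeconfig contexts named `active` and `standby` (hypothetical names; substitute your own contexts):

```shell
# Sketch: compare a key ID fetched from two clusters. The context names
# "active" and "standby" are hypothetical placeholders.
compare_kid() {
  # $1 = key ID from the active cluster, $2 = key ID from the standby cluster
  if [ "$1" = "$2" ]; then
    echo "kid match"
  else
    echo "kid MISMATCH"
  fi
}

# In a real environment the values would come from the secrets, e.g.:
#   ACTIVE_KID=$(kubectl --context active get secrets iam-signing-key -o yaml \
#     | yq .data.kid | base64 -d)
#   STANDBY_KID=$(kubectl --context standby get secrets iam-signing-key -o yaml \
#     | yq .data.kid | base64 -d)
compare_kid "a1b2c3" "a1b2c3"
```

The same comparison applies to the tsb-iam-jwks key material.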

Perform the failover

Perform the failover as follows:

  1. Shut down the active Management Plane

    As a precaution, if the active Management Plane is running, shut it down so that it does not process configuration updates:

    kubectl scale deploy -n tsb tsb iam --replicas 0

    After this change, you will not be able to access the UI or API on this management plane.
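Before moving on, it can help to confirm the shutdown took effect. A small sketch; the live kubectl read requires a cluster, so it is shown as a comment:

```shell
# Sketch: verify a deployment has been scaled to zero replicas.
check_scaled_down() {
  # $1 = deployment name, $2 = observed replica count
  if [ "$2" -eq 0 ]; then
    echo "$1 scaled down"
  else
    echo "$1 still running ($2 replicas)"
  fi
}

# In a real environment:
#   REPLICAS=$(kubectl get deploy -n tsb tsb -o jsonpath='{.spec.replicas}')
check_scaled_down tsb 0
check_scaled_down iam 0
```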

  2. Activate the standby Management Plane

    Reconfigure the standby Management Plane so that it can receive configuration updates:

    On the standby Management Cluster
    kubectl patch -n tsb managementplanes.install.tetrate.io managementplane --type=merge \
    -p '{"spec": {"highAvailability": {"active": {"exposeEmbeddedPostgresInFrontEnvoy": true}, "standby": null}}}'

    Wait for the TSB operator to reconfigure the management plane and start the TSB services:

    On the standby Management Cluster
    kubectl -n tsb wait --for=condition=available --timeout=600s deployment/tsb

    The standby Management Plane is now running and ready to function.
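You can also confirm that the patch took effect by reading the highAvailability block back from the ManagementPlane resource: after promotion, the standby block should be gone. A sketch, with the live kubectl read shown as a comment:

```shell
# Sketch: a promoted instance should have no standby block left in its spec.
check_promoted() {
  # $1 = contents of .spec.highAvailability.standby (empty once promoted)
  if [ -z "$1" ]; then
    echo "promoted: standby block cleared"
  else
    echo "not promoted yet"
  fi
}

# In a real environment:
#   STANDBY=$(kubectl get -n tsb managementplanes.install.tetrate.io managementplane \
#     -o jsonpath='{.spec.highAvailability.standby}')
check_promoted ""
```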

  3. Verify that the standby Management Plane is ready to take control

    Log in to the standby Management Plane UI using the permanent FQDN for that Management Plane.

    Verify that your Tetrate configuration is present in the Postgres database. Look for the cluster configurations (clusters will not have synced at this point) and the organizational structure (organization, tenants, workspaces) that you expect to see.

  4. Update the floating DNS Record to point to the standby Management Plane

    Update the floating FQDN (DNS record) that you use to identify the Management Plane location. Make it a CNAME for the permanent FQDN of the newly-active Management Plane instance.

    Also update the permanent FQDN for the newly-active Management Plane instance.

    Verify that the permanent FQDN for the newly-active Management Plane instance is correct. Depending on your Kubernetes environment, the EXTERNAL-IP (or DNS) for the envoy service may have changed when the Management Plane instance was promoted to be active.

    kubectl get svc -n tsb envoy
    # NAME    TYPE           CLUSTER-IP       EXTERNAL-IP                               PORT(S)                        AGE
    # envoy   LoadBalancer   10.100.204.204   9d9a4612bf5.eu-west-2.elb.amazonaws.com   443:30483/TCP,5432:32641/TCP   3m

    Allow time for the DNS change to propagate before proceeding.
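One way to wait for propagation is to poll the floating FQDN until it resolves to the new target. A sketch using hypothetical names (tsb.example.com as the floating FQDN, mp2.example.com as the newly-active instance); the real lookup would use something like `dig +short CNAME tsb.example.com`:

```shell
# Sketch: poll until the floating FQDN resolves to the expected target.
wait_for_dns() {
  # $1 = expected CNAME target, $2 = a command that prints the current target
  for attempt in 1 2 3; do
    current=$("$2")
    if [ "$current" = "$1" ]; then
      echo "DNS propagated"
      return 0
    fi
    # in practice, pause between checks, e.g. `sleep 30`
  done
  echo "DNS not yet propagated"
  return 1
}

# Stand-in for a real resolver such as: dig +short CNAME tsb.example.com
fake_resolver() { echo "mp2.example.com."; }
wait_for_dns "mp2.example.com." fake_resolver
```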

  5. Trigger each Edge deployment to reconnect to the new Management Plane

    Once the DNS change has propagated, you can trigger all Edge deployments to reconnect by terminating the front envoy on the old Management Plane:

    On the old Management Cluster
    kubectl scale deploy -n tsb envoy --replicas 0

    This will disconnect the Control Plane Edges from the old Management Plane; they will re-resolve the DNS and reconnect, this time to the newly-active Management Plane instance.

  6. Validate Status in TSB UI

    Using the permanent FQDN for the new Management Plane, go to the TSB UI and review the Clusters page. When each workload cluster connects to the new Management Plane, you will see its status and a last sync timestamp.

    Use the permanent FQDN when testing

    When testing and debugging, always use the permanent FQDN to make sure you are connecting to the correct Management Plane!

  7. Validate Status using tctl

    If preferred, you can use tctl to validate the status of each cluster.

    First, reconfigure tctl to talk to the newly-active Management Plane cluster.

    You can list the workload clusters (tctl get cluster) and inspect the status of each:

    tctl status cluster my-cluster-id
    NAME            STATUS    LAST EVENT      MESSAGE
    my-cluster-id   READY     XCP_ACCEPTED    Cluster onboarded
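The per-cluster check can be scripted to flag clusters that have not yet synced. A sketch; in practice the name and status values would come from `tctl status cluster <id>` output:

```shell
# Sketch: report any cluster whose status is not READY after the failover.
check_cluster() {
  # $1 = cluster name, $2 = status reported by tctl
  if [ "$2" = "READY" ]; then
    echo "$1: ok"
  else
    echo "$1: attention needed ($2)"
  fi
}

check_cluster "my-cluster-id" "READY"
```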

Prepare the new Standby

If you initiated a failover because of a failure within the previously-active Tetrate Management Plane, you should delete that instance: it was not recoverable, so it cannot be reused.

You can follow the installation instructions to install a new, standby Management Plane instance.

Reuse the previously-active Management Plane instance

If you initiated a failover because of a failure external to the Tetrate Management Plane, such as a network connectivity issue, you may be able to re-use the installation as a new standby instance once the failure is resolved.

On the previously-active Management Cluster
kubectl patch -n tsb managementplanes.install.tetrate.io managementplane --type=merge \
-p '{"spec": {"highAvailability": {"active": null, "standby": {"activeMpSynchronizationEndpoint": { "host": "MP2_DNS", "port": "443", "selfSigned": true }}}}}'

:::warn Ensure that the value of host points to the correct, active management plane

It's good practice to use the permanent FQDN for the correct management plane, rather than the floating active FQDN, so as to avoid complications with DNS propagation.

:::
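After applying the patch, it is worth reading the endpoint back to confirm the host is the one you intended. A sketch; MP2_DNS is the placeholder from the patch above, and the live kubectl read is shown as a comment:

```shell
# Sketch: verify the standby's synchronization endpoint host.
check_sync_host() {
  # $1 = configured host, $2 = expected host
  if [ "$1" = "$2" ]; then
    echo "sync host OK"
  else
    echo "sync host MISMATCH: $1"
  fi
}

# In a real environment:
#   HOST=$(kubectl get -n tsb managementplanes.install.tetrate.io managementplane \
#     -o jsonpath='{.spec.highAvailability.standby.activeMpSynchronizationEndpoint.host}')
check_sync_host "MP2_DNS" "MP2_DNS"
```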

Troubleshooting

If the failover does not complete, refer to the troubleshooting documentation for next steps.