
Failover from one MP to another MP

How to fail over from one MP instance to another.

Fail over to an alternate Management Plane

The failover process functions by updating the DNS address that identifies the Management Plane location, so that it points to the new Management Plane instance. Clients will start using the new instance when they re-resolve the DNS name.
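
For example, assuming a hypothetical FQDN of tsb.example.com for your Management Plane, you can check which IP address clients currently resolve, and the TTL that governs how quickly a change propagates:

dig +noall +answer tsb.example.com

A low TTL on this record shortens the failover window, because clients re-resolve the name sooner.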

Ensure that:

  • You are able to update the DNS address that identifies the Management Plane location
  • The new Management Plane has up-to-date configuration and is ready to take control
  1. Shut down the current Management Plane instance and activate the new Management Plane instance

    If necessary, shut down the current Management Plane so that it does not receive configuration updates:

    On the current Management Cluster
    kubectl scale deploy -n tsb tsb iam --replicas 0
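
    As a quick check, you can confirm that the old instance has stopped by verifying that both deployments report 0 ready replicas:

    On the current Management Cluster
    kubectl get deploy -n tsb tsb iam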

    Similarly, start the new Management Plane so that it can receive configuration updates:

    On the new Management Cluster
    kubectl scale deploy -n tsb tsb iam --replicas 1

    Suspend the restore job (if present) on the new Management Cluster so that it does not attempt to write to the Postgres database:

    On the new Management Cluster
    kubectl patch cronjobs tsb-restore -n tsb -p '{"spec" : {"suspend" : true }}'
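
    As a quick check on the new cluster, confirm that the deployments have scaled up and that the restore job is suspended (the SUSPEND column should read True):

    On the new Management Cluster
    kubectl get deploy -n tsb tsb iam
    kubectl get cronjob tsb-restore -n tsb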
  2. Verify that the new Management Plane is ready to take control

    Log in to the new Management Plane UI:

    • Verify that your Tetrate configuration is present in the Postgres database; look for cluster configurations (clusters will not have synced at this point) and the organizational structure (organization, tenants, workspaces) that you expect to see
    • If you expect historical data, check that it is present in Elasticsearch
  3. Update the DNS Record to point to the new Management Plane

    Update the DNS record that you use to identify the Management Plane location, so that it points to the IP address of the new Management Plane instance.

    Propagation may take time. Once the change has propagated, verify that you can access the Management Plane UI using the updated FQDN.
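
    For example, assuming the hypothetical FQDN tsb.example.com and the default TSB UI port of 8443 (adjust both to match your installation), you can confirm that the record resolves to the new IP address and that the UI responds:

    dig +short tsb.example.com
    curl -skI https://tsb.example.com:8443 | head -n 1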

  4. Provoke each Edge cluster to reconnect to the new Management Plane

    If possible, shut down the envoy service on the old Management Plane instance:

    On the old Management Cluster
    kubectl scale deploy -n tsb envoy --replicas 0

    This should be sufficient to provoke each Edge cluster to reconnect to the new Management Plane.
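
    You can confirm that the old instance's envoy has stopped by checking that its deployment reports 0 ready replicas:

    On the old Management Cluster
    kubectl get deploy -n tsb envoy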

    Force a reconnect

    If you need to manually force an Edge cluster to reconnect, restart the edge deployment to re-resolve the management plane IP address. This will provoke the cluster to begin using the new, working instance rather than the previous instance.

    Switch to each workload cluster and restart the edge deployment:

    kubectl rollout restart deployment -n istio-system edge
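
    You can watch each restart complete using the standard rollout status check:

    kubectl rollout status deployment -n istio-system edge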
  5. Validate Status in TSB UI

    Go to the TSB UI and review the Clusters page. When each workload cluster connects to the new Management Plane, you will see its status and a last sync timestamp.

  6. Validate Status using tctl

    If preferred, you can use tctl to validate the status of each cluster:

    tctl x status cluster my-cluster-id
    NAME            STATUS    LAST EVENT      MESSAGE
    my-cluster-id   READY     XCP_ACCEPTED    Cluster onboarded
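
    To check several clusters in one pass, you can loop over your cluster IDs (cluster-1 and cluster-2 are placeholders for your own cluster names):

    for cluster in cluster-1 cluster-2; do
      tctl x status cluster "$cluster"
    done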

With a successful restore of the new Management Plane, you will have fully recovered from the failure, and your Workload Clusters will be under the control of the new Management Plane instance.

Final Steps

A failover operation is a last resort, used when it is not possible to recover the current Management Plane instance quickly. The old, failed Management Plane will contain a snapshot of the previous configuration and should not be reused.

You may wish to deploy another standby Management Plane for your newly-active instance, and prepare to perform the failover operation again should your new Management Plane instance ever fail.