
Automatic Synchronization from Active to Standby

How to automate backup-and-restore from Active to Standby, using the embedded Postgres database.

Please note

This capability is released as 'Beta' and implementations may change. Please work with Tetrate Technical Support if you wish to implement this in a production environment.

The solution provided here configures a standby Management Plane instance to actively pull database changes from an active instance. If the active instance fails and cannot easily be recovered, the standby instance is ready to become the new Management Plane. The failover process is manual, and should be performed as a last resort.

Embedded or External Postgres Database?

These instructions are relevant if you use the embedded Postgres database in each Management Plane. Please refer to Tetrate Technical Support if you plan to follow them, as these general instructions may need to be adapted for your local environment.

Alternatively:

  • If you use external Postgres databases, dedicated to each Management Plane instance, you may build your own backup-and-restore or synchronization process. You can use the documentation here as guidelines for installing the Management Planes, and refer to your Postgres documentation to understand how best to replicate from active to standby.

  • If you use an external Postgres database that is shared between the active and standby Management Plane, you do not need to sync configuration, but you should maintain regular backups.

The Workflow - Deploy new Active and Standby Management Planes

The workflow explains how to deploy a new active Management Plane from scratch, and then deploy a standby Management Plane. It explains the certificate hierarchy you will need, and how to use DNS for the failover:

  • Permanent FQDNs (DNS names) are used for each Management Plane instance, for example: mp1.tsb.example.com, mp2.tsb.example.com, mp3.tsb.example.com etc.
  • A Floating FQDN (DNS name) is used to access the currently-active Management plane, for example: active.tsb.example.com. This name is a CNAME for the active Management Plane FQDN, and is updated when you fail over to a new Management Plane
Other failover methods are possible

Using a floating FQDN is a convenient way to perform failover, and is documented in this example. Other failover methods are possible, such as to deploy the Management Plane instances behind a load balancer, and to perform failover from active to standby by changing the load balancer routes.

Also note that you can avoid the need to run active and standby instances of the Tetrate Management Plane by instead using an external Postgres database with appropriate redundancy/availability, and taking regular backups of both the Postgres configuration and the secrets needed to re-install the management plane.

Adding a standby to an existing Management Plane

If you are currently running a Tetrate Management Plane, and wish to deploy a standby instance, please refer to Tetrate technical support. They can advise on the most appropriate strategy, either updating the existing Management Plane with the new certificate hierarchy and secrets, or deploying a fresh Management Plane and importing the configuration.

Deploy the Active Management Plane instance

You should first create an appropriate certificate hierarchy for the active Management Plane. You can create the certificates for the standby instance now, or at a later stage.

You then deploy the active Management Plane, configure the DNS names, and start to onboard clusters.

Finally, prepare for the addition of a standby Management Plane by configuring the active Management Plane to expose the embedded Postgres database to appropriately authenticated connections.

  1. Create the PKI (certificate) hierarchy

    Automated Certificate Management

    These instructions use TSB's certificate automation to generate certificates. Review this document before proceeding, and adapt the instructions if you are using a different way to generate and manage certificates.

    Take a note of the intended DNS FQDNs for the active Management Plane:

    • The permanent FQDN, such as mp1.tsb.example.com
    • The floating FQDN, such as active.tsb.example.com

    Create the following certificate/key pairs:

    • The root CA, which is used to issue additional TSB certificates and forms the root of trust
    • The Management Plane front-envoy cert (tsb-certs), signed by the root CA. This cert must contain two SANs, for the permanent and floating FQDNs
    • The XCP Central leaf cert (xcp-central-cert), signed by the root CA. This cert must also contain two SANs, for the permanent and floating FQDNs

    At this point, you may also wish to create the front-envoy and xcp central leaf certificate/key pairs for the standby Management Plane, e.g. mp2.tsb.example.com. These should also contain two SANs, for the appropriate permanent and floating FQDNs.

  2. Install the active Management Plane instance

    Install the active Management Plane using helm. The helm values file must contain the following settings:

    TSB Secrets
    secrets:
      tsb:
        cert: |
    $(awk '{printf "      %s\n", $0}' < tsb_certs.crt)
        key: |
    $(awk '{printf "      %s\n", $0}' < tsb_certs.key)
      xcp:
        autoGenerateCerts: false
        central:
          cert: |
    $(awk '{printf "        %s\n", $0}' < xcp-central-cert.crt)
          key: |
    $(awk '{printf "        %s\n", $0}' < xcp-central-cert.key)
        rootca: |
    $(awk '{printf "      %s\n", $0}' < ca.crt)

    ... along with other secrets and values you require.
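    The $(awk ...) substitutions above assume that the values file is generated through a shell heredoc, with awk indenting each PEM file to the correct YAML depth. A minimal, self-contained illustration of the pattern, using dummy file names:

```shell
# Sketch: inline a (dummy) PEM file into YAML at the right indentation.
# dummy.crt and values-snippet.yaml are illustrative names.
cat > dummy.crt <<'EOF'
-----BEGIN CERTIFICATE-----
MIIBexample
-----END CERTIFICATE-----
EOF

# The unquoted heredoc expands $(awk ...), indenting each line by 6 spaces
# so it sits under the 'cert: |' block scalar
cat > values-snippet.yaml <<EOF
secrets:
  tsb:
    cert: |
$(awk '{printf "      %s\n", $0}' < dummy.crt)
EOF

cat values-snippet.yaml
```

    The same pattern, with deeper indentation in the printf format string, produces the xcp.central entries.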

    Expose Postgres
    spec:
      highAvailability:
        active:
          exposeEmbeddedPostgresInFrontEnvoy: true

    The active.exposeEmbeddedPostgresInFrontEnvoy setting will expose the embedded Postgres instance on port 5432, so that the Standby instance can connect and synchronize. Clients will need to use the service account created later to bootstrap their access and obtain the necessary credentials.

  3. DNS name for Management Plane endpoint

    Configure the following DNS entries for your active Management Plane:

    • The Permanent FQDN mp1.tsb.example.com must resolve to the external IP address of the envoy service in the Management Plane installation
    • The Floating FQDN active.tsb.example.com must be a CNAME for the permanent FQDN mp1.tsb.example.com

    Verify that you can log in to the Management Plane instance using both the permanent and floating FQDNs.

  4. Onboard Control Plane clusters

    You can now begin onboarding Control Plane clusters to the active Management Plane.

    When you do so, use the following values in your helm values:

    Values for the Control Plane installation
    secrets:
      elasticsearch:
        cacert: |
    $(awk '{printf "      %s\n", $0}' < ca.crt)
      tsb:
        cacert: |
    $(awk '{printf "      %s\n", $0}' < ca.crt)
      xcp:
        rootca: |
    $(awk '{printf "      %s\n", $0}' < ca.crt)
    spec:
      managementPlane:
        clusterName: $CLUSTER_NAME
        host: $ACTIVE_FQDN # Important - this must be the floating FQDN, e.g. 'active.tsb.example.com'
        port: 443
        selfSigned: true

    You can also start onboarding services and creating the TSB hierarchy and configuration.

  5. Create a service account with read permissions

    Use tctl or any other approach to create a service account on the active Management Plane. Retain the file tsb-standby-sa-b64e, which contains the service account's private key:

    tctl sa create tsb-standby-sa | base64 -w0 > tsb-standby-sa-b64e

    Grant the service account org-reader permissions:

    Use the appropriate MYORG name for your installation
    tctl get ab organizations/MYORG -o yaml \
    | yq '.spec.allow += [{"role":"rbac/org-reader","subjects":[{"serviceAccount":"organizations/MYORG/serviceaccounts/tsb-standby-sa"}]}]' \
    | tctl apply -f -

Your active Management Plane is ready to use, and you can proceed at any point to install and configure a standby Management Plane instance.

Install a standby Management Plane instance

Next, install a standby Management Plane instance that synchronizes its configuration from the active instance. This standby instance will run in a limited mode, with no UI or API for managing configuration.

The standby instance will use a different permanent FQDN, such as mp2.tsb.example.com.

  1. Generate the Certificates for the standby Management Plane

    If you did not generate additional certificates earlier, you should generate them now.

    Make note of the intended permanent FQDN for the standby Management Plane.

    Using the previously-generated root CA cert and key, issue two new certificate key/pairs:

    • The Management Plane front-envoy cert (tsb-certs), signed by the root CA. This cert must contain two SANs, for the permanent and floating FQDNs
    • The XCP Central leaf cert (xcp-central-cert), signed by the root CA. This cert must also contain two SANs, for the permanent and floating FQDNs
  2. Import the Secrets

    Before installing the standby Management Plane, acquire the necessary secrets from the active Management Plane:

    Export the secrets from the active management plane
    kubectl get secrets -n tsb -o yaml iam-signing-key > source_mp_operational_secrets.yaml

    Pre-configure the standby Management Plane with these secrets:

    Import to the standby Management Plane
    kubectl create ns tsb
    kubectl apply -f source_mp_operational_secrets.yaml

    Note that the tsb-iam-jwks token is required, but will be generated on the standby Management Plane from the iam-signing-key.

    Using the previously-saved tsb-standby-sa-b64e file, add the tsb-standby-sa service account secret:

    Import to the standby management plane
    TSB_STANDBY_SA_JWK=`cat tsb-standby-sa-b64e`
    kubectl -n tsb apply -f - <<EOF
    apiVersion: v1
    kind: Secret
    metadata:
      name: tsb-standby-sa
    data:
      jwk: $TSB_STANDBY_SA_JWK
    type: Opaque
    EOF
  3. Deploy the Standby instance

    Deploy a standby Management Plane instance into the tsb namespace on the standby cluster. If necessary, first install the same pre-requisites (e.g. cert-manager).

    Start with the same basic helm values as used for the active Management Plane instance, and make two important changes:

    Certificates: Make sure to use the new tsb-certs and xcp-central-cert for this installation.

    Standby Mode: Do not use the spec.highAvailability settings that were added in the active installation. Instead, use the following lines in the spec stanza:

    helm configuration for standby
    spec:
      ...
      highAvailability:
        standby:
          activeMpSynchronizationEndpoint:
            host: $PRIMARY_ENVOY_ENDPOINT # the permanent DNS name or IP address of the active Management Plane instance
            port: 443 # the port exposed by front-envoy in the active Management Plane instance, typically 443
            selfSigned: true # needed if the cert is self-signed or signed by a private CA

    Perform the Installation: Install the standby Management Plane instance using helm, to the new standby cluster.

    When installing in Standby mode, only a limited set of services will run. You will not be able to use the API or UI for this management plane. The Postgres database will attempt to synchronize from the nominated Active instance (the $PRIMARY_ENVOY_ENDPOINT).

    Follow the logs from the management plane operator to observe the reconfiguration and catch any errors:

    kubectl logs -n tsb -l name=tsb-operator -f

You have now installed a standby Management Plane instance that is inactive, but is synchronizing state with the active Management Plane instance.

Perform a Failover

The failover process works by updating the floating FQDN that identifies the Management Plane location, so that it points to the new Management Plane instance. Control Plane clients will start using the new instance when they re-resolve the DNS name.

Ensure that:

  • You are able to update the floating FQDN DNS address that identifies the Management Plane location. You will change it to be a CNAME for the new Management Plane FQDN

  • The IAM signing key (iam-signing-key) is present and matches the one in the active management plane cluster. If the active cluster is not available, you can check against the kid field of the JWT tokens used by the control plane clusters' components

    Get the iam-signing-key material
    kubectl get secrets iam-signing-key -o yaml | yq .data.kid | base64 -d
  • The XCP Central key (tsb-iam-jwks) is present, and matches the one in the active management plane cluster (if possible to check).

    Get the tsb-iam-jwks material
    kubectl get secrets tsb-iam-jwks -o yaml | yq .data.kid | base64 -d
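    To illustrate what the yq pipelines above extract, here is a self-contained sketch that simulates the secret as a local manifest (the key ID example-key-id is hypothetical) and decodes the kid field without a live cluster:

```shell
# Simulated secret manifest - 'example-key-id' is a hypothetical key ID
cat > iam-signing-key.yaml <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: iam-signing-key
data:
  kid: $(printf 'example-key-id' | base64)
EOF

# Extract and decode the kid field (equivalent to: yq .data.kid | base64 -d)
grep 'kid:' iam-signing-key.yaml | awk '{print $2}' | base64 -d
```

    When comparing across clusters, run the checklist commands against each cluster and confirm that the decoded values match.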

Perform the failover as follows:

  1. Shutdown the active Management Plane

    As a precaution, if the active Management Plane is running, shut it down so that it does not receive configuration updates:

    kubectl scale deploy -n tsb tsb iam --replicas 0

    After this change, you will not be able to access the UI or API on this management plane.

  2. Activate the standby Management Plane

    Reconfigure the standby Management Plane so that it can receive configuration updates:

    On the standby Management Cluster
    kubectl --kubeconfig standby.yaml patch \
    -n tsb \
    managementplanes.install.tetrate.io managementplane \
    -p '{"spec":{"highAvailability":{"standby": null }}}' --type=merge

    Wait for the TSB operator to reconfigure the management plane and start the TSB services:

    On the standby Management Cluster
    kubectl -n tsb wait --for=condition=available --timeout=600s deployment/tsb

    The standby Management Plane is now running and ready to function.

  3. Verify that the standby Management Plane is ready to take control

    Log in to the standby Management Plane UI using the permanent FQDN for that Management Plane.

    Verify that your Tetrate configuration is present in the Postgres database. Look for the cluster configurations (clusters will not have synced at this point) and the organizational structure (organization, tenants, workspaces) that you expect to see.

  4. Update the floating DNS Record to point to the standby Management Plane

    Update the floating FQDN (DNS record) that you use to identify the Management Plane location. Make it a CNAME for the permanent FQDN of the standby Management Plane instance.

    Allow time for the DNS change to propagate before proceeding.
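    In zone-file terms, the change looks like the following sketch (names are the examples used in this document; the TTL is illustrative):

```text
; before failover: the floating name points at the old active instance
active.tsb.example.com.  300  IN  CNAME  mp1.tsb.example.com.

; after failover: repoint it at the newly-activated standby
active.tsb.example.com.  300  IN  CNAME  mp2.tsb.example.com.
```

    A short TTL on the floating record reduces propagation delay; check your DNS provider's settings before relying on a quick cutover.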

  5. Trigger each Edge deployment to reconnect to the new Management Plane

    Once the DNS change has propagated, you can trigger all Edge deployments to reconnect by terminating the front envoy on the old Management Plane:

    On the old Management Cluster
    kubectl scale deploy -n tsb envoy --replicas 0

    This will disconnect the Control Plane Edges from the old Management Plane; they will re-resolve the DNS name and reconnect, this time to the newly-active Management Plane.

  6. Validate Status in TSB UI

    Using the permanent FQDN for the new Management Plane, go to the TSB UI and review the Clusters page. When each workload cluster connects to the new Management Plane, you will see its status and a last sync timestamp.

    Use the permanent FQDN when testing

    When testing and debugging, always use the permanent FQDN to make sure you are connecting to the correct Management Plane!

  7. Validate Status using tctl

    If preferred, you can use tctl to validate the status of each cluster.

    First, reconfigure tctl to talk to the correct Management Plane cluster.

    You can list the workload clusters (tctl get cluster) and inspect the status of each:

    tctl status cluster my-cluster-id
    NAME            STATUS   LAST EVENT     MESSAGE
    my-cluster-id   READY    XCP_ACCEPTED   Cluster onboarded

Troubleshooting

If the failover does not complete, refer to the troubleshooting documentation for next steps.

Additional notes

How does the standby Management Plane connect and synchronize?

These docs present the auto-configuration method, where the standby Management Plane is supplied with the service account details so that it can connect to an auto-configuration API endpoint. The standby then receives the necessary additional credentials and configuration so that the standby Postgres database can securely connect to port 5432 and begin to replicate.

Alternatively, you can configure the standby with the necessary certificates and credentials manually, and then configure the standby database to connect and replicate using embeddedPostgresReplicationSettings.

Final Steps

A failover operation is a last-resort, if it's not possible to recover the active Management Plane instance quickly. In most situations, you should delete the old, failed Management Plane and not attempt to reuse it.

You may wish to deploy another standby Management Plane for your newly-active instance:

  • Update the Active Management Plane CR (kubectl edit -n tsb managementplane/managementplane), setting spec.highAvailability.active.exposeEmbeddedPostgresInFrontEnvoy: true
  • Deploy a new, standby Management Plane instance, following the instructions above