Tetrate Service BridgeVersion: 1.10.x

Switch Over from an Active to a Standby Management Plane

For situations where High Availability (HA) of the Tetrate Management Plane is essential, you can run an active/standby pair of Management Plane instances. This guide provides an overview of the process, and you should refer to Tetrate Technical Support for assistance with this procedure.

This document describes how to set up for failover from one 'active' Management Plane deployment to another 'standby' instance. Although existing client-to-Management-Plane connections will be interrupted, clients will be able to reconnect and use the failover Management Plane instance without explicit reconfiguration.

The backup Management Plane instance runs in a mostly 'warm' configuration. If necessary, you may need to manually sync the current database contents from the active instance to the standby instance before performing failover. Failover is then performed by updating the DNS record for the MP instance to point to the new (warm) instance.

During failover, the following communication flows will remain resilient:

Control Planes to Management Plane
tctl CLI to Management Plane API
Access to the Management Plane UI

Prepare for Failover

Certificates

To support failover from the active Management Plane instance to the standby instance, first ensure that the internal root CA certificate is shared between both instances.

Do not use an auto-generated (e.g., by cert-manager) Root CA in each MP instance to procure TLS certificates for TSB, XCP, ElasticSearch and Postgres. If you have to use a self-signed CA, provide it explicitly as part of each Management Plane deployment.

caution

If the active and standby Management Planes both use an auto-generated Root CA, they will eventually get out-of-sync once one of MP(s) rotates its Root CA. As a result, clients will not be able to failover from the active to the standby Management Plane instance without additional configuration.

If possible, use a well-known 3rd party CA to procure TLS certificates for TSB, XCP, ElasticSearch and Postgres. Ensure that if a CA is rotated, this change is propagated to all Management Planes and Control Planes.

If you don't follow the above guidelines, limited High-Availabiltiy is still possible. To ensure seamless client failover, you need to set up an internal procedure to keep the certificate configuration of both k8s clusters in sync.

Configuration

Tetrate Management Plane configuration is stored in a Postgres database. You have two options:

a) Share the Database between installations: Both Management Plane instances can refer to the same, external Postgres Database. This removes the need to maintain a separate, standby database instance and ensures that failover can be quick if needed, but it does introduce another single point of failure. You could also take regular backups of the shared database so that you can restore it if needed, using the instructions below.

b) Each MP installation has a dedicated database: This is the standard behavior with the 'embedded database' configuration, where the management plane takes regular backups of the database. In the event if a failure, follow the instructions below to import the database backup to the new database instance.

Deploy a Replica Management Plane instance

These instructions begin with a single Management Plane instance, which is designated as the 'active' instance. A replica (copy) is deployed, which will act as the 'standby' instance in the event that the 'active' instance becomes unavailable.

Option A: Helm-based deployment model

Use the helm-based deployment when the active Management Plane was originally deployed using helm (the recommended method).

In the Kubernetes cluster of the original 'active' Management Plane:

Take a snapshot of the operational Kubernetes secrets. These secrets were auto-generated on first use:

kubectl get secrets -n tsb -o yaml iam-signing-key > source_mp_operational_secrets.yaml

In the Kubernetes cluster intended for the new 'standby' Management Plane:

Create a k8s namespace for the replica MP:

kubectl create ns tsb

Apply the operational secrets from the the original Management Plane instance:

kubectl apply -n tsb -f source_mp_operational_secrets.yaml

Install the replica Management Plane using the same Helm values that were used for the original Management Plane:

helm install mp tetrate-tsb-helm/managementplane \
  --version <tsb-version> \
  --namespace tsb  \
  --values source_mp_values.yaml \
  --timeout 10m \
  --set image.registry=<registry-location>

Option B: Generic deployment model

Use the 'generic' deployment method when the active Management Plane was originally deployed using the tctl CLI.

In the Kubernetes cluster of the original 'active' Management Plane:

Take a snapshot of the configurational and operational Kubernetes secrets. These secrets were auto-generated on first use:

kubectl get secrets -n tsb -o yaml admin-credentials azure-credentials custom-host-ca elastic-credentials es-certs iam-oidc-client-secret ldap-credentials postgres-credentials tsb-certs xcp-central-cert iam-signing-key > source_mp_all_secrets.yaml

In the Kubernetes cluster intended for the new 'standby' Management Plane:

Create a k8s namespace for the replica MP:

kubectl create ns tsb

Apply the secrets from the the original Management Plane instance:

kubectl apply -n tsb -f source_mp_all_secrets.yaml

Install the replica Management Plane using helm:

helm install mp tetrate-tsb-helm/managementplane \
  --version <tsb-version> \
  --namespace tsb  \
  --values dr_mp_values.yaml \
  --timeout 10m \
  --set image.registry=<registry-location>

... where dr_mp_values.yaml:

Should include the spec field
Should NOT include the secrets field (as secrets were installed in the previous step)

Restore the Management Plane Database contents

If the two management plane instances use different Postgres databases, you will need to backup and restore the database contents from the active to the failover instance.

Using a shared database?

If you are using a external database for the active Management Plane instance, and it is functioning correctly, you can use that instance for the standby Management Plane. There is no need to backup-and-restore to perform a failover.

You are advised to maintain regular backups in the event of a catastrophic database failure.

Backup the Existing Active Database
Follow the admin instructions to back up the PostgreSQL database:
- For an External database, use pg_dump to take a backup
- For an Embedded database, take a copy of a recent backup
Copy the backup to the desired storage location, for example, your local machine.
Scale down TSB and IAM services on the standby machine
Switch to Standby Cluster and Scale Down TSB and IAM:
Switch context to the Standby Management Plane (MP) cluster (kubectx or similar) then scale down the tsb and iam services:
```
kubectx standby-mp
kubectl scale deploy -n tsb tsb iam --replicas 0
```
This step is necessary to ensure no database writes occur during restoration

Create PVC and Pod for Restoration

Create a PersistentVolumeClaim (PVC) and pod to facilitate the restoration:

cat <<EOF > pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restore
  namespace: tsb
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard-rwo
  volumeMode: Filesystem
EOF

kubectl apply -f pvc.yaml

cat <<EOF > restore-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: restore
  namespace: tsb
spec:
  containers:
  - name: alpine
    image: alpine:latest
    command:
      - sleep
      - infinity
    volumeMounts:
    - name: restore-data
      mountPath: /var/lib/restore
      readOnly: false
  volumes:
  - name: restore-data
    persistentVolumeClaim:
      claimName: restore
EOF

kubectl apply -f restore-pod.yaml

Copy Backup File to PVC and Set Permissions

Copy the backup file to the PVC:

kubectl cp tsb-postgres-backup-26_09_2024_02_00_01.gz tsb/restore:/var/lib/restore/tsb-postgres-backup-26_09_2024_02_00_01.gz
kubectl exec -n tsb -it restore -- chown root:root /var/lib/restore/tsb-postgres-backup-26_09_2024_02_00_01.gz
kubectl exec -n tsb -it restore -- ls -l /var/lib/restore

Run Job to Restore PostgreSQL

Create and apply the restoration job:

cat <<EOF > restore-backup.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: restore-backup
  namespace: tsb
spec:
  backoffLimit: 1
  completions: 1
  parallelism: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: restore
        image: gcr.io/alclass1-ctkl-1/postgres:14.8-alpine3.18
        command:
          - sh
        args:
          - -c
          - |
            set -ex
            echo "Checking for backup file in /var/lib/restore"
            if [ -f /var/lib/restore/tsb-postgres-backup-26_09_2024_02_00_01.gz ]; then
              echo "Backup file found, decompressing"
              gzip -d /var/lib/restore/tsb-postgres-backup-26_09_2024_02_00_01.gz
            else
              echo "Backup file not found!"
              exit 1
            fi
            echo "Restoring PostgreSQL from the backup"
            psql "host=tsb-postgres dbname=postgres user=tsb password=Tetrate123 sslmode=verify-ca sslcert=/var/lib/postgresql/data/tls/tls.crt sslkey=/var/lib/postgresql/data/tls/tls.key sslrootcert=/var/lib/postgresql/data/tls/ca.crt" -f "/var/lib/restore/tsb-postgres-backup-26_09_2024_02_00_01"
        volumeMounts:
        - mountPath: /var/lib/postgresql/data/tls
          name: tsb-postgres-certs
          readOnly: true
        - mountPath: /var/lib/restore
          name: restore-volume
          readOnly: false
      volumes:
      - name: tsb-postgres-certs
        secret:
          defaultMode: 0600
          secretName: tsb-postgres-certs
      - name: restore-volume
        persistentVolumeClaim:
          claimName: restore  # Reference the same PVC as in the restore pod
          readOnly: false
EOF

kubectl apply -f restore-backup.yaml

Validate the Job

Check the status of the job and logs:

kubectl get jobs -n tsb restore-backup
NAME             STATUS     COMPLETIONS   DURATION   AGE
restore-backup   Complete   1/1           6s         3h58m

kubectl get pod -n tsb restore-backup-pgvrq
NAME                   READY   STATUS      RESTARTS   AGE
restore-backup-pgvrq   0/1     Completed   0          3h59m

kubectl logs -n tsb -l job-name=restore-backup
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
GRANT

Scale TSB and IAM Back Up on the Standby machine
Scale the TSB and IAM deployments back to normal:
```
kubectl scale deploy -n tsb tsb iam --replicas 1
```

Perform the Failover to the Standby Management Plane

Update DNS Record for Standby MP IP
Update the DNS record that you use you identify the Management Plane location, making it point to the new IP address for the Standby Management Plane instance.
Your Standby Management Plane instance is now your new Active Management Plane instance.
Restart the Edge Deployment in each Workload Cluster
Restart the edge deployment to re-resolve the management plane IP address, so that your clusters begin using the working Standby instance rather than the failed Active instance.
Switch to each workload cluster and restart the edge deployment:
```
kubectl rollout restart deployment -n istio-system edge
```
Validate Status in TSB UI
Go to the TSB UI and review the Clusters page. When each workload cluster connects to the new Management Plane, you will see its status and a last sync timestamp.
Validate Status using tctl
If preferred, you can use tctl to validate the status of each cluster:
tctl x status cluster my-cluster-id
```
NAME            STATUS    LAST EVENT      MESSAGE
my-cluster-id   READY     XCP_ACCEPTED    Cluster onboarded
```

Prepare for Failover​

Certificates​

Configuration​

Deploy a Replica Management Plane instance​

Option A: Helm-based deployment model​

Option B: Generic deployment model​

Restore the Management Plane Database contents​

Backup the Existing Active Database​

Scale down TSB and IAM services on the standby machine​

Create PVC and Pod for Restoration​

Copy Backup File to PVC and Set Permissions​

Run Job to Restore PostgreSQL​

Validate the Job​

Scale TSB and IAM Back Up on the Standby machine​

Perform the Failover to the Standby Management Plane​

Update DNS Record for Standby MP IP​

Restart the Edge Deployment in each Workload Cluster​

Validate Status in TSB UI​

Validate Status using tctl​

Prepare for Failover

Certificates

Configuration

Deploy a Replica Management Plane instance

Option A: Helm-based deployment model

Option B: Generic deployment model

Restore the Management Plane Database contents

Backup the Existing Active Database

Scale down TSB and IAM services on the standby machine

Create PVC and Pod for Restoration

Copy Backup File to PVC and Set Permissions

Run Job to Restore PostgreSQL

Validate the Job

Scale TSB and IAM Back Up on the Standby machine

Perform the Failover to the Standby Management Plane

Update DNS Record for Standby MP IP

Restart the Edge Deployment in each Workload Cluster

Validate Status in TSB UI

Validate Status using tctl