Tetrate Service BridgeVersion: next

Manual Backup and Restore from Active to Standby

How to manually backup the Active configuration, and restore to the Standby instance, using the embedded Postgres database.

The solution provided here is an alternative to the automated synchronization approach.

Monitor backups from the active Management Plane instance

You should take regular backups of the active Management Plane instance.

Backup the Existing Active Database
Follow the admin instructions to back up the PostgreSQL database:
- For an External database, use pg_dump to take a backup
- For an Embedded database, take a copy of a recent backup
Copy the backup to the desired storage location, for example, your local machine.
Prevent Updates to the Database
If the Management Plane that uses the database is active and receiving changes, you can make it inactive to prevent changes to the database. You should do this if you plan to deactivate this management plane, so that configuration changes are rejected rather than accepted and lost.
On the active Management cluster
kubectl scale deploy -n tsb tsb iam --replicas 0

Prepare a standby Management Plane instance

You can deploy a standby Management Plane instance and perform regular database restores to this instance, so that you can fail over quickly, or you can install the Management Plane instance on demand and import the most recent database backup.

When you deploy the standby Management Plane, it must have the same configuration so that it can smoothly take control in place of the failed, Active instance:

Management Plane Deployment: Both MPs are installed using Helm with the same values file.
PostgreSQL Configuration: The PostgreSQL secret must match in each Management Plane instance
Certificates & Secrets: Both MPs use the same set of certificates, the same iam-signing-key secret, and the same authentication tokens for the Elastic database.

Option 1: Deploy a Replica Management Plane instance using Helm
Use the helm-based deployment when the active Management Plane was originally deployed using helm (the recommended method).
In the Kubernetes cluster of the original 'active' Management Plane:
1. Take a snapshot of the operational Kubernetes secrets. These secrets were auto-generated on first use:
```
kubectl get secrets -n tsb -o yaml iam-signing-key > source_mp_operational_secrets.yaml
```
In the Kubernetes cluster intended for the new 'standby' Management Plane:
1. Create a k8s namespace for the replica MP:
```
kubectl create ns tsb
```
1. Apply the operational secrets from the the original Management Plane instance:
```
kubectl apply -n tsb -f source_mp_operational_secrets.yaml
```
1. Install the replica Management Plane using the same Helm values that were used for the original Management Plane:
```
helm install mp tetrate-tsb-helm/managementplane \
  --version <tsb-version> \
  --namespace tsb  \
  --values source_mp_values.yaml \
  --timeout 10m \
  --set image.registry=<registry-location>
```
Ensure that the front Envoy certificate and key, and the root CA and key are provided, for example through the Helm values.
Option 2: Generic deployment model
Use the 'generic' deployment method when the active Management Plane was originally deployed using the tctl CLI.
In the Kubernetes cluster of the original 'active' Management Plane:
1. Take a snapshot of the configurational and operational Kubernetes secrets. These secrets were auto-generated on first use:
```
kubectl get secrets -n tsb -o yaml admin-credentials azure-credentials custom-host-ca elastic-credentials es-certs iam-oidc-client-secret ldap-credentials postgres-credentials tsb-certs xcp-central-cert iam-signing-key > source_mp_all_secrets.yaml
```
In the Kubernetes cluster intended for the new 'standby' Management Plane:
1. Create a k8s namespace for the replica MP:
```
kubectl create ns tsb
```
1. Apply the secrets from the the original Management Plane instance:
```
kubectl apply -n tsb -f source_mp_all_secrets.yaml
```
1. Install the replica Management Plane using helm:
```
helm install mp tetrate-tsb-helm/managementplane \
  --version <tsb-version> \
  --namespace tsb  \
  --values dr_mp_values.yaml \
  --timeout 10m \
  --set image.registry=<registry-location>
```
... where dr_mp_values.yaml:
- Should include the spec field
- Should NOT include the secrets field (as secrets were installed in the previous step)
Disable the standby Management Plane (optional)
You may wish to disable the standby Management Plane so that it does not accidentally receive configuration updates that are intended for the Active instance:
```
kubectl scale deploy -n tsb tsb iam --replicas 0
```
Limited Health Checks
Note that if you do this, you will not be able to health-check the services within the management plane, as access is gated using the iam service. You can only health-check the envoy service that fronts the Management Plane services.
You will need to reactivate these Management Plane services before starting a failover operation.

Perform a restore to the standby Management Plane

At any point, you can restore the database backup to the standby Management Plane.

Scale down TSB and IAM services on the standby machine
For safety, before performing a restore, you should prevent the Standby Management Plane instance from accepting configuration changes.
Switch context to the Standby Management Plane (MP) cluster (kubectx or similar) then scale down the tsb and iam services:
```
kubectl scale deploy -n tsb tsb iam --replicas 0
```

Create PVC and Pod for Restoration

On the standby Management Plane cluster, create a PersistentVolumeClaim (PVC) and pod to perform the restoration:

cat <<EOF > pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restore
  namespace: tsb
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard-rwo
  volumeMode: Filesystem
EOF

kubectl apply -f pvc.yaml

cat <<EOF > restore-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: restore
  namespace: tsb
spec:
  containers:
  - name: alpine
    image: alpine:latest
    command:
      - sleep
      - infinity
    volumeMounts:
    - name: restore-data
      mountPath: /var/lib/restore
      readOnly: false
  volumes:
  - name: restore-data
    persistentVolumeClaim:
      claimName: restore
EOF

kubectl apply -f restore-pod.yaml

Copy Backup File to PVC and Set Permissions

Copy the backup file to the PVC:

kubectl cp tsb-postgres-backup-26_09_2024_02_00_01.gz tsb/restore:/var/lib/restore/tsb-postgres-backup-26_09_2024_02_00_01.gz
kubectl exec -n tsb -it restore -- chown root:root /var/lib/restore/tsb-postgres-backup-26_09_2024_02_00_01.gz
kubectl exec -n tsb -it restore -- ls -l /var/lib/restore

Run Job to Restore PostgreSQL

Create and apply the restoration job:

cat <<EOF > restore-backup.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: restore-backup
  namespace: tsb
spec:
  backoffLimit: 1
  completions: 1
  parallelism: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: restore
        image: gcr.io/alclass1-ctkl-1/postgres:14.8-alpine3.18
        command:
          - sh
        args:
          - -c
          - |
            set -ex
            echo "Checking for backup file in /var/lib/restore"
            if [ -f /var/lib/restore/tsb-postgres-backup-26_09_2024_02_00_01.gz ]; then
              echo "Backup file found, decompressing"
              gzip -d /var/lib/restore/tsb-postgres-backup-26_09_2024_02_00_01.gz
            else
              echo "Backup file not found!"
              exit 1
            fi
            echo "Restoring PostgreSQL from the backup"
            psql "host=tsb-postgres dbname=postgres user=tsb password=Tetrate123 sslmode=verify-ca sslcert=/var/lib/postgresql/data/tls/tls.crt sslkey=/var/lib/postgresql/data/tls/tls.key sslrootcert=/var/lib/postgresql/data/tls/ca.crt" -f "/var/lib/restore/tsb-postgres-backup-26_09_2024_02_00_01"
        volumeMounts:
        - mountPath: /var/lib/postgresql/data/tls
          name: tsb-postgres-certs
          readOnly: true
        - mountPath: /var/lib/restore
          name: restore-volume
          readOnly: false
      volumes:
      - name: tsb-postgres-certs
        secret:
          defaultMode: 0600
          secretName: tsb-postgres-certs
      - name: restore-volume
        persistentVolumeClaim:
          claimName: restore  # Reference the same PVC as in the restore pod
          readOnly: false
EOF

kubectl apply -f restore-backup.yaml

Validate the Job

Check the status of the job and logs:

kubectl get jobs -n tsb restore-backup

NAME             STATUS     COMPLETIONS   DURATION   AGE
restore-backup   Complete   1/1           6s         3h58m

kubectl get pod -n tsb restore-backup-pgvrq

NAME                   READY   STATUS      RESTARTS   AGE
restore-backup-pgvrq   0/1     Completed   0          3h59m

kubectl logs -n tsb -l job-name=restore-backup
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
GRANT

Scale TSB and IAM Back Up on the Standby machine
Optionally, scale the TSB and IAM deployments back to normal:
```
kubectl scale deploy -n tsb tsb iam --replicas 1
```
This will allow you to monitor the health of the standby Management Plane.

Fail over to the Standby Management Plane

The failover process functions by updating the DNS address that identifies the Management Plane location, so that it points to the new Management Plane instance. Clients will start using the new instance when they re-resolve the DNS name.

Ensure that:

You are able to update the DNS address that identifies the Management Plane location
The standby Management Plane has up-to-date configuration and is ready to take control

Shutdown the active Management Plane and activate the standby Management Plane
If necessary, shutdown the active Management Plane so that it does not receive configuration updates:
```
kubectl scale deploy -n tsb tsb iam --replicas 0
```
Similarly, start the standby Management Plane so that it can receive configuration updates:
```
kubectl scale deploy -n tsb tsb iam --replicas 1
```
Suspend the restore job (if present) on the standby Cluster so that it does not attempt to write to the Postgres database:
```
kubectl patch cronjobs tsb-restore -n tsb -p '{"spec" : {"suspend" : true }}'
```
Verify that the Standby Management Plane is ready to take control
Log in to the Standby Management Plane UI:
- Verify that your Tetrate configuration is present in the Postgres database; look for cluster configurations (clusters will not have synced at this point) and the organizational structure (organization, tenants, workspaces) that you expect to see
- Check the Elastic historical data if available (if expected)
Update the DNS Record to point to the Standby Management Plane
Update the DNS record that you use you identify the Management Plane location, making it point to the new IP address for the Standby Management Plane instance.
Your Standby Management Plane instance is now your new Active Management Plane instance.
Restart the Edge Deployment in each Workload Cluster
Restart the edge deployment to re-resolve the management plane IP address, so that your clusters begin using the working Standby instance rather than the failed Active instance.
Switch to each workload cluster and restart the edge deployment:
```
kubectl rollout restart deployment -n istio-system edge
```
Validate Status in TSB UI
Go to the TSB UI and review the Clusters page. When each workload cluster connects to the new Management Plane, you will see its status and a last sync timestamp.
Validate Status using tctl
If preferred, you can use tctl to validate the status of each cluster:
tctl x status cluster my-cluster-id
```
NAME            STATUS    LAST EVENT      MESSAGE
my-cluster-id   READY     XCP_ACCEPTED    Cluster onboarded
```

Final Steps

A failover operation is a last-resort, if it's not possible to recover the active Management Plane instance quickly. For this reason, you should delete the old, failed Management Plane and do not attempt to reuse it.

You may wish to deploy another standby Management Plane for your newly-active instance, and prepare to perform the failover operation again should your new Management Plane instance ever fail.

Monitor backups from the active Management Plane instance​

Backup the Existing Active Database​

Prepare a standby Management Plane instance​

Option 1: Deploy a Replica Management Plane instance using Helm​

Option 2: Generic deployment model​

Disable the standby Management Plane (optional)​

Perform a restore to the standby Management Plane​

Scale down TSB and IAM services on the standby machine​

Create PVC and Pod for Restoration​

Copy Backup File to PVC and Set Permissions​

Run Job to Restore PostgreSQL​

Validate the Job​

Scale TSB and IAM Back Up on the Standby machine​

Fail over to the Standby Management Plane​

Shutdown the active Management Plane and activate the standby Management Plane​

Verify that the Standby Management Plane is ready to take control​

Update the DNS Record to point to the Standby Management Plane​

Restart the Edge Deployment in each Workload Cluster​

Validate Status in TSB UI​

Validate Status using tctl​

Final Steps​

Monitor backups from the active Management Plane instance

Backup the Existing Active Database

Prepare a standby Management Plane instance

Option 1: Deploy a Replica Management Plane instance using Helm

Option 2: Generic deployment model

Disable the standby Management Plane (optional)

Perform a restore to the standby Management Plane

Scale down TSB and IAM services on the standby machine

Create PVC and Pod for Restoration

Copy Backup File to PVC and Set Permissions

Run Job to Restore PostgreSQL

Validate the Job

Scale TSB and IAM Back Up on the Standby machine

Fail over to the Standby Management Plane

Shutdown the active Management Plane and activate the standby Management Plane

Verify that the Standby Management Plane is ready to take control

Update the DNS Record to point to the Standby Management Plane

Restart the Edge Deployment in each Workload Cluster

Validate Status in TSB UI

Validate Status using tctl

Final Steps