Point-in-Time Recovery (PITR) with Embedded PostgreSQL
This document describes how to configure and perform Point-in-Time Recovery (PITR) for the TSB embedded PostgreSQL database using WAL (Write-Ahead Log) archiving.
This page refers to the embedded PostgreSQL installation only. If you are using a dedicated PostgreSQL installation, please refer to the Backup and restore PostgreSQL page.
Overview
Point-in-Time Recovery (PITR) allows you to restore a PostgreSQL database to any specific point in time, providing protection against:
- Accidental data deletion
- Logical corruption
- Application bugs that corrupt data
- Unwanted schema changes
PITR works by combining:
- Base Backup: A full physical backup of the database cluster created using pg_basebackup
- WAL Archives: A continuous archive of all changes (Write-Ahead Logs) made after the base backup
This approach allows you to restore the database to any point in time between the base backup and the end of available WAL archives, or to a specific transaction ID or named restore point.
Architecture
Components
The TSB PostgreSQL PITR setup uses the following components:
- Persistent Volume Claims (PVCs):
  - tsb-backup: Stores base backups created by the backup CronJob
  - tsb-postgres-wal-archive: Stores archived WAL files continuously
- Backup CronJob: Scheduled job that runs pg_basebackup to create base backups
- WAL Archiving: PostgreSQL continuously archives WAL files to the dedicated PVC
Data Flow
[Postgres Primary]
|
├─> WAL Files Generated
| |
| └─> Archived to /var/lib/postgresql/wal-archive/
| (mounted from tsb-postgres-wal-archive PVC)
|
└─> Base Backup (scheduled/manual)
|
└─> Stored in /var/lib/backup/
(mounted from tsb-backup PVC)
Prerequisites
Before you get started, make sure:
✓ You have installed and configured TSB.
✓ You have installed and configured kubectl to access the management cluster.
✓ You have administrative access to the TSB namespace.
✓ You have the base64 command available for decoding secrets.
For development and testing purposes, a Makefile with automation scripts is available in the TSB repository at deploy/dev/pitr/Makefile. This documentation provides all the necessary commands inline, so the Makefile is not required for production use.
Configuration
To enable PITR for the embedded PostgreSQL database, configure the following settings in your ManagementPlane specification:
spec:
dataStore:
embeddedPostgres:
name: tsb
replicas: 3
# Enable WAL archiving for PITR
walArchiveStorageSize: 10Gi
walArchiveStorageClassName: standard
walArchiveStorageAccessMode: ReadWriteOnce
# Base backup configuration
backupStorageSize: 50Gi
backupStorageClassName: standard
backupStorageAccessMode: ReadWriteOnce
backupCronSchedule: "0 2 * * *" # Daily at 2 AM
backupCleanupRetention: 432000s # Retain backups for 5 days
Configuration Parameters
| Parameter | Description | Default |
|---|---|---|
| walArchiveStorageSize | Size of the PVC for storing WAL archives | - |
| walArchiveStorageClassName | Storage class for the WAL archive PVC | standard |
| walArchiveStorageAccessMode | Access mode for the WAL archive PVC | ReadWriteOnce |
| backupStorageSize | Size of the PVC for storing base backups | - |
| backupStorageClassName | Storage class for the backup PVC | standard |
| backupStorageAccessMode | Access mode for the backup PVC | ReadWriteOnce |
| backupCronSchedule | Cron schedule for automatic base backups | 0 2 * * * |
| backupCleanupRetention | Retention period for old backups (in seconds) | 432000s (5 days) |
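If you manage the ManagementPlane resource directly with kubectl, you can add these fields in place. A minimal sketch, assuming the ManagementPlane resource lives in the tsb namespace (verify its actual name with the first command):
kubectl get managementplane -n tsb
kubectl edit managementplane -n tsb <name>
# add the walArchive* and backup* fields under spec.dataStore.embeddedPostgres as shown above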
Creating Base Backups
Automatic Backups
The backupCronSchedule option determines when base backups are created automatically. The backup CronJob is named backup-tsb-postgres.
You can verify the CronJob:
kubectl get cronjob -n tsb backup-tsb-postgres
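To see the jobs the CronJob has already created and when it last ran (assuming the default job history limits keep a few recent runs around):
kubectl get jobs -n tsb | grep backup-tsb-postgres
kubectl get cronjob -n tsb backup-tsb-postgres -o jsonpath='{.status.lastScheduleTime}'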
Manual Backups
To create a manual base backup immediately:
BACKUP_DATE=$(date +%s)
kubectl create job -n tsb --from=cronjob/backup-tsb-postgres manual-backup-$BACKUP_DATE
Watch the backup progress:
kubectl wait --for=condition=complete --timeout=300s -n tsb job/manual-backup-$BACKUP_DATE
kubectl logs -n tsb -l job-name=manual-backup-$BACKUP_DATE
Verifying Base Backups
To list available base backups, create a temporary pod to access the backup PVC:
kubectl run -n tsb -it backup-viewer --image=nicolaka/netshoot --rm --restart=Never \
--overrides='{
"spec": {
"containers": [{
"name": "backup-viewer",
"image": "nicolaka/netshoot",
"stdin": true,
"tty": true,
"volumeMounts": [{
"mountPath": "/var/lib/backup",
"name": "backup-volume"
}]
}],
"volumes": [{
"name": "backup-volume",
"persistentVolumeClaim": {
"claimName": "tsb-backup"
}
}]
}
}'
Inside the pod, list the backups:
ls -lh /var/lib/backup
You should see directories like tsb-postgres-base-backup-DD_MM_YYYY_HH_MM_SS.
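Each backup directory should contain the tar archives produced by pg_basebackup; the restore steps later in this document expect base.tar.gz and pg_wal.tar.gz to be present. For example, still inside the viewer pod:
ls -lh /var/lib/backup/tsb-postgres-base-backup-DD_MM_YYYY_HH_MM_SS/
# expected contents: base.tar.gz and pg_wal.tar.gz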
WAL Archiving
Monitoring WAL Archives
To view archived WAL files:
kubectl exec -n tsb tsb-postgres-1-0 -- ls -lh /var/lib/postgresql/wal-archive/
Forcing WAL File Switch
During normal operation, PostgreSQL switches to a new WAL file when the current one fills up (typically 16MB). To force an immediate WAL switch (useful for testing or ensuring recent changes are archived):
# Get the postgres password
POSTGRES_PASSWORD=$(kubectl get secret -n tsb postgres-credentials -o jsonpath='{.data.embedded-postgres-password}' | base64 -d)
# Force WAL switch
kubectl exec -n tsb tsb-postgres-1-0 -- \
bash -c "PGPASSWORD='${POSTGRES_PASSWORD}' psql -U tsb -d tsb -c 'SELECT pg_switch_wal();'"
Wait a few seconds, then verify the new WAL file was archived:
kubectl exec -n tsb tsb-postgres-1-0 -- ls -lh /var/lib/postgresql/wal-archive/
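You can also ask PostgreSQL itself whether archiving is keeping up by querying the standard pg_stat_archiver view; last_archived_wal and failed_count give a quick health check (this assumes the POSTGRES_PASSWORD variable from the previous step is still set):
kubectl exec -n tsb tsb-postgres-1-0 -- \
  bash -c "PGPASSWORD='${POSTGRES_PASSWORD}' psql -U tsb -d tsb -c 'SELECT last_archived_wal, last_archived_time, failed_count FROM pg_stat_archiver;'"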
Performing Point-in-Time Recovery
Step 1: Scale Down the Management Plane
Before starting the restore process, scale down TSB components to prevent database write activity:
kubectl scale deployment -n tsb tsb iam --replicas 0
Scaling down the TSB deployment only interrupts the ability to change configuration while restoration is in progress. It does not affect the data plane or running services.
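To confirm the management plane components are fully scaled down before proceeding:
kubectl get deployment -n tsb tsb iam
# both deployments should report 0/0 ready replicas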
Step 2: Scale Down PostgreSQL
Scale down the kubegres controller and PostgreSQL instances:
# First scale kubegres to single replica (removes replica pods)
kubectl patch kubegres -n tsb tsb-postgres --type=merge -p '{"spec":{"replicas":1}}'
kubectl wait -n tsb --for=delete --timeout=120s pod -l replicationRole=replica || true
# Scale down kubegres controller
kubectl scale deployment -n tsb kubegres-controller-manager --replicas=0
kubectl wait -n tsb --for=delete --timeout=120s pod -l app=kubegres-controller-manager
# Scale down primary statefulset
kubectl scale statefulset -n tsb -l replicationRole=primary --replicas=0
kubectl wait -n tsb --for=delete --timeout=120s pod -l replicationRole=primary
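Before moving on, confirm that no PostgreSQL pods are left running (pod names for the embedded database start with tsb-postgres):
kubectl get pods -n tsb | grep tsb-postgres || echo "no postgres pods running"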
Step 3: Deploy Restore Helper Pod
Get the PostgreSQL image version:
export POSTGRES_IMAGE=$(kubectl get kubegres -n tsb tsb-postgres -o jsonpath='{.spec.image}')
echo $POSTGRES_IMAGE
Create a helper pod with access to all required volumes:
cat <<EOF | kubectl apply -n tsb -f -
apiVersion: v1
kind: Pod
metadata:
name: postgres-restore
namespace: tsb
spec:
serviceAccountName: tsb-kubegres
securityContext:
runAsUser: 70 # postgres user
fsGroup: 70
containers:
- name: postgres
image: $POSTGRES_IMAGE
command: ["sleep", "infinity"]
volumeMounts:
- name: postgres-data
mountPath: /var/lib/postgresql/data
- name: backup-volume
mountPath: /var/lib/backup
- name: wal-archive
mountPath: /var/lib/postgresql/wal-archive
- name: tsb-postgres-certs
mountPath: /var/lib/postgresql/data/tls
readOnly: true
volumes:
- name: postgres-data
persistentVolumeClaim:
claimName: postgres-db-tsb-postgres-1-0
- name: backup-volume
persistentVolumeClaim:
claimName: tsb-backup
- name: wal-archive
persistentVolumeClaim:
claimName: tsb-postgres-wal-archive
- name: tsb-postgres-certs
secret:
secretName: tsb-postgres-certs
defaultMode: 0600
EOF
Wait for the pod to be ready:
kubectl wait -n tsb --for=condition=ready pod/postgres-restore --timeout=120s
Step 4: Prepare the Data Directory
Enter the restore helper pod:
kubectl exec -n tsb -it postgres-restore -- bash
Inside the pod, remove the current data directory:
This will delete all existing data in the PostgreSQL data directory. Ensure you have selected the correct base backup before proceeding. Alternatively, you can back up the existing data directory before deletion.
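If you want to keep a copy of the current data directory before deleting it, a minimal sketch run inside the restore pod (the archive name and its location on the backup PVC are illustrative):
tar czf /var/lib/backup/pgdata-pre-restore-$(date +%s).tar.gz -C /var/lib/postgresql/data pgdata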
rm -rf /var/lib/postgresql/data/pgdata/*
Step 5: Extract Base Backup
Choose a base backup to restore from (list available backups in /var/lib/backup/):
ls -lh /var/lib/backup/
Extract the chosen base backup:
cd /var/lib/postgresql/data/pgdata
# Replace with your actual backup directory name
BACKUP_DIR="tsb-postgres-base-backup-DD_MM_YYYY_HH_MM_SS"
tar xzf /var/lib/backup/$BACKUP_DIR/base.tar.gz
tar xzf /var/lib/backup/$BACKUP_DIR/pg_wal.tar.gz
Verify the extraction:
ls -lh /var/lib/postgresql/data/pgdata/
This should show the PostgreSQL data files and the pg_wal directory populated with WAL files.
Step 6: Configure Recovery
Create the recovery configuration file:
cat > /var/lib/postgresql/data/pgdata/postgresql.auto.conf << 'RECOVERY_EOF'
# Recovery configuration for PITR
restore_command = 'test -f /var/lib/postgresql/wal-archive/%f && cp /var/lib/postgresql/wal-archive/%f %p'
recovery_target_action = 'promote'
# Optional: Recover to specific point in time
# recovery_target_time = '2024-12-04 14:35:00+00'
# Or recover to end of WAL (default if not specified)
RECOVERY_EOF
Create the standby.signal file to trigger recovery mode (PostgreSQL 12+):
touch /var/lib/postgresql/data/pgdata/standby.signal
Verify the recovery configuration:
cat /var/lib/postgresql/data/pgdata/postgresql.auto.conf
ls -la /var/lib/postgresql/data/pgdata/standby.signal
Exit the pod:
exit
Step 7: Clean Up Restore Pod
Delete the restore helper pod:
kubectl delete pod -n tsb postgres-restore
Step 8: Scale Up PostgreSQL
Scale up the primary PostgreSQL instance:
kubectl scale statefulset -n tsb -l replicationRole=primary --replicas=1
kubectl wait -n tsb --for=condition=ready --timeout=180s pod -l replicationRole=primary
Watch the recovery process in the logs:
kubectl logs -n tsb -f tsb-postgres-1-0
Look for log messages indicating recovery progress:
LOG: starting point-in-time recovery to latest
LOG: restored log file "000000010000000000000002" from archive
LOG: redo starts at 0/2000028
LOG: consistent recovery state reached at 0/20000F8
LOG: restored log file "000000010000000000000003" from archive
The database will automatically promote itself when recovery is complete (due to recovery_target_action = 'promote'). Watch for these log messages indicating successful promotion:
LOG: selected new timeline ID: 2
LOG: archive recovery complete
LOG: database system is ready to accept connections
Step 9: Verify Database Promotion
Check that the database has been promoted and is accepting read-write connections:
# Get the postgres password
POSTGRES_PASSWORD=$(kubectl get secret -n tsb postgres-credentials -o jsonpath='{.data.embedded-postgres-password}' | base64 -d)
# Check if database is in recovery mode (should return 'f' for false after promotion)
kubectl exec -n tsb tsb-postgres-1-0 -- \
bash -c "PGPASSWORD='${POSTGRES_PASSWORD}' psql -U tsb -d tsb -c 'SELECT pg_is_in_recovery();'"
If the database shows it's still in recovery mode (t for true), you may need to manually promote it:
kubectl exec -n tsb tsb-postgres-1-0 -- \
bash -c "pg_ctl promote -D /var/lib/postgresql/data/pgdata"
After promotion (automatic or manual), verify the database is ready:
kubectl logs -n tsb tsb-postgres-1-0 --tail=20
You should see:
LOG: database system is ready to accept connections
Step 10: Scale Up Replicas and Controller
Once the primary is running and recovered, scale up the kubegres controller and replicas:
# Scale up kubegres controller
kubectl scale deployment -n tsb kubegres-controller-manager --replicas=1
kubectl wait -n tsb --for=condition=available --timeout=120s deployment/kubegres-controller-manager
# Scale up replicas
kubectl patch kubegres -n tsb tsb-postgres --type=merge -p '{"spec":{"replicas":3}}'
kubectl wait -n tsb --for=condition=ready --timeout=180s pod -l replicationRole=replica
Step 11: Scale Up Management Plane
Finally, scale up the TSB components:
kubectl scale deployment -n tsb tsb iam --replicas 1
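Optionally, wait for both deployments to become available again:
kubectl wait -n tsb --for=condition=available --timeout=180s deployment/tsb deployment/iam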
Verification
After the restore is complete, verify that the database is functioning correctly.
First, get the PostgreSQL password:
POSTGRES_PASSWORD=$(kubectl get secret -n tsb postgres-credentials -o jsonpath='{.data.embedded-postgres-password}' | base64 -d)
Check Database Connectivity
kubectl exec -n tsb tsb-postgres-1-0 -- \
bash -c "PGPASSWORD='${POSTGRES_PASSWORD}' psql -U tsb -d tsb -c 'SELECT version();'"
Verify Data
Check that your data is present at the expected recovery point. For example, to list tenants:
kubectl exec -n tsb tsb-postgres-1-0 -- \
bash -c "PGPASSWORD='${POSTGRES_PASSWORD}' psql -U tsb -d tsb -c 'SELECT name FROM tenant;'"
Check Replication Status
Verify that replication to replicas is working:
kubectl exec -n tsb tsb-postgres-1-0 -- \
bash -c "PGPASSWORD='${POSTGRES_PASSWORD}' psql -U tsb -d tsb -c 'SELECT * FROM pg_stat_replication;'"
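Each replica should appear in the output with state set to streaming once it has caught up. For a more compact view, you can select just the relevant columns (these are standard pg_stat_replication columns):
kubectl exec -n tsb tsb-postgres-1-0 -- \
  bash -c "PGPASSWORD='${POSTGRES_PASSWORD}' psql -U tsb -d tsb -c 'SELECT application_name, client_addr, state, sync_state FROM pg_stat_replication;'"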
Advanced Recovery Options
Recovery to Specific Point in Time
To recover to a specific timestamp instead of the end of available WAL logs, modify the recovery configuration in Step 6:
cat > /var/lib/postgresql/data/pgdata/postgresql.auto.conf << 'RECOVERY_EOF'
# Recovery configuration for PITR
restore_command = 'test -f /var/lib/postgresql/wal-archive/%f && cp /var/lib/postgresql/wal-archive/%f %p'
recovery_target_action = 'promote'
# Recover to specific point in time (adjust timezone as needed)
recovery_target_time = '2024-12-04 14:35:00+00'
RECOVERY_EOF
Recovery to Specific Transaction ID
If you know the transaction ID you want to recover to:
cat > /var/lib/postgresql/data/pgdata/postgresql.auto.conf << 'RECOVERY_EOF'
# Recovery configuration for PITR
restore_command = 'test -f /var/lib/postgresql/wal-archive/%f && cp /var/lib/postgresql/wal-archive/%f %p'
recovery_target_action = 'promote'
# Recover to specific transaction ID
recovery_target_xid = '12345'
RECOVERY_EOF
Recovery to Named Restore Point
You can create named restore points before critical operations:
# Get the postgres password
POSTGRES_PASSWORD=$(kubectl get secret -n tsb postgres-credentials -o jsonpath='{.data.embedded-postgres-password}' | base64 -d)
# Create a named restore point
kubectl exec -n tsb tsb-postgres-1-0 -- \
bash -c "PGPASSWORD='${POSTGRES_PASSWORD}' psql -U tsb -d tsb -c \"SELECT pg_create_restore_point('before_critical_operation');\""
Then during recovery, use the named restore point:
cat > /var/lib/postgresql/data/pgdata/postgresql.auto.conf << 'RECOVERY_EOF'
# Recovery configuration for PITR
restore_command = 'test -f /var/lib/postgresql/wal-archive/%f && cp /var/lib/postgresql/wal-archive/%f %p'
recovery_target_action = 'promote'
# Recover to named restore point
recovery_target_name = 'before_critical_operation'
RECOVERY_EOF
Troubleshooting
Recovery Fails with "requested timeline is not a child of this server's history"
This error indicates a timeline mismatch. Ensure you're using the correct base backup and WAL files from the same timeline.
Solution: Start with a fresh base backup and ensure the WAL archive directory contains all necessary WAL files from that backup forward.
"could not open file: No such file or directory"
This error during recovery means a required WAL file is missing from the archive.
Solution:
- Check that WAL archiving was properly configured and running
- Verify all WAL files are present in /var/lib/postgresql/wal-archive/ (see the check below)
- You may need to recover to an earlier point in time where all WAL files are available
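The recovery log line names the WAL segment PostgreSQL could not find. As a sketch, you can check whether that segment exists in the archive from any pod that mounts the tsb-postgres-wal-archive PVC (re-create the restore helper pod from Step 3 if you have already deleted it), replacing the segment name with the one from your logs:
kubectl exec -n tsb postgres-restore -- ls /var/lib/postgresql/wal-archive/ | grep 000000010000000000000003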
Recovery Stops at "consistent recovery state reached" but doesn't complete
If recovery appears to hang after reaching a consistent state, check the recovery logs for errors.
Solution:
- Check the pod logs: kubectl logs -n tsb tsb-postgres-1-0
- Verify that recovery_target_action is set to promote
- Ensure all referenced WAL files are accessible
Insufficient Space for WAL Archives
If the WAL archive PVC fills up, archiving will fail and PITR coverage will be limited.
Solution:
- Increase walArchiveStorageSize in the ManagementPlane spec
- Implement WAL archive cleanup policies (see the example below)
- Consider moving old archives to external storage
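As an illustration of a cleanup policy, the pg_archivecleanup utility that ships with PostgreSQL removes WAL segments older than a given file. Only remove segments that predate the oldest base backup you still intend to restore from; the segment name below is a placeholder:
kubectl exec -n tsb tsb-postgres-1-0 -- \
  pg_archivecleanup /var/lib/postgresql/wal-archive 000000010000000000000010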
Best Practices
- Regular Base Backups: Ensure base backups run regularly (daily recommended)
- Monitor WAL Archive Space: Set up alerts for WAL archive PVC usage
- Test Recovery Procedures: Periodically test your PITR process in a non-production environment
- Document Recovery Points: Before critical operations, create named restore points
- Retain Multiple Base Backups: Keep at least 3-5 base backups for flexibility
- Archive Offline Copies: Periodically copy base backups and WAL archives to external storage
- Monitor Backup Jobs: Set up alerts for backup job failures