
Point-in-Time Recovery (PITR) with Embedded PostgreSQL

This document describes how to configure and perform Point-in-Time Recovery (PITR) for the TSB embedded PostgreSQL database using WAL (Write-Ahead Log) archiving.

note

This page refers to the embedded PostgreSQL installation only. If you are using a dedicated PostgreSQL installation, please refer to the Backup and restore PostgreSQL page.

Overview

Point-in-Time Recovery (PITR) allows you to restore a PostgreSQL database to any specific point in time, providing protection against:

  • Accidental data deletion
  • Logical corruption
  • Application bugs that corrupt data
  • Unwanted schema changes

PITR works by combining:

  1. Base Backup: A full physical backup of the database cluster created using pg_basebackup
  2. WAL Archives: Continuous archive of all changes (Write-Ahead Logs) made after the base backup

This approach allows you to restore the database to any point in time between the base backup and the end of available WAL archives, or to a specific transaction ID or named restore point.

Architecture

Components

The TSB PostgreSQL PITR setup uses the following components:

  1. Persistent Volume Claims (PVCs):
     • tsb-backup: Stores base backups created by the backup CronJob
     • tsb-postgres-wal-archive: Stores archived WAL files continuously
  2. Backup CronJob: Scheduled job that runs pg_basebackup to create base backups
  3. WAL Archiving: PostgreSQL continuously archives WAL files to the dedicated PVC

Data Flow

[Postgres Primary]
  |
  ├─> WAL Files Generated
  |     |
  |     └─> Archived to /var/lib/postgresql/wal-archive/
  |         (mounted from tsb-postgres-wal-archive PVC)
  |
  └─> Base Backup (scheduled/manual)
        |
        └─> Stored in /var/lib/backup/
            (mounted from tsb-backup PVC)

Prerequisites

Before you get started, make sure that:

✓ You have installed and configured TSB.
✓ You have installed and configured kubectl to access the management cluster.
✓ You have administrative access to the TSB namespace.
✓ You have the base64 command available for decoding secrets.

Development Environment

For development and testing purposes, a Makefile with automation scripts is available in the TSB repository at deploy/dev/pitr/Makefile. This documentation provides all the necessary commands inline, so the Makefile is not required for production use.

Configuration

To enable PITR for the embedded PostgreSQL database, configure the following settings in your ManagementPlane specification:

spec:
  dataStore:
    embeddedPostgres:
      name: tsb
      replicas: 3
      # Enable WAL archiving for PITR
      walArchiveStorageSize: 10Gi
      walArchiveStorageClassName: standard
      walArchiveStorageAccessMode: ReadWriteOnce
      # Base backup configuration
      backupStorageSize: 50Gi
      backupStorageClassName: standard
      backupStorageAccessMode: ReadWriteOnce
      backupCronSchedule: "0 2 * * *" # Daily at 2 AM
      backupCleanupRetention: 432000s # Retain backups for 5 days

Configuration Parameters

| Parameter | Description | Default |
|---|---|---|
| walArchiveStorageSize | Size of the PVC for storing WAL archives | - |
| walArchiveStorageClassName | Storage class for the WAL archive PVC | standard |
| walArchiveStorageAccessMode | Access mode for the WAL archive PVC | ReadWriteOnce |
| backupStorageSize | Size of the PVC for storing base backups | - |
| backupStorageClassName | Storage class for the backup PVC | standard |
| backupStorageAccessMode | Access mode for the backup PVC | ReadWriteOnce |
| backupCronSchedule | Cron schedule for automatic base backups | 0 2 * * * |
| backupCleanupRetention | Retention period for old backups (in seconds) | 432000s (5 days) |
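
After you apply the updated ManagementPlane spec and the operator reconciles it, you can confirm that the two PVCs described in the Architecture section were created. This is a quick check assuming the default PVC names (tsb-backup and tsb-postgres-wal-archive):

kubectl get pvc -n tsb tsb-backup tsb-postgres-wal-archive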

Creating Base Backups

Automatic Backups

The backupCronSchedule option in the ManagementPlane spec determines when base backups run. The backup CronJob is named backup-tsb-postgres.

You can verify the CronJob:

kubectl get cronjob -n tsb backup-tsb-postgres
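
To see when the CronJob last ran and last completed successfully, you can inspect its status fields (assuming your Kubernetes version populates lastSuccessfulTime):

kubectl get cronjob -n tsb backup-tsb-postgres \
  -o jsonpath='{.status.lastScheduleTime}{"\n"}{.status.lastSuccessfulTime}{"\n"}'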

Manual Backups

To create a manual base backup immediately:

BACKUP_DATE=$(date +%s)
kubectl create job -n tsb --from=cronjob/backup-tsb-postgres manual-backup-$BACKUP_DATE

Watch the backup progress:

kubectl wait --for=condition=complete --timeout=300s -n tsb job/manual-backup-$BACKUP_DATE
kubectl logs -n tsb -l job-name=manual-backup-$BACKUP_DATE

Verifying Base Backups

To list available base backups, create a temporary pod to access the backup PVC:

kubectl run -n tsb -it backup-viewer --image=nicolaka/netshoot --rm --restart=Never \
  --overrides='{
    "spec": {
      "containers": [{
        "name": "backup-viewer",
        "image": "nicolaka/netshoot",
        "stdin": true,
        "tty": true,
        "volumeMounts": [{
          "mountPath": "/var/lib/backup",
          "name": "backup-volume"
        }]
      }],
      "volumes": [{
        "name": "backup-volume",
        "persistentVolumeClaim": {
          "claimName": "tsb-backup"
        }
      }]
    }
  }'

Inside the pod, list the backups:

ls -lh /var/lib/backup

You should see directories like tsb-postgres-base-backup-DD_MM_YYYY_HH_MM_SS.
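
Each backup directory should contain the archives produced by pg_basebackup, which are extracted later during restore (see Step 5). To inspect their contents:

ls -lh /var/lib/backup/tsb-postgres-base-backup-*/

Expect to see base.tar.gz and pg_wal.tar.gz in each directory.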

WAL Archiving

Monitoring WAL Archives

To view archived WAL files:

kubectl exec -n tsb tsb-postgres-1-0 -- ls -lh /var/lib/postgresql/wal-archive/

Forcing WAL File Switch

During normal operation, PostgreSQL switches to a new WAL file when the current one fills up (typically 16MB). To force an immediate WAL switch (useful for testing or ensuring recent changes are archived):

# Get the postgres password
POSTGRES_PASSWORD=$(kubectl get secret -n tsb postgres-credentials -o jsonpath='{.data.embedded-postgres-password}' | base64 -d)
# Force WAL switch
kubectl exec -n tsb tsb-postgres-1-0 -- \
bash -c "PGPASSWORD='${POSTGRES_PASSWORD}' psql -U tsb -d tsb -c 'SELECT pg_switch_wal();'"

Wait a few seconds, then verify the new WAL file was archived:

kubectl exec -n tsb tsb-postgres-1-0 -- ls -lh /var/lib/postgresql/wal-archive/
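
You can also ask PostgreSQL itself whether archiving is keeping up by querying the standard pg_stat_archiver view (a sketch reusing the credentials lookup shown above):

POSTGRES_PASSWORD=$(kubectl get secret -n tsb postgres-credentials -o jsonpath='{.data.embedded-postgres-password}' | base64 -d)
kubectl exec -n tsb tsb-postgres-1-0 -- \
  bash -c "PGPASSWORD='${POSTGRES_PASSWORD}' psql -U tsb -d tsb -c 'SELECT archived_count, last_archived_wal, last_archived_time, failed_count FROM pg_stat_archiver;'"

A non-zero failed_count or a stale last_archived_time suggests archiving problems that would limit PITR.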

Performing Point-in-Time Recovery

Step 1: Scale Down the Management Plane

Before starting the restore process, scale down TSB components to prevent database write activity:

kubectl scale deployment -n tsb tsb iam --replicas 0
note

Scaling down the TSB deployment only interrupts the ability to change configuration while restoration is in progress. It does not affect the data plane or running services.
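
To confirm that both deployments have scaled down before continuing, check their replica counts:

kubectl get deployment -n tsb tsb iam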

Step 2: Scale Down PostgreSQL

Scale down the kubegres controller and PostgreSQL instances:

# First scale kubegres to single replica (removes replica pods)
kubectl patch kubegres -n tsb tsb-postgres --type=merge -p '{"spec":{"replicas":1}}'
kubectl wait -n tsb --for=delete --timeout=120s pod -l replicationRole=replica || true
# Scale down kubegres controller
kubectl scale deployment -n tsb kubegres-controller-manager --replicas=0
kubectl wait -n tsb --for=delete --timeout=120s pod -l app=kubegres-controller-manager
# Scale down primary statefulset
kubectl scale statefulset -n tsb -l replicationRole=primary --replicas=0
kubectl wait -n tsb --for=delete --timeout=120s pod -l replicationRole=primary
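
At this point no PostgreSQL pods should remain in the namespace. A quick way to confirm, using the replicationRole label that the commands above rely on:

kubectl get pods -n tsb -l replicationRole

This should report that no resources were found.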

Step 3: Deploy Restore Helper Pod

Get the PostgreSQL image version:

export POSTGRES_IMAGE=$(kubectl get kubegres -n tsb tsb-postgres -o jsonpath='{.spec.image}')
echo $POSTGRES_IMAGE

Create a helper pod with access to all required volumes:

cat <<EOF | kubectl apply -n tsb -f -
apiVersion: v1
kind: Pod
metadata:
  name: postgres-restore
  namespace: tsb
spec:
  serviceAccountName: tsb-kubegres
  securityContext:
    runAsUser: 70 # postgres user
    fsGroup: 70
  containers:
  - name: postgres
    image: $POSTGRES_IMAGE
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: postgres-data
      mountPath: /var/lib/postgresql/data
    - name: backup-volume
      mountPath: /var/lib/backup
    - name: wal-archive
      mountPath: /var/lib/postgresql/wal-archive
    - name: tsb-postgres-certs
      mountPath: /var/lib/postgresql/data/tls
      readOnly: true
  volumes:
  - name: postgres-data
    persistentVolumeClaim:
      claimName: postgres-db-tsb-postgres-1-0
  - name: backup-volume
    persistentVolumeClaim:
      claimName: tsb-backup
  - name: wal-archive
    persistentVolumeClaim:
      claimName: tsb-postgres-wal-archive
  - name: tsb-postgres-certs
    secret:
      secretName: tsb-postgres-certs
      defaultMode: 0600
EOF

Wait for the pod to be ready:

kubectl wait -n tsb --for=condition=ready pod/postgres-restore --timeout=120s

Step 4: Prepare the Data Directory

Enter the restore helper pod:

kubectl exec -n tsb -it postgres-restore -- bash

Inside the pod, remove the current data directory:

note

This will delete all existing data in the PostgreSQL data directory. Ensure you have selected the correct base backup before proceeding. Alternatively, you can back up the existing data directory before deletion.
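
If you want to keep a copy of the existing data directory first, one option is to copy it onto the mounted backup volume before deleting it (the destination name below is only an example):

cp -a /var/lib/postgresql/data/pgdata /var/lib/backup/pgdata-before-restore-$(date +%s)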

rm -rf /var/lib/postgresql/data/pgdata/*

Step 5: Extract Base Backup

Choose a base backup to restore from (list available backups in /var/lib/backup/):

ls -lh /var/lib/backup/

Extract the chosen base backup:

cd /var/lib/postgresql/data/pgdata
# Replace with your actual backup directory name
BACKUP_DIR="tsb-postgres-base-backup-DD_MM_YYYY_HH_MM_SS"
tar xzf /var/lib/backup/$BACKUP_DIR/base.tar.gz
tar xzf /var/lib/backup/$BACKUP_DIR/pg_wal.tar.gz

Verify the extraction:

ls -lh /var/lib/postgresql/data/pgdata/

This should show the PostgreSQL data files and the pg_wal directory populated with WAL files.
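
The extracted base backup also includes a backup_label file recording when and where the backup started, which can help confirm you selected the intended backup:

cat /var/lib/postgresql/data/pgdata/backup_label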

Step 6: Configure Recovery

Create the recovery configuration file:

cat > /var/lib/postgresql/data/pgdata/postgresql.auto.conf << 'RECOVERY_EOF'
# Recovery configuration for PITR
restore_command = 'test -f /var/lib/postgresql/wal-archive/%f && cp /var/lib/postgresql/wal-archive/%f %p'
recovery_target_action = 'promote'
# Optional: Recover to specific point in time
# recovery_target_time = '2024-12-04 14:35:00+00'
# Or recover to end of WAL (default if not specified)
RECOVERY_EOF

Create the standby.signal file to trigger recovery mode (PostgreSQL 12+):

touch /var/lib/postgresql/data/pgdata/standby.signal

Verify the recovery configuration:

cat /var/lib/postgresql/data/pgdata/postgresql.auto.conf
ls -la /var/lib/postgresql/data/pgdata/standby.signal

Exit the pod:

exit

Step 7: Clean Up Restore Pod

Delete the restore helper pod:

kubectl delete pod -n tsb postgres-restore

Step 8: Scale Up PostgreSQL

Scale up the primary PostgreSQL instance:

kubectl scale statefulset -n tsb -l replicationRole=primary --replicas=1
kubectl wait -n tsb --for=condition=ready --timeout=180s pod -l replicationRole=primary

Watch the recovery process in the logs:

kubectl logs -n tsb -f tsb-postgres-1-0

Look for log messages indicating recovery progress:

LOG:  starting point-in-time recovery to latest
LOG:  restored log file "000000010000000000000002" from archive
LOG:  redo starts at 0/2000028
LOG:  consistent recovery state reached at 0/20000F8
LOG:  restored log file "000000010000000000000003" from archive

The database will automatically promote itself when recovery is complete (due to recovery_target_action = 'promote'). Watch for these log messages indicating successful promotion:

LOG:  selected new timeline ID: 2
LOG:  archive recovery complete
LOG:  database system is ready to accept connections

Step 9: Verify Database Promotion

Check that the database has been promoted and is accepting read-write connections:

# Get the postgres password
POSTGRES_PASSWORD=$(kubectl get secret -n tsb postgres-credentials -o jsonpath='{.data.embedded-postgres-password}' | base64 -d)
# Check if database is in recovery mode (should return 'f' for false after promotion)
kubectl exec -n tsb tsb-postgres-1-0 -- \
bash -c "PGPASSWORD='${POSTGRES_PASSWORD}' psql -U tsb -d tsb -c 'SELECT pg_is_in_recovery();'"

If the database shows it's still in recovery mode (t for true), you may need to manually promote it:

kubectl exec -n tsb tsb-postgres-1-0 -- \
bash -c "pg_ctl promote -D /var/lib/postgresql/data/pgdata"

After promotion (automatic or manual), verify the database is ready:

kubectl logs -n tsb tsb-postgres-1-0 --tail=20

You should see:

LOG:  database system is ready to accept connections

Step 10: Scale Up Replicas and Controller

Once the primary is running and recovered, scale up the kubegres controller and replicas:

# Scale up kubegres controller
kubectl scale deployment -n tsb kubegres-controller-manager --replicas=1
kubectl wait -n tsb --for=condition=available --timeout=120s deployment/kubegres-controller-manager
# Scale up replicas
kubectl patch kubegres -n tsb tsb-postgres --type=merge -p '{"spec":{"replicas":3}}'
kubectl wait -n tsb --for=condition=ready --timeout=180s pod -l replicationRole=replica
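
To confirm that the primary and both replicas are back, list the PostgreSQL pods together with their roles:

kubectl get pods -n tsb -l replicationRole -L replicationRole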

Step 11: Scale Up Management Plane

Finally, scale up the TSB components:

kubectl scale deployment -n tsb tsb iam --replicas 1
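
Optionally, wait for the management plane deployments to report as available again:

kubectl wait -n tsb --for=condition=available --timeout=180s deployment/tsb deployment/iam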

Verification

After the restore is complete, verify that the database is functioning correctly.

First, get the PostgreSQL password:

POSTGRES_PASSWORD=$(kubectl get secret -n tsb postgres-credentials -o jsonpath='{.data.embedded-postgres-password}' | base64 -d)

Check Database Connectivity

kubectl exec -n tsb tsb-postgres-1-0 -- \
bash -c "PGPASSWORD='${POSTGRES_PASSWORD}' psql -U tsb -d tsb -c 'SELECT version();'"

Verify Data

Check that your data is present at the expected recovery point. For example, to list tenants:

kubectl exec -n tsb tsb-postgres-1-0 -- \
bash -c "PGPASSWORD='${POSTGRES_PASSWORD}' psql -U tsb -d tsb -c 'SELECT name FROM tenant;'"

Check Replication Status

Verify that replication to replicas is working:

kubectl exec -n tsb tsb-postgres-1-0 -- \
bash -c "PGPASSWORD='${POSTGRES_PASSWORD}' psql -U tsb -d tsb -c 'SELECT * FROM pg_stat_replication;'"

Advanced Recovery Options

Recovery to Specific Point in Time

To recover to a specific timestamp instead of the end of available WAL logs, modify the recovery configuration in Step 6:

cat > /var/lib/postgresql/data/pgdata/postgresql.auto.conf << 'RECOVERY_EOF'
# Recovery configuration for PITR
restore_command = 'test -f /var/lib/postgresql/wal-archive/%f && cp /var/lib/postgresql/wal-archive/%f %p'
recovery_target_action = 'promote'
# Recover to specific point in time (adjust timezone as needed)
recovery_target_time = '2024-12-04 14:35:00+00'
RECOVERY_EOF
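
The chosen recovery_target_time must fall between the end of the base backup you restored and the last archived WAL file. If you are unsure how far forward your archives reach, the modification times of the archived WAL files (viewed from the restore helper pod) give a rough upper bound:

ls -lt /var/lib/postgresql/wal-archive/ | head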

Recovery to Specific Transaction ID

If you know the transaction ID you want to recover to:

cat > /var/lib/postgresql/data/pgdata/postgresql.auto.conf << 'RECOVERY_EOF'
# Recovery configuration for PITR
restore_command = 'test -f /var/lib/postgresql/wal-archive/%f && cp /var/lib/postgresql/wal-archive/%f %p'
recovery_target_action = 'promote'
# Recover to specific transaction ID
recovery_target_xid = '12345'
RECOVERY_EOF
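
Transaction IDs are normally captured ahead of time, for example right before a risky change. A possible way to record the current transaction ID for later use as recovery_target_xid:

POSTGRES_PASSWORD=$(kubectl get secret -n tsb postgres-credentials -o jsonpath='{.data.embedded-postgres-password}' | base64 -d)
kubectl exec -n tsb tsb-postgres-1-0 -- \
  bash -c "PGPASSWORD='${POSTGRES_PASSWORD}' psql -U tsb -d tsb -c 'SELECT txid_current();'"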

Recovery to Named Restore Point

You can create named restore points before critical operations:

# Get the postgres password
POSTGRES_PASSWORD=$(kubectl get secret -n tsb postgres-credentials -o jsonpath='{.data.embedded-postgres-password}' | base64 -d)
# Create a named restore point
kubectl exec -n tsb tsb-postgres-1-0 -- \
bash -c "PGPASSWORD='${POSTGRES_PASSWORD}' psql -U tsb -d tsb -c \"SELECT pg_create_restore_point('before_critical_operation');\""

Then during recovery, use the named restore point:

cat > /var/lib/postgresql/data/pgdata/postgresql.auto.conf << 'RECOVERY_EOF'
# Recovery configuration for PITR
restore_command = 'test -f /var/lib/postgresql/wal-archive/%f && cp /var/lib/postgresql/wal-archive/%f %p'
recovery_target_action = 'promote'
# Recover to named restore point
recovery_target_name = 'before_critical_operation'
RECOVERY_EOF

Troubleshooting

Recovery Fails with "requested timeline is not a child of this server's history"

This error indicates a timeline mismatch. Ensure you're using the correct base backup and WAL files from the same timeline.

Solution: Start with a fresh base backup and ensure the WAL archive directory contains all necessary WAL files from that backup forward.

"could not open file: No such file or directory"

This error during recovery means a required WAL file is missing from the archive.

Solution:

  1. Check that WAL archiving was properly configured and running
  2. Verify all WAL files are present in /var/lib/postgresql/wal-archive/
  3. You may need to recover to an earlier point in time where all WAL files are available

Recovery Stops at "consistent recovery state reached" but doesn't complete

If recovery appears to hang after reaching a consistent state, check the recovery logs for errors.

Solution:

  1. Check pod logs: kubectl logs -n tsb tsb-postgres-1-0
  2. Verify the recovery_target_action is set to promote
  3. Ensure all referenced WAL files are accessible

Insufficient Space for WAL Archives

If WAL archive PVC fills up, archiving will fail and PITR will be limited.

Solution:

  1. Increase walArchiveStorageSize in the ManagementPlane spec
  2. Implement WAL archive cleanup policies
  3. Consider moving old archives to external storage
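
To check how full the WAL archive volume currently is, you can run df against its mount point inside a PostgreSQL pod:

kubectl exec -n tsb tsb-postgres-1-0 -- df -h /var/lib/postgresql/wal-archive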

Best Practices

  1. Regular Base Backups: Ensure base backups run regularly (daily recommended)
  2. Monitor WAL Archive Space: Set up alerts for WAL archive PVC usage
  3. Test Recovery Procedures: Periodically test your PITR process in a non-production environment
  4. Document Recovery Points: Before critical operations, create named restore points
  5. Retain Multiple Base Backups: Keep at least 3-5 base backups for flexibility
  6. Archive Offline Copies: Periodically copy base backups and WAL archives to external storage
  7. Monitor Backup Jobs: Set up alerts for backup job failures
