Point-in-Time Recovery (PITR) with Embedded PostgreSQL
This document describes how to configure and perform Point-in-Time Recovery (PITR) for the TSB embedded PostgreSQL database using WAL (Write-Ahead Log) archiving.
This page refers to the embedded PostgreSQL installation only. If you are using a dedicated PostgreSQL installation, please refer to the Backup and restore PostgreSQL page.
Overview
Point-in-Time Recovery (PITR) allows you to restore a PostgreSQL database to any specific point in time, providing protection against:
- Accidental data deletion
- Logical corruption
- Application bugs that corrupt data
- Unwanted schema changes
PITR works by combining:
- Base Backup: A full physical backup of the database cluster created using pg_basebackup
- WAL Archives: A continuous archive of all changes (Write-Ahead Logs) made after the base backup
This approach allows you to restore the database to any point in time between the base backup and the end of available WAL archives, or to a specific transaction ID or named restore point.
Architecture
Components
The TSB PostgreSQL PITR setup uses the following components:
- Persistent Volume Claims (PVCs):
  - tsb-backup: Stores base backups created by the backup CronJob
  - tsb-postgres-wal-archive: Stores archived WAL files continuously
- Backup CronJob: Scheduled job that runs pg_basebackup to create base backups
- WAL Archiving: PostgreSQL continuously archives WAL files to the dedicated PVC
Data Flow
[Postgres Primary]
|
├─> WAL Files Generated
| |
| └─> Archived to /var/lib/postgresql/wal-archive/
| (mounted from tsb-postgres-wal-archive PVC)
|
└─> Base Backup (scheduled/manual)
|
└─> Stored in /var/lib/backup/
(mounted from tsb-backup PVC)
Prerequisites
Before you get started, make sure:
✓ You have installed and configured TSB.
✓ You have installed and configured kubectl to access the management cluster.
✓ You have administrative access to the TSB namespace.
✓ You have the base64 command available for decoding secrets.
For development and testing purposes, a Makefile with automation scripts is available in the TSB repository at deploy/dev/pitr/Makefile. This documentation provides all the necessary commands inline, so the Makefile is not required for production use.
Configuration
To enable PITR for the embedded PostgreSQL database, configure the following settings in your ManagementPlane specification:
spec:
dataStore:
embeddedPostgres:
name: tsb
replicas: 3
# Enable WAL archiving for PITR
walArchiveStorageSize: 10Gi
walArchiveStorageClassName: standard
walArchiveStorageAccessMode: ReadWriteOnce
# Base backup configuration
backupStorageSize: 50Gi
backupStorageClassName: standard
backupStorageAccessMode: ReadWriteOnce
backupCronSchedule: "0 2 * * *" # Daily at 2 AM
backupCleanupRetention: 432000s # Retain backups for 5 days
Configuration Parameters
| Parameter | Description | Default |
|---|---|---|
| walArchiveStorageSize | Size of the PVC for storing WAL archives | - |
| walArchiveStorageClassName | Storage class for the WAL archive PVC | standard |
| walArchiveStorageAccessMode | Access mode for the WAL archive PVC | ReadWriteOnce |
| backupStorageSize | Size of the PVC for storing base backups | - |
| backupStorageClassName | Storage class for the backup PVC | standard |
| backupStorageAccessMode | Access mode for the backup PVC | ReadWriteOnce |
| backupCronSchedule | Cron schedule for automatic base backups | 0 2 * * * |
| backupCleanupRetention | Retention period for old backups (in seconds) | 432000s (5 days) |
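If you manage the ManagementPlane resource directly with kubectl, you can add these fields in place. A minimal sketch, assuming the ManagementPlane resource lives in the tsb namespace (verify its actual name with the first command):
kubectl get managementplane -n tsb
kubectl edit managementplane -n tsb <name>
# add the walArchive* and backup* fields under spec.dataStore.embeddedPostgres as shown above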
Creating Base Backups
Automatic Backups
The backupCronSchedule option determines when base backups are created automatically. The backup CronJob is named backup-tsb-postgres.
You can verify the CronJob:
kubectl get cronjob -n tsb backup-tsb-postgres
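To see the jobs the CronJob has already created and when it last ran (assuming the default job history limits keep a few recent runs around):
kubectl get jobs -n tsb | grep backup-tsb-postgres
kubectl get cronjob -n tsb backup-tsb-postgres -o jsonpath='{.status.lastScheduleTime}'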
Manual Backups
To create a manual base backup immediately:
BACKUP_DATE=$(date +%s)
kubectl create job -n tsb --from=cronjob/backup-tsb-postgres manual-backup-$BACKUP_DATE
Watch the backup progress:
kubectl wait --for=condition=complete --timeout=300s -n tsb job/manual-backup-$BACKUP_DATE
kubectl logs -n tsb -l job-name=manual-backup-$BACKUP_DATE
Verifying Base Backups
To list available base backups, create a temporary pod to access the backup PVC:
kubectl run -n tsb -it backup-viewer --image=nicolaka/netshoot --rm --restart=Never \
--overrides='{
"spec": {
"containers": [{
"name": "backup-viewer",
"image": "nicolaka/netshoot",
"stdin": true,
"tty": true,
"volumeMounts": [{
"mountPath": "/var/lib/backup",
"name": "backup-volume"
}]
}],
"volumes": [{
"name": "backup-volume",
"persistentVolumeClaim": {
"claimName": "tsb-backup"
}
}]
}
}'
Inside the pod, list the backups:
ls -lh /var/lib/backup
You should see directories like tsb-postgres-base-backup-DD_MM_YYYY_HH_MM_SS.
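Each backup directory should contain the tar archives produced by pg_basebackup; the restore steps later in this document expect base.tar.gz and pg_wal.tar.gz to be present. For example, still inside the viewer pod:
ls -lh /var/lib/backup/tsb-postgres-base-backup-DD_MM_YYYY_HH_MM_SS/
# expected contents: base.tar.gz and pg_wal.tar.gz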
WAL Archiving
Monitoring WAL Archives
To view archived WAL files:
kubectl exec -n tsb tsb-postgres-1-0 -- ls -lh /var/lib/postgresql/wal-archive/
Forcing WAL File Switch
During normal operation, PostgreSQL switches to a new WAL file when the current one fills up (typically 16MB). To force an immediate WAL switch (useful for testing or ensuring recent changes are archived):
# Get the postgres password
POSTGRES_PASSWORD=$(kubectl get secret -n tsb postgres-credentials -o jsonpath='{.data.embedded-postgres-password}' | base64 -d)
# Force WAL switch
kubectl exec -n tsb tsb-postgres-1-0 -- \
bash -c "PGPASSWORD='${POSTGRES_PASSWORD}' psql -U tsb -d tsb -c 'SELECT pg_switch_wal();'"
Wait a few seconds, then verify the new WAL file was archived:
kubectl exec -n tsb tsb-postgres-1-0 -- ls -lh /var/lib/postgresql/wal-archive/
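You can also ask PostgreSQL itself whether archiving is keeping up by querying the standard pg_stat_archiver view; last_archived_wal and failed_count give a quick health check (this assumes the POSTGRES_PASSWORD variable from the previous step is still set):
kubectl exec -n tsb tsb-postgres-1-0 -- \
  bash -c "PGPASSWORD='${POSTGRES_PASSWORD}' psql -U tsb -d tsb -c 'SELECT last_archived_wal, last_archived_time, failed_count FROM pg_stat_archiver;'"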
Performing Point-in-Time Recovery
Step 1: Scale Down the Management Plane
Before starting the restore process, scale down TSB components to prevent database write activity:
kubectl scale deployment -n tsb tsb iam --replicas 0
Scaling down the TSB deployment only interrupts the ability to change configuration while restoration is in progress. It does not affect the data plane or running services.
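To confirm the management plane components are fully scaled down before proceeding:
kubectl get deployment -n tsb tsb iam
# both deployments should report 0/0 ready replicas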
Step 2: Scale Down PostgreSQL
Scale down the kubegres controller and PostgreSQL instances:
# First scale kubegres to single replica (removes replica pods)
kubectl patch kubegres -n tsb tsb-postgres --type=merge -p '{"spec":{"replicas":1}}'
kubectl wait -n tsb --for=delete --timeout=120s pod -l replicationRole=replica || true
# Scale down kubegres controller
kubectl scale deployment -n tsb kubegres-controller-manager --replicas=0
kubectl wait -n tsb --for=delete --timeout=120s pod -l app=kubegres-controller-manager
# Scale down primary statefulset
kubectl scale statefulset -n tsb -l replicationRole=primary --replicas=0
kubectl wait -n tsb --for=delete --timeout=120s pod -l replicationRole=primary
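Before moving on, confirm that no PostgreSQL pods are left running (pod names for the embedded database start with tsb-postgres):
kubectl get pods -n tsb | grep tsb-postgres || echo "no postgres pods running"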
Step 3: Deploy Restore Helper Pod
Get the PostgreSQL image version:
export POSTGRES_IMAGE=$(kubectl get kubegres -n tsb tsb-postgres -o jsonpath='{.spec.image}')
echo $POSTGRES_IMAGE
Create a helper pod with access to all required volumes:
cat <<EOF | kubectl apply -n tsb -f -
apiVersion: v1
kind: Pod
metadata:
name: postgres-restore
namespace: tsb
spec:
serviceAccountName: tsb-kubegres
securityContext:
runAsUser: 70 # postgres user
fsGroup: 70
containers:
- name: postgres
image: $POSTGRES_IMAGE
command: ["sleep", "infinity"]
volumeMounts:
- name: postgres-data
mountPath: /var/lib/postgresql/data
- name: backup-volume
mountPath: /var/lib/backup
- name: wal-archive
mountPath: /var/lib/postgresql/wal-archive
- name: tsb-postgres-certs
mountPath: /var/lib/postgresql/data/tls
readOnly: true
volumes:
- name: postgres-data
persistentVolumeClaim:
claimName: postgres-db-tsb-postgres-1-0
- name: backup-volume
persistentVolumeClaim:
claimName: tsb-backup
- name: wal-archive
persistentVolumeClaim:
claimName: tsb-postgres-wal-archive
- name: tsb-postgres-certs
secret:
secretName: tsb-postgres-certs
defaultMode: 0600
EOF
Wait for the pod to be ready:
kubectl wait -n tsb --for=condition=ready pod/postgres-restore --timeout=120s
Step 4: Prepare the Data Directory
Enter the restore helper pod:
kubectl exec -n tsb -it postgres-restore -- bash
Inside the pod, remove the current data directory:
This will delete all existing data in the PostgreSQL data directory. Ensure you have selected the correct base backup before proceeding. Alternatively, you can back up the existing data directory before deletion.
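If you want to keep a copy of the current data directory before deleting it, a minimal sketch run inside the restore pod (the archive name and its location on the backup PVC are illustrative):
tar czf /var/lib/backup/pgdata-pre-restore-$(date +%s).tar.gz -C /var/lib/postgresql/data pgdata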
rm -rf /var/lib/postgresql/data/pgdata/*
Step 5: Extract Base Backup
Choose a base backup to restore from (list available backups in /var/lib/backup/):
ls -lh /var/lib/backup/
Extract the chosen base backup:
cd /var/lib/postgresql/data/pgdata
# Replace with your actual backup directory name
BACKUP_DIR="tsb-postgres-base-backup-DD_MM_YYYY_HH_MM_SS"
tar xzf /var/lib/backup/$BACKUP_DIR/base.tar.gz
tar xzf /var/lib/backup/$BACKUP_DIR/pg_wal.tar.gz
Verify the extraction:
ls -lh /var/lib/postgresql/data/pgdata/
This should show the PostgreSQL data files and the pg_wal directory populated with WAL files.
Step 6: Configure Recovery
Create the recovery configuration file:
cat > /var/lib/postgresql/data/pgdata/postgresql.auto.conf << 'RECOVERY_EOF'
# Recovery configuration for PITR
restore_command = 'test -f /var/lib/postgresql/wal-archive/%f && cp /var/lib/postgresql/wal-archive/%f %p'
recovery_target_action = 'promote'
# Optional: Recover to specific point in time
# recovery_target_time = '2024-12-04 14:35:00+00'
# Or recover to end of WAL (default if not specified)
RECOVERY_EOF
Create the standby.signal file to trigger recovery mode (PostgreSQL 12+):
touch /var/lib/postgresql/data/pgdata/standby.signal
Verify the recovery configuration:
cat /var/lib/postgresql/data/pgdata/postgresql.auto.conf
ls -la /var/lib/postgresql/data/pgdata/standby.signal
Exit the pod:
exit
Step 7: Clean Up Restore Pod
Delete the restore helper pod:
kubectl delete pod -n tsb postgres-restore
Step 8: Scale Up PostgreSQL
Scale up the primary PostgreSQL instance:
kubectl scale statefulset -n tsb -l replicationRole=primary --replicas=1
kubectl wait -n tsb --for=condition=ready --timeout=180s pod -l replicationRole=primary
Watch the recovery process in the logs:
kubectl logs -n tsb -f tsb-postgres-1-0
Look for log messages indicating recovery progress:
LOG: starting point-in-time recovery to latest
LOG: restored log file "000000010000000000000002" from archive
LOG: redo starts at 0/2000028
LOG: consistent recovery state reached at 0/20000F8
LOG: restored log file "000000010000000000000003" from archive
The database will automatically promote itself when recovery is complete (due to recovery_target_action = 'promote'). Watch for these log messages indicating successful promotion:
LOG: selected new timeline ID: 2
LOG: archive recovery complete
LOG: database system is ready to accept connections
Step 9: Verify Database Promotion
Check that the database has been promoted and is accepting read-write connections:
# Get the postgres password
POSTGRES_PASSWORD=$(kubectl get secret -n tsb postgres-credentials -o jsonpath='{.data.embedded-postgres-password}' | base64 -d)
# Check if database is in recovery mode (should return 'f' for false after promotion)
kubectl exec -n tsb tsb-postgres-1-0 -- \
bash -c "PGPASSWORD='${POSTGRES_PASSWORD}' psql -U tsb -d tsb -c 'SELECT pg_is_in_recovery();'"
If the database shows it's still in recovery mode (t for true), you may need to manually promote it:
kubectl exec -n tsb tsb-postgres-1-0 -- \
bash -c "pg_ctl promote -D /var/lib/postgresql/data/pgdata"
After promotion (automatic or manual), verify the database is ready:
kubectl logs -n tsb tsb-postgres-1-0 --tail=20
You should see:
LOG: database system is ready to accept connections
Step 10: Scale Up Replicas and Controller
Once the primary is running and recovered, scale up the kubegres controller and replicas:
# Scale up kubegres controller
kubectl scale deployment -n tsb kubegres-controller-manager --replicas=1
kubectl wait -n tsb --for=condition=available --timeout=120s deployment/kubegres-controller-manager
# Scale up replicas
kubectl patch kubegres -n tsb tsb-postgres --type=merge -p '{"spec":{"replicas":3}}'
kubectl wait -n tsb --for=condition=ready --timeout=180s pod -l replicationRole=replica
Step 11: Scale Up Management Plane
Finally, scale up the TSB components:
kubectl scale deployment -n tsb tsb iam --replicas 1
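Optionally, wait for both deployments to become available again:
kubectl wait -n tsb --for=condition=available --timeout=180s deployment/tsb deployment/iam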
Verification
After the restore is complete, verify that the database is functioning correctly.
First, get the PostgreSQL password:
POSTGRES_PASSWORD=$(kubectl get secret -n tsb postgres-credentials -o jsonpath='{.data.embedded-postgres-password}' | base64 -d)
Check Database Connectivity
kubectl exec -n tsb tsb-postgres-1-0 -- \
bash -c "PGPASSWORD='${POSTGRES_PASSWORD}' psql -U tsb -d tsb -c 'SELECT version();'"
Verify Data
Check that your data is present at the expected recovery point. For example, to list tenants:
kubectl exec -n tsb tsb-postgres-1-0 -- \
bash -c "PGPASSWORD='${POSTGRES_PASSWORD}' psql -U tsb -d tsb -c 'SELECT name FROM tenant;'"
Check Replication Status
Verify that replication to replicas is working:
kubectl exec -n tsb tsb-postgres-1-0 -- \
bash -c "PGPASSWORD='${POSTGRES_PASSWORD}' psql -U tsb -d tsb -c 'SELECT * FROM pg_stat_replication;'"
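Each replica should appear in the output with state set to streaming once it has caught up. For a more compact view, you can select just the relevant columns (these are standard pg_stat_replication columns):
kubectl exec -n tsb tsb-postgres-1-0 -- \
  bash -c "PGPASSWORD='${POSTGRES_PASSWORD}' psql -U tsb -d tsb -c 'SELECT application_name, client_addr, state, sync_state FROM pg_stat_replication;'"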
Advanced Recovery Options
Recovery to Specific Point in Time
To recover to a specific timestamp instead of the end of available WAL logs, modify the recovery configuration in Step 6:
cat > /var/lib/postgresql/data/pgdata/postgresql.auto.conf << 'RECOVERY_EOF'
# Recovery configuration for PITR
restore_command = 'test -f /var/lib/postgresql/wal-archive/%f && cp /var/lib/postgresql/wal-archive/%f %p'
recovery_target_action = 'promote'
# Recover to specific point in time (adjust timezone as needed)
recovery_target_time = '2024-12-04 14:35:00+00'
RECOVERY_EOF
Recovery to Specific Transaction ID
If you know the transaction ID you want to recover to:
cat > /var/lib/postgresql/data/pgdata/postgresql.auto.conf << 'RECOVERY_EOF'
# Recovery configuration for PITR
restore_command = 'test -f /var/lib/postgresql/wal-archive/%f && cp /var/lib/postgresql/wal-archive/%f %p'
recovery_target_action = 'promote'
# Recover to specific transaction ID
recovery_target_xid = '12345'
RECOVERY_EOF
Recovery to Named Restore Point
You can create named restore points before critical operations:
# Get the postgres password
POSTGRES_PASSWORD=$(kubectl get secret -n tsb postgres-credentials -o jsonpath='{.data.embedded-postgres-password}' | base64 -d)
# Create a named restore point
kubectl exec -n tsb tsb-postgres-1-0 -- \
bash -c "PGPASSWORD='${POSTGRES_PASSWORD}' psql -U tsb -d tsb -c \"SELECT pg_create_restore_point('before_critical_operation');\""
Then during recovery, use the named restore point:
cat > /var/lib/postgresql/data/pgdata/postgresql.auto.conf << 'RECOVERY_EOF'
# Recovery configuration for PITR
restore_command = 'test -f /var/lib/postgresql/wal-archive/%f && cp /var/lib/postgresql/wal-archive/%f %p'
recovery_target_action = 'promote'
# Recover to named restore point
recovery_target_name = 'before_critical_operation'
RECOVERY_EOF
Troubleshooting
Recovery Fails with "requested timeline is not a child of this server's history"
This error indicates a timeline mismatch. Ensure you're using the correct base backup and WAL files from the same timeline.
Solution: Start with a fresh base backup and ensure the WAL archive directory contains all necessary WAL files from that backup forward.
"could not open file: No such file or directory"
This error during recovery means a required WAL file is missing from the archive.
Solution:
- Check that WAL archiving was properly configured and running
- Verify all WAL files are present in /var/lib/postgresql/wal-archive/ (see the check below)
- You may need to recover to an earlier point in time where all WAL files are available
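The recovery log line names the WAL segment PostgreSQL could not find. As a sketch, you can check whether that segment exists in the archive from any pod that mounts the tsb-postgres-wal-archive PVC (re-create the restore helper pod from Step 3 if you have already deleted it), replacing the segment name with the one from your logs:
kubectl exec -n tsb postgres-restore -- ls /var/lib/postgresql/wal-archive/ | grep 000000010000000000000003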
Recovery Stops at "consistent recovery state reached" but doesn't complete
If recovery appears to hang after reaching a consistent state, check the recovery logs for errors.
Solution:
- Check the pod logs: kubectl logs -n tsb tsb-postgres-1-0
- Verify that recovery_target_action is set to promote
- Ensure all referenced WAL files are accessible
Insufficient Space for WAL Archives
If the WAL archive PVC fills up, archiving will fail and PITR coverage will be limited.
Solution:
- Increase walArchiveStorageSize in the ManagementPlane spec
- Implement WAL archive cleanup policies (see the example below)
- Consider moving old archives to external storage
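As an illustration of a cleanup policy, the pg_archivecleanup utility that ships with PostgreSQL removes WAL segments older than a given file. Only remove segments that predate the oldest base backup you still intend to restore from; the segment name below is a placeholder:
kubectl exec -n tsb tsb-postgres-1-0 -- \
  pg_archivecleanup /var/lib/postgresql/wal-archive 000000010000000000000010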
Best Practices
- Regular Base Backups: Ensure base backups run regularly (daily recommended)
- Monitor WAL Archive Space: Set up alerts for WAL archive PVC usage
- Test Recovery Procedures: Periodically test your PITR process in a non-production environment
- Document Recovery Points: Before critical operations, create named restore points
- Retain Multiple Base Backups: Keep at least 3-5 base backups for flexibility
- Archive Offline Copies: Periodically copy base backups and WAL archives to external storage
- Monitor Backup Jobs: Set up alerts for backup job failures