Skip to content

Backup & Recovery

Overview

ZKProva uses AWS RDS PostgreSQL 16 with automated backups, Multi-AZ deployment, and deletion protection enabled by default.

Recovery Objectives

Metric Target
RTO (Recovery Time Objective) 30 minutes
RPO (Recovery Point Objective) 1 day (24 hours)

Backup Configuration

  • Automated backups: Daily at 03:00-04:00 UTC
  • Retention period: 7 days
  • Maintenance window: Sunday 05:00-06:00 UTC
  • Copy tags to snapshots: Enabled
  • Final snapshot on deletion: Enabled (identifier: zkprova-db-final-YYYY-MM-DD)

High Availability

  • Multi-AZ: Enabled — synchronous standby replica in a different AZ
  • Automatic failover: RDS handles failover transparently (typically 60-120 seconds)
  • Deletion protection: Enabled — must be disabled manually before instance deletion

Monitoring

  • CloudWatch alarm: zkprova-rds-free-storage-low fires when free storage drops below 2 GB
  • Alarm notifications are sent to the configured SNS topic

Manual Restore Procedure

From Automated Snapshot

# 1. List available snapshots
aws rds describe-db-snapshots \
  --db-instance-identifier zkprova-db \
  --snapshot-type automated \
  --query 'reverse(sort_by(DBSnapshots, &SnapshotCreateTime))[:5].[DBSnapshotIdentifier,SnapshotCreateTime]' \
  --output table

# 2. Restore to a new instance
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier zkprova-db-restored \
  --db-snapshot-identifier <snapshot-id> \
  --db-instance-class db.t3.medium \
  --db-subnet-group-name zkprova-db-subnet-group \
  --multi-az

# 3. Wait for the new instance
aws rds wait db-instance-available \
  --db-instance-identifier zkprova-db-restored

# 4. Update application DATABASE_URL to point to new endpoint

# 5. Verify data integrity, then decommission old instance

Point-in-Time Recovery

aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier zkprova-db \
  --target-db-instance-identifier zkprova-db-pit \
  --restore-time "2026-02-28T12:00:00Z" \
  --db-instance-class db.t3.medium \
  --multi-az

Automated Restore Testing

Run the restore test script weekly to verify backups are usable:

# Requires: AWS CLI configured, DB_PASSWORD and DB_USERNAME env vars
./scripts/test-rds-restore.sh

# Or with a custom identifier
./scripts/test-rds-restore.sh zkprova-db

The script: 1. Finds the latest automated snapshot 2. Restores to a temporary db.t3.micro instance 3. Runs smoke tests (verifies critical tables exist and have data) 4. Deletes the temporary instance

Weekly Test Schedule

Day Task
Monday Run ./scripts/test-rds-restore.sh and verify output
Quarterly Full disaster recovery drill — restore and switch traffic

Terraform Variables

Variable Default Description
multi_az true Multi-AZ deployment
backup_retention_period 7 Days to retain backups
backup_window 03:00-04:00 Daily backup window (UTC)
maintenance_window sun:05:00-sun:06:00 Weekly maintenance window
skip_final_snapshot false Skip final snapshot on deletion
deletion_protection true Prevent accidental deletion
alarm_sns_topic_arn "" SNS topic for CloudWatch alarms