Backup & Recovery¶
Overview¶
ZKProva uses AWS RDS PostgreSQL 16 with automated backups, Multi-AZ deployment, and deletion protection enabled by default.
Recovery Objectives¶
| Metric | Target |
|---|---|
| RTO (Recovery Time Objective) | 30 minutes |
| RPO (Recovery Point Objective) | 1 day (24 hours) |
Backup Configuration¶
- Automated backups: Daily at 03:00-04:00 UTC
- Retention period: 7 days
- Maintenance window: Sunday 05:00-06:00 UTC
- Copy tags to snapshots: Enabled
- Final snapshot on deletion: Enabled (identifier:
zkprova-db-final-YYYY-MM-DD)
High Availability¶
- Multi-AZ: Enabled — synchronous standby replica in a different AZ
- Automatic failover: RDS handles failover transparently (typically 60-120 seconds)
- Deletion protection: Enabled — must be disabled manually before instance deletion
Monitoring¶
- CloudWatch alarm:
zkprova-rds-free-storage-lowfires when free storage drops below 2 GB - Alarm notifications are sent to the configured SNS topic
Manual Restore Procedure¶
From Automated Snapshot¶
# 1. List available snapshots
aws rds describe-db-snapshots \
--db-instance-identifier zkprova-db \
--snapshot-type automated \
--query 'reverse(sort_by(DBSnapshots, &SnapshotCreateTime))[:5].[DBSnapshotIdentifier,SnapshotCreateTime]' \
--output table
# 2. Restore to a new instance
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier zkprova-db-restored \
--db-snapshot-identifier <snapshot-id> \
--db-instance-class db.t3.medium \
--db-subnet-group-name zkprova-db-subnet-group \
--multi-az
# 3. Wait for the new instance
aws rds wait db-instance-available \
--db-instance-identifier zkprova-db-restored
# 4. Update application DATABASE_URL to point to new endpoint
# 5. Verify data integrity, then decommission old instance
Point-in-Time Recovery¶
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier zkprova-db \
--target-db-instance-identifier zkprova-db-pit \
--restore-time "2026-02-28T12:00:00Z" \
--db-instance-class db.t3.medium \
--multi-az
Automated Restore Testing¶
Run the restore test script weekly to verify backups are usable:
# Requires: AWS CLI configured, DB_PASSWORD and DB_USERNAME env vars
./scripts/test-rds-restore.sh
# Or with a custom identifier
./scripts/test-rds-restore.sh zkprova-db
The script:
1. Finds the latest automated snapshot
2. Restores to a temporary db.t3.micro instance
3. Runs smoke tests (verifies critical tables exist and have data)
4. Deletes the temporary instance
Weekly Test Schedule¶
| Day | Task |
|---|---|
| Monday | Run ./scripts/test-rds-restore.sh and verify output |
| Quarterly | Full disaster recovery drill — restore and switch traffic |
Terraform Variables¶
| Variable | Default | Description |
|---|---|---|
multi_az |
true |
Multi-AZ deployment |
backup_retention_period |
7 |
Days to retain backups |
backup_window |
03:00-04:00 |
Daily backup window (UTC) |
maintenance_window |
sun:05:00-sun:06:00 |
Weekly maintenance window |
skip_final_snapshot |
false |
Skip final snapshot on deletion |
deletion_protection |
true |
Prevent accidental deletion |
alarm_sns_topic_arn |
"" |
SNS topic for CloudWatch alarms |