Business Continuity Plan¶
- Organization: ZKProva
- System: ZKP-Powered Portable Credit Union Identity
- SOC 2 Criteria: A1.3 (Recovery from Disruptions)
- Document Version: 1.0
- Effective Date: 2026-02-28
- Classification: Confidential
- Review Cadence: Quarterly (next review: 2026-05-31)
Table of Contents¶
- Recovery Objectives
- System Component Recovery
- External Dependency Map
- Failover Procedures
- Testing Cadence
- Production Configuration Reference
- Document Control
Recovery Objectives¶
| Component | RPO (Data Loss Tolerance) | RTO (Recovery Time) | Justification |
|---|---|---|---|
| PostgreSQL (RDS) | 24 hours | 30 minutes | Automated daily backups at 03:00–04:00 UTC with 7-day retention. Point-in-time recovery available within backup window. Multi-AZ failover for infrastructure failures. |
| Redis (ElastiCache) | 0 (ephemeral) | 5 minutes | Redis stores only rate-limit counters and session cache. Data loss is non-catastrophic — counters reset and sessions require re-authentication. Rebuilt from scratch on failure. |
| EKS (Application Tier) | 0 (stateless) | 15 minutes | Application pods are stateless. All state lives in RDS and Redis. Recovery = new pod scheduling via Kubernetes. PDB ensures minimum availability during disruptions. |
| AWS Secrets Manager | 0 (managed) | 10 minutes | AWS-managed durability. Secrets cached at application startup via @lru_cache. Recovery requires kubectl rollout restart to refresh cache if secrets are rotated. |
| TLS Certificates (ACM) | 0 (managed) | 0 (auto-renewal) | AWS Certificate Manager handles renewal automatically. Certificate covers *.zkprova.com with DNS validation. |
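The `@lru_cache` startup caching noted for Secrets Manager is why running pods ride out a secrets outage, and why rotated secrets require a rollout restart. A minimal sketch of the pattern — the `get_secret` function and secret names here are illustrative, not the actual ZKProva code:

```python
from functools import lru_cache

# Stand-in for a boto3 Secrets Manager call; in production this would be
# client.get_secret_value(SecretId=name)["SecretString"].
FETCH_COUNT = {"calls": 0}

def _fetch_from_secrets_manager(name):
    FETCH_COUNT["calls"] += 1
    return "secret-value-for-" + name

@lru_cache(maxsize=None)
def get_secret(name):
    """Fetched once per process; later calls hit the in-memory cache."""
    return _fetch_from_secrets_manager(name)

# Repeated lookups reuse the cached value, so a Secrets Manager outage
# after startup does not impact a running pod.
get_secret("jwt-signing-key")
get_secret("jwt-signing-key")
assert FETCH_COUNT["calls"] == 1

# A rotation only takes effect once the cache is gone, which is why the
# plan requires `kubectl rollout restart` after rotating a secret.
get_secret.cache_clear()
get_secret("jwt-signing-key")
assert FETCH_COUNT["calls"] == 2
```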
Recovery Priority Order¶
In a multi-component failure, recover in this order:
1. RDS PostgreSQL: all business data, credentials, and audit logs depend on it
2. AWS Secrets Manager: required for application startup (JWT key, encryption key, Ed25519 key)
3. EKS Application Pods: require DB and secrets to be available
4. ElastiCache Redis: rate limiting and session cache; non-critical for core functionality
5. Frontend: can operate in degraded mode while the backend recovers
System Component Recovery¶
PostgreSQL (RDS)¶
Instance: zkprova-db | Engine: PostgreSQL 16 | Storage: gp3, encrypted
Multi-AZ: Enabled | Backup Window: 03:00–04:00 UTC | Retention: 7 days
Automatic Failover (Infrastructure Failure)¶
RDS Multi-AZ provides automatic failover to the standby replica:
- Trigger: Primary instance failure, AZ outage, or instance type change
- Duration: 60–120 seconds
- Action required: None — DNS endpoint automatically re-routes
- Verification: Check the `/health` endpoint (includes DB connectivity check). Monitor the `zkprova-rds-free-storage-low` alarm.
Manual Point-in-Time Recovery (Data Corruption / Breach)¶
```
# 1. Identify target recovery time
aws rds describe-db-instances --db-instance-identifier zkprova-db \
  --query 'DBInstances[0].LatestRestorableTime'

# 2. Restore to new instance
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier zkprova-db \
  --target-db-instance-identifier zkprova-db-recovery \
  --restore-time "2026-02-28T02:00:00Z" \
  --db-subnet-group-name zkprova-db-subnet \
  --vpc-security-group-ids <sg-id>

# 3. Verify restored data integrity
psql -h zkprova-db-recovery.<region>.rds.amazonaws.com \
  -U zkprova -d zkprova \
  -c "SELECT COUNT(*) FROM members; SELECT COUNT(*) FROM credentials; SELECT COUNT(*) FROM audit_logs;"

# 4. Swap DNS or update application config to point to the recovered instance
# 5. Delete the original compromised instance after verification
```
Existing automation: Weekly restore test via ./scripts/test-rds-restore.sh — restores latest snapshot to a temporary instance, runs smoke tests, and deletes the temporary instance.
ElastiCache (Redis)¶
Replication Group: zkprova-redis | Engine: Redis 7.0 | Node: cache.t3.micro
Encryption: At-rest and in-transit enabled | Auth: Token required
Recovery Procedure¶
Redis stores only ephemeral data (rate-limit sliding windows, session tokens). On failure:
- ElastiCache automatically replaces failed nodes
- If full replication group failure: Terraform re-applies the ElastiCache module
- Impact of Redis loss:
  - Rate limiting temporarily disabled (falls back to application-level limiting)
  - Active sessions invalidated (members must re-authenticate)
  - No business data loss
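The application-level fallback mentioned above can be as simple as an in-process sliding window. A hedged sketch — class name and limits are illustrative, and unlike the Redis-backed limiter the counts are per pod, so this is a degraded mode rather than a replacement:

```python
import time
from collections import defaultdict, deque

class FallbackRateLimiter:
    """Per-process sliding-window limiter used only while Redis is down.

    Effective global limit is roughly `limit * pod_count`, since each
    pod counts independently -- acceptable only as a degraded mode.
    """

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)  # key -> timestamps inside the window

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the sliding window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True

limiter = FallbackRateLimiter(limit=3, window_seconds=10.0)
assert all(limiter.allow("10.0.0.1", now=t) for t in (0.0, 1.0, 2.0))
assert not limiter.allow("10.0.0.1", now=3.0)   # over the limit
assert limiter.allow("10.0.0.1", now=11.0)      # earliest hits expired
```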
EKS (Application Tier)¶
Cluster: zkprova | Version: 1.29 | Nodes: t3.medium, min 1 / max 4
Self-Healing Mechanisms¶
| Mechanism | Configuration | Effect |
|---|---|---|
| Liveness probes | HTTP check on `/health` | Restarts unresponsive pods |
| Readiness probes | HTTP check on `/health` (includes DB check) | Removes unhealthy pods from service |
| HPA (backend) | min 3, max 20, target 60% CPU | Scales capacity with load |
| HPA (frontend) | min 3, max 10 | Scales frontend independently |
| PDB (backend) | minAvailable: 2 | Prevents disruption below 2 healthy pods |
| PDB (frontend) | minAvailable: 2 | Prevents disruption below 2 healthy pods |
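The liveness/readiness split works because the probes answer different questions: liveness only proves the process responds, while readiness also requires DB connectivity, so a pod with a lost DB connection is drained from the Service rather than restarted. A framework-agnostic handler sketch (function names are illustrative, not the actual ZKProva handlers):

```python
def liveness():
    # Process is up and serving HTTP; failing this check restarts the pod.
    return 200, {"status": "ok"}

def readiness(db_ping):
    """db_ping is any callable that raises if the database is unreachable."""
    try:
        db_ping()
    except Exception:
        # 503 removes the pod from Service endpoints without a restart,
        # so a DB outage does not trigger a pointless crash loop.
        return 503, {"status": "degraded", "db": "disconnected"}
    return 200, {"status": "ok", "db": "connected"}

# With a healthy database the pod stays in rotation...
code, body = readiness(lambda: None)
assert (code, body["db"]) == (200, "connected")

# ...and during a DB outage it is drained but not killed.
def failing_ping():
    raise ConnectionError("db unreachable")

code, body = readiness(failing_ping)
assert code == 503
```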
Pod Recovery¶
Pods are stateless. Recovery is automatic via Kubernetes scheduler:
- Single pod failure: Replaced within seconds by ReplicaSet controller
- Node failure: Pods rescheduled to healthy nodes (EKS auto-scaling group provisions replacement node)
- Full cluster failure: Re-provision via Terraform (`module.eks`), redeploy via Helm
Application Rollback¶
```
# List release history
helm history zkprova -n default

# Roll back to the previous release
helm rollback zkprova <revision> -n default

# Verify rollback
kubectl rollout status deployment/zkprova-backend
kubectl rollout status deployment/zkprova-frontend
./scripts/smoke-test.sh
```
WAF Emergency Controls¶
The CloudFront WAF includes a `block_all_traffic` toggle for emergency maintenance mode:

```
# In terraform/modules/waf/main.tf
# Set block_all_traffic = true to block ALL incoming requests
# Use only during active P0 incidents requiring full service isolation
```

Apply the change with `terraform plan` followed by `terraform apply`; revert by setting the toggle back to `false` and re-applying.
External Dependency Map¶
| Dependency | Purpose | Failure Impact | Mitigation |
|---|---|---|---|
| AWS RDS | Primary database | Complete service outage | Multi-AZ failover (60–120s), PITR backups |
| AWS ElastiCache | Rate limiting, session cache | Degraded rate limiting, session loss | Ephemeral data only; auto-replacement |
| AWS Secrets Manager | JWT key, AES key, Ed25519 key, DB password | Pods cannot start without secrets | Secrets cached via @lru_cache at startup; running pods unaffected by brief outages |
| AWS ACM | TLS certificates (`*.zkprova.com`) | TLS termination failure | Auto-renewal; long validity period |
| AWS WAF | Request filtering, rate limiting, geo-blocking | Reduced security posture | Application-level rate limiting as fallback |
| AWS SES | Email delivery (verification, notifications) | Email features unavailable | zkprova-prod-email-failures alarm (>3 failures/10min). Non-critical path — core ZKP operations unaffected. |
| GitHub | Source control, CI/CD | Cannot deploy new versions | Running production unaffected. Manual deploy possible via local kubectl. |
| GitHub Container Registry | Docker image storage | Cannot pull new images | Existing images cached on EKS nodes. Running pods unaffected. |
| Expo | Mobile app OTA updates | Cannot push mobile updates | Mobile app functions offline with cached credentials. |
| NCUA API | Credit union charter validation | Cannot validate new issuers | Fail-closed: new issuer registration blocked. Existing issuers unaffected. Grace period for transient failures. |
| snarkjs | ZKP proof generation/verification | Proof operations fail | Bundled as npm dependency — no external call. Failure = application bug, not dependency outage. |
| Ed25519 keys | Credential signing (DID:key) | Cannot issue new credentials | Keys in Secrets Manager. Existing credentials remain valid. See Key Rotation Procedures. |
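The NCUA fail-closed behavior with a grace period for transient failures can be sketched as follows. The grace window, class, and function names are assumptions for illustration, not the actual implementation:

```python
import time

GRACE_SECONDS = 300  # assumed transient-failure grace window

class CharterValidator:
    """Fail-closed issuer validation with a grace period.

    A charter confirmed within the grace window is still accepted while
    the NCUA API is briefly unreachable; anything unconfirmed is rejected.
    """

    def __init__(self, ncua_lookup, grace_seconds=GRACE_SECONDS):
        self._lookup = ncua_lookup  # callable: charter_id -> bool, may raise
        self._grace = grace_seconds
        self._last_ok = {}          # charter_id -> last successful check time

    def is_valid(self, charter_id, now=None):
        now = time.time() if now is None else now
        try:
            ok = self._lookup(charter_id)
        except Exception:
            # Transient API failure: honor a recent successful check,
            # otherwise fail closed and block the registration.
            last = self._last_ok.get(charter_id)
            return last is not None and now - last < self._grace
        if ok:
            self._last_ok[charter_id] = now
        return ok

def flaky_ncua(charter_id):
    raise TimeoutError("NCUA API unreachable")

v = CharterValidator(lambda cid: cid == "12345")
assert v.is_valid("12345", now=1000.0)          # confirmed by NCUA
v._lookup = flaky_ncua                           # simulate an API outage
assert v.is_valid("12345", now=1100.0)          # within grace window
assert not v.is_valid("12345", now=2000.0)      # grace expired: fail closed
assert not v.is_valid("99999", now=1100.0)      # never confirmed: fail closed
```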
Single Points of Failure¶
| Component | Risk | Mitigation Status |
|---|---|---|
| RDS Primary | AZ-level failure | Mitigated: Multi-AZ enabled |
| EKS Control Plane | AWS-managed HA | Mitigated: AWS manages multi-AZ control plane |
| Secrets Manager | Regional failure | Partial: No cross-region replication configured |
| Ed25519 Signing Key | Key compromise | Partial: No HSM (Gap #6, target Q3 2026). Key in Secrets Manager with IAM access control. |
Failover Procedures¶
Scenario: RDS Primary Failure¶
- Automatic: RDS promotes standby to primary (60–120 seconds)
- Verify: `/health` endpoint returns 200 with `db: connected`
- Monitor: Check the `zkprova-rds-free-storage-low` alarm. Verify replication lag is zero.
- No action required unless failover does not complete within 5 minutes; in that case, engage AWS Support.
Scenario: Complete AZ Failure¶
- Automatic: EKS reschedules pods to nodes in surviving AZ
- Automatic: RDS fails over to standby in surviving AZ
- Verify: Run `kubectl get pods -o wide` and confirm pods are running in the healthy AZ
- Monitor: HPA may scale up to compensate for reduced node capacity
Scenario: EKS Node Exhaustion¶
- Automatic: Cluster Autoscaler provisions new nodes (up to max 4)
- If max reached: Evaluate increasing `max_size` in the Terraform `module.eks`
- Emergency: Manually scale the node group via the AWS Console or `aws eks update-nodegroup-config`
Scenario: Secrets Manager Outage¶
- Impact: Running pods are unaffected (secrets cached via `@lru_cache`)
- Risk: New pods cannot start; `kubectl rollout restart` will fail
- Mitigation: Do not restart pods during a Secrets Manager outage
- Recovery: Once Secrets Manager recovers, verify pods can fetch secrets by triggering a controlled restart of a single pod
Testing Cadence¶
| Test | Frequency | Procedure | Last Tested | Next Test |
|---|---|---|---|---|
| RDS snapshot restore | Weekly (automated) | `./scripts/test-rds-restore.sh`: restores latest snapshot to a temp instance, runs smoke tests, deletes the temp instance | Ongoing | Continuous |
| Helm rollback | Per deployment | Post-deploy verification includes rollback readiness check | Per deploy | Per deploy |
| Full DR drill | Quarterly | Simulate RDS failure + full application recovery. Measure actual RTO/RPO against targets. Document gaps. | Not yet conducted | Q2 2026 |
| BCP tabletop exercise | Annually | Walk through complete AZ failure scenario with all stakeholders. Review dependency map and failover procedures. | Not yet conducted | Q3 2026 |
| Secrets rotation | Per rotation event | Verify application recovers after `kubectl rollout restart` following a secret update | Per rotation | Per rotation |
DR Drill Procedure (Quarterly)¶
- Pre-drill: Notify team. Ensure staging environment mirrors production config.
- Execute on staging (`api-staging.zkprova.com`):
  - Terminate the RDS primary (simulate failure)
  - Verify automatic Multi-AZ failover
  - Measure actual failover duration
  - Run smoke tests against the recovered environment
  - Perform PITR to verify backup integrity
- Document: Actual RTO/RPO achieved, any deviations from expected behavior
- Remediate: Create GitHub Issues for any gaps discovered
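The "Document: Actual RTO/RPO achieved" step reduces to timestamp arithmetic. A sketch for the drill report, with targets taken from the Recovery Objectives table and illustrative drill timestamps:

```python
from datetime import datetime, timedelta

# Drill timestamps (illustrative values, UTC).
failure_injected  = datetime(2026, 4, 15, 14, 0, 0)
service_recovered = datetime(2026, 4, 15, 14, 1, 45)
last_good_write   = datetime(2026, 4, 15, 13, 58, 30)

# RTO: how long the service was down; RPO: how much data was lost.
measured_rto = service_recovered - failure_injected
measured_rpo = failure_injected - last_good_write

# Targets for RDS from the Recovery Objectives table.
rto_target = timedelta(minutes=30)
rpo_target = timedelta(hours=24)

assert measured_rto <= rto_target   # 105s, well inside the 30-minute target
assert measured_rpo <= rpo_target
print("RTO %.0fs / RPO %.0fs" % (measured_rto.total_seconds(),
                                 measured_rpo.total_seconds()))
```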
Production Configuration Reference¶
Helm Production Values (values-prod.yaml)¶
| Parameter | Value | Purpose |
|---|---|---|
| Backend replicas | 3 | Base capacity for production load |
| Backend HPA min/max | 3 / 20 | Auto-scaling range |
| Backend HPA CPU target | 60% | Scale-up trigger |
| Backend PDB minAvailable | 2 | Disruption budget — always at least 2 pods serving |
| Frontend replicas | 3 | Base capacity |
| Frontend HPA min/max | 3 / 10 | Auto-scaling range |
| Frontend PDB minAvailable | 2 | Disruption budget |
| DB pool size | 50 | Connection pool per pod |
| DB max overflow | 30 | Burst connections per pod |
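The pool settings above are per pod, so worst-case demand on PostgreSQL scales with the HPA maximum. A quick check worth comparing against the RDS instance's `max_connections` limit before a drill scales the backend to HPA max (numbers taken from the tables in this section):

```python
# Per-pod connection budget from values-prod.yaml.
pool_size = 50
max_overflow = 30
per_pod = pool_size + max_overflow

# Backend HPA range from the same file.
hpa_min, hpa_max = 3, 20

baseline = hpa_min * per_pod    # steady state, every pool saturated
worst_case = hpa_max * per_pod  # fully scaled out, every pool saturated

# baseline=240, worst_case=1600 connections against a single RDS instance.
print("baseline=%d, worst_case=%d" % (baseline, worst_case))
```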
Terraform Infrastructure¶
| Resource | Configuration | Module |
|---|---|---|
| RDS instance | `zkprova-db`, PostgreSQL 16, gp3 encrypted, Multi-AZ, 7-day backup retention | `module.rds` |
| ElastiCache | `zkprova-redis`, Redis 7.0, cache.t3.micro, encryption at rest + transit, auth token | `module.elasticache` |
| EKS cluster | `zkprova`, v1.29, t3.medium nodes (1–4), IRSA enabled | `module.eks` |
| VPC | Private subnets (10.0.1.0/24, 10.0.2.0/24), public subnets (10.0.101.0/24, 10.0.102.0/24), NAT gateway | module.vpc |
| WAF (Regional) | Common rules, SQLi, bad inputs, IP reputation, 2000 req/5min rate limit, geo-blocking | module.waf |
| WAF (CloudFront) | Common rules, bad inputs, IP reputation, rate limiting, `block_all_traffic` toggle | `module.waf` |
| DNS | Route53 for `zkprova.com` | `module.dns` |
| Monitoring | 4 security alarms + 7 operational alarms, 2 SNS topics, security dashboard | module.monitoring + module.observability |
Document Control¶
| Version | Date | Author | Description |
|---|---|---|---|
| 1.0 | 2026-02-28 | ZKProva Engineering | Initial business continuity plan |
This document satisfies SOC 2 Trust Service Criteria A1.3 (Recovery from Disruptions). It is reviewed quarterly and updated after each DR drill or significant infrastructure change.