
Business Continuity Plan

Organization: ZKProva
System: ZKP-Powered Portable Credit Union Identity
SOC 2 Criteria: A1.3 (Recovery from Disruptions)
Document Version: 1.0
Effective Date: 2026-02-28
Classification: Confidential
Review Cadence: Quarterly (next review: 2026-05-31)


Table of Contents

  1. Recovery Objectives
  2. System Component Recovery
  3. External Dependency Map
  4. Failover Procedures
  5. Testing Cadence
  6. Production Configuration Reference
  7. Document Control

Recovery Objectives

| Component | RPO (Data Loss Tolerance) | RTO (Recovery Time) | Justification |
| --- | --- | --- | --- |
| PostgreSQL (RDS) | 24 hours | 30 minutes | Automated daily backups at 03:00–04:00 UTC with 7-day retention. Point-in-time recovery available within the backup window. Multi-AZ failover for infrastructure failures. |
| Redis (ElastiCache) | 0 (ephemeral) | 5 minutes | Redis stores only rate-limit counters and session cache. Data loss is non-catastrophic: counters reset and sessions require re-authentication. Rebuilt from scratch on failure. |
| EKS (Application Tier) | 0 (stateless) | 15 minutes | Application pods are stateless; all state lives in RDS and Redis. Recovery is new pod scheduling via Kubernetes. PDB ensures minimum availability during disruptions. |
| AWS Secrets Manager | 0 (managed) | 10 minutes | AWS-managed durability. Secrets cached at application startup via @lru_cache. Recovery requires kubectl rollout restart to refresh the cache if secrets are rotated. |
| TLS Certificates (ACM) | 0 (managed) | 0 (auto-renewal) | AWS Certificate Manager handles renewal automatically. Certificate covers *.zkprova.com with DNS validation. |

Recovery Priority Order

In a multi-component failure, recover in this order:

  1. RDS PostgreSQL — All business data, credentials, and audit logs depend on it
  2. AWS Secrets Manager — Required for application startup (JWT key, encryption key, Ed25519 key)
  3. EKS Application Pods — Require DB and secrets to be available
  4. ElastiCache Redis — Rate limiting and session cache; non-critical for core functionality
  5. Frontend — Can operate in degraded mode while backend recovers

System Component Recovery

PostgreSQL (RDS)

Instance: zkprova-db | Engine: PostgreSQL 16 | Storage: gp3, encrypted | Multi-AZ: Enabled | Backup Window: 03:00–04:00 UTC | Retention: 7 days

Automatic Failover (Infrastructure Failure)

RDS Multi-AZ provides automatic failover to the standby replica:

  • Trigger: Primary instance failure, AZ outage, or instance type change
  • Duration: 60–120 seconds
  • Action required: None — DNS endpoint automatically re-routes
  • Verification: Check /health endpoint (includes DB connectivity check). Monitor zkprova-rds-free-storage-low alarm.
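The verification step above can be scripted so that recovery confirmation does not depend on an engineer refreshing a browser. A minimal sketch, assuming bash; the `wait_healthy` helper and the `api.zkprova.com` URL in the comment are illustrative, not existing tooling:

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll a health command until it succeeds or a timeout expires.
# Usage: wait_healthy <timeout_seconds> <command...>
wait_healthy() {
  local timeout=$1; shift
  local start=$SECONDS
  until "$@"; do
    if (( SECONDS - start >= timeout )); then
      echo "health check did not pass within ${timeout}s" >&2
      return 1
    fi
    sleep 1
  done
}

# During an RDS failover this might be invoked as:
#   wait_healthy 300 curl -fsS https://api.zkprova.com/health
```

Polling with a hard timeout also yields a concrete failover-duration measurement for comparison against the 60–120 second expectation.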

Manual Point-in-Time Recovery (Data Corruption / Breach)

# 1. Identify target recovery time
aws rds describe-db-instances --db-instance-identifier zkprova-db \
  --query 'DBInstances[0].LatestRestorableTime'

# 2. Restore to new instance
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier zkprova-db \
  --target-db-instance-identifier zkprova-db-recovery \
  --restore-time "2026-02-28T02:00:00Z" \
  --db-subnet-group-name zkprova-db-subnet \
  --vpc-security-group-ids <sg-id>

# 3. Verify restored data integrity
psql -h zkprova-db-recovery.<region>.rds.amazonaws.com \
  -U zkprova -d zkprova \
  -c "SELECT COUNT(*) FROM members; SELECT COUNT(*) FROM credentials; SELECT COUNT(*) FROM audit_logs;"

# 4. Swap DNS or update application config to point to recovered instance
# 5. Delete original compromised instance after verification

Existing automation: Weekly restore test via ./scripts/test-rds-restore.sh — restores latest snapshot to a temporary instance, runs smoke tests, and deletes the temporary instance.

ElastiCache (Redis)

Replication Group: zkprova-redis | Engine: Redis 7.0 | Node: cache.t3.micro | Encryption: At-rest and in-transit enabled | Auth: Token required

Recovery Procedure

Redis stores only ephemeral data (rate-limit sliding windows, session tokens). On failure:

  1. ElastiCache automatically replaces failed nodes.
  2. If the full replication group fails, Terraform re-applies the ElastiCache module.
  3. Impact of Redis loss:
     • Rate limiting temporarily disabled (falls back to application-level limiting)
     • Active sessions invalidated (members must re-authenticate)
     • No business data loss

EKS (Application Tier)

Cluster: zkprova | Version: 1.29 | Nodes: t3.medium, min 1 / max 4

Self-Healing Mechanisms

| Mechanism | Configuration | Effect |
| --- | --- | --- |
| Liveness probes | HTTP check on /health | Restarts unresponsive pods |
| Readiness probes | HTTP check on /health (includes DB check) | Removes unhealthy pods from service |
| HPA (backend) | min 3, max 20, target 60% CPU | Scales capacity with load |
| HPA (frontend) | min 3, max 10 | Scales frontend independently |
| PDB (backend) | minAvailable: 2 | Prevents disruption below 2 healthy pods |
| PDB (frontend) | minAvailable: 2 | Prevents disruption below 2 healthy pods |
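These mechanisms map to standard Kubernetes manifest fields. A hedged sketch of what the backend probe and PDB configuration might look like (the port, resource names, and labels are assumptions; the actual Helm templates may differ):

```yaml
# Assumed deployment fragment: liveness/readiness probes on /health
livenessProbe:
  httpGet:
    path: /health
    port: 8000        # assumed container port
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health     # readiness variant includes the DB connectivity check
    port: 8000
  periodSeconds: 10
---
# PodDisruptionBudget keeping at least 2 backend pods during voluntary disruptions
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zkprova-backend-pdb   # assumed name
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: zkprova-backend    # assumed label
```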

Pod Recovery

Pods are stateless. Recovery is automatic via Kubernetes scheduler:

  • Single pod failure: Replaced within seconds by ReplicaSet controller
  • Node failure: Pods rescheduled to healthy nodes (EKS auto-scaling group provisions replacement node)
  • Full cluster failure: Re-provision via Terraform (module.eks), redeploy via Helm

Application Rollback

# List release history
helm history zkprova -n default

# Rollback to previous release
helm rollback zkprova <revision> -n default

# Verify rollback
kubectl rollout status deployment/zkprova-backend
kubectl rollout status deployment/zkprova-frontend
./scripts/smoke-test.sh

WAF Emergency Controls

The CloudFront WAF includes a block_all_traffic toggle for emergency maintenance mode:

# In terraform/modules/waf/main.tf
# Set block_all_traffic = true to block ALL incoming requests
# Use only during active P0 incidents requiring full service isolation

Apply via Terraform:

terraform apply -var="block_all_traffic=true" -target=module.waf
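One plausible wiring for the toggle inside the WAF module is to flip the web ACL's default action. This is a sketch under assumptions: the variable exists per the text above, but the resource name and pattern below are illustrative and may differ from the actual terraform/modules/waf/main.tf:

```hcl
variable "block_all_traffic" {
  type        = bool
  default     = false
  description = "Emergency switch: block all incoming requests at the CloudFront WAF"
}

# Assumed pattern: default_action becomes block when the toggle is set
resource "aws_wafv2_web_acl" "cloudfront" {
  name  = "zkprova-cloudfront-acl" # assumed name
  scope = "CLOUDFRONT"

  default_action {
    dynamic "allow" {
      for_each = var.block_all_traffic ? [] : [1]
      content {}
    }
    dynamic "block" {
      for_each = var.block_all_traffic ? [1] : []
      content {}
    }
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "zkprova-cloudfront-acl"
    sampled_requests_enabled   = true
  }
}
```

Remember to run a second `terraform apply -var="block_all_traffic=false" -target=module.waf` to restore service after the incident.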


External Dependency Map

| Dependency | Purpose | Failure Impact | Mitigation |
| --- | --- | --- | --- |
| AWS RDS | Primary database | Complete service outage | Multi-AZ failover (60–120s), PITR backups |
| AWS ElastiCache | Rate limiting, session cache | Degraded rate limiting, session loss | Ephemeral data only; auto-replacement |
| AWS Secrets Manager | JWT key, AES key, Ed25519 key, DB password | Pods cannot start without secrets | Secrets cached via @lru_cache at startup; running pods unaffected by brief outages |
| AWS ACM | TLS certificates (*.zkprova.com) | TLS termination failure | Auto-renewal; long validity period |
| AWS WAF | Request filtering, rate limiting, geo-blocking | Reduced security posture | Application-level rate limiting as fallback |
| AWS SES | Email delivery (verification, notifications) | Email features unavailable | zkprova-prod-email-failures alarm (>3 failures/10min). Non-critical path: core ZKP operations unaffected. |
| GitHub | Source control, CI/CD | Cannot deploy new versions | Running production unaffected. Manual deploy possible via local kubectl. |
| GitHub Container Registry | Docker image storage | Cannot pull new images | Existing images cached on EKS nodes. Running pods unaffected. |
| Expo | Mobile app OTA updates | Cannot push mobile updates | Mobile app functions offline with cached credentials. |
| NCUA API | Credit union charter validation | Cannot validate new issuers | Fail-closed: new issuer registration blocked. Existing issuers unaffected. Grace period for transient failures. |
| snarkjs | ZKP proof generation/verification | Proof operations fail | Bundled as an npm dependency; no external call. Failure = application bug, not dependency outage. |
| Ed25519 keys | Credential signing (DID:key) | Cannot issue new credentials | Keys in Secrets Manager. Existing credentials remain valid. See Key Rotation Procedures. |

Single Points of Failure

| Component | Risk | Mitigation Status |
| --- | --- | --- |
| RDS Primary | AZ-level failure | Mitigated: Multi-AZ enabled |
| EKS Control Plane | Control plane outage | Mitigated: AWS manages a multi-AZ, highly available control plane |
| Secrets Manager | Regional failure | Partial: No cross-region replication configured |
| Ed25519 Signing Key | Key compromise | Partial: No HSM (Gap #6, target Q3 2026). Key in Secrets Manager with IAM access control. |

Failover Procedures

Scenario: RDS Primary Failure

  1. Automatic: RDS promotes standby to primary (60–120 seconds)
  2. Verify: /health endpoint returns 200 with db: connected
  3. Monitor: Check zkprova-rds-free-storage-low alarm. Verify replication lag is zero.
  4. No action required unless failover does not complete within 5 minutes — then engage AWS Support.

Scenario: Complete AZ Failure

  1. Automatic: EKS reschedules pods to nodes in surviving AZ
  2. Automatic: RDS fails over to standby in surviving AZ
  3. Verify: kubectl get pods -o wide — confirm pods running in healthy AZ
  4. Monitor: HPA may scale up to compensate for reduced node capacity

Scenario: EKS Node Exhaustion

  1. Automatic: Cluster Autoscaler provisions new nodes (up to max 4)
  2. If max reached: Evaluate increasing max_size in Terraform module.eks
  3. Emergency: Manually scale node group via AWS Console or aws eks update-nodegroup-config

Scenario: Secrets Manager Outage

  1. Impact: Running pods are unaffected (secrets cached via @lru_cache)
  2. Risk: New pods cannot start; kubectl rollout restart will fail
  3. Mitigation: Do not restart pods during Secrets Manager outage
  4. Recovery: Once Secrets Manager recovers, verify pods can fetch secrets by triggering a controlled restart of a single pod

Testing Cadence

| Test | Frequency | Procedure | Last Tested | Next Test |
| --- | --- | --- | --- | --- |
| RDS snapshot restore | Weekly (automated) | ./scripts/test-rds-restore.sh: restores latest snapshot to a temp instance, runs smoke tests, deletes the temp instance | Ongoing | Continuous |
| Helm rollback | Per deployment | Post-deploy verification includes rollback readiness check | Per deploy | Per deploy |
| Full DR drill | Quarterly | Simulate RDS failure plus full application recovery. Measure actual RTO/RPO against targets. Document gaps. | Not yet conducted | Q2 2026 |
| BCP tabletop exercise | Annually | Walk through complete AZ failure scenario with all stakeholders. Review dependency map and failover procedures. | Not yet conducted | Q3 2026 |
| Secrets rotation | Per rotation event | Verify application recovers after kubectl rollout restart following secret update | Per rotation | Per rotation |

DR Drill Procedure (Quarterly)

  1. Pre-drill: Notify team. Ensure staging environment mirrors production config.
  2. Execute on staging (api-staging.zkprova.com):
     • Terminate RDS primary (simulate failure)
     • Verify automatic Multi-AZ failover
     • Measure actual failover duration
     • Run smoke tests against recovered environment
     • Perform PITR to verify backup integrity
  3. Document: Actual RTO/RPO achieved, any deviations from expected behavior
  4. Remediate: Create GitHub Issues for any gaps discovered
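The drill's "measure actual failover duration" step reduces to timestamp arithmetic once the first failed and first recovered health checks are recorded. A small sketch; the timestamps are placeholders and GNU date is assumed:

```shell
#!/usr/bin/env bash
# Compute failover duration from two ISO-8601 timestamps (GNU date assumed).
failover_start="2026-02-28T02:00:10Z"   # placeholder: first failed health check
failover_end="2026-02-28T02:01:45Z"     # placeholder: first successful check after failover
duration=$(( $(date -u -d "$failover_end" +%s) - $(date -u -d "$failover_start" +%s) ))
echo "Measured RTO: ${duration}s"
```

Record the computed value in the drill report alongside the 30-minute RTO target for PostgreSQL.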

Production Configuration Reference

Helm Production Values (values-prod.yaml)

| Parameter | Value | Purpose |
| --- | --- | --- |
| Backend replicas | 3 | Base capacity for production load |
| Backend HPA min/max | 3 / 20 | Auto-scaling range |
| Backend HPA CPU target | 60% | Scale-up trigger |
| Backend PDB minAvailable | 2 | Disruption budget: always at least 2 pods serving |
| Frontend replicas | 3 | Base capacity |
| Frontend HPA min/max | 3 / 10 | Auto-scaling range |
| Frontend PDB minAvailable | 2 | Disruption budget |
| DB pool size | 50 | Connection pool per pod |
| DB max overflow | 30 | Burst connections per pod |
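Expressed as Helm values, the table above might correspond to a fragment like the following. The key names are assumptions; the real values-prod.yaml schema may differ:

```yaml
# Assumed values-prod.yaml fragment mirroring the table above
backend:
  replicaCount: 3
  autoscaling:
    minReplicas: 3
    maxReplicas: 20
    targetCPUUtilizationPercentage: 60
  pdb:
    minAvailable: 2
  db:
    poolSize: 50
    maxOverflow: 30
frontend:
  replicaCount: 3
  autoscaling:
    minReplicas: 3
    maxReplicas: 10
  pdb:
    minAvailable: 2
```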

Terraform Infrastructure

| Resource | Configuration | Module |
| --- | --- | --- |
| RDS instance | zkprova-db, PostgreSQL 16, gp3 encrypted, Multi-AZ, 7-day backup retention | module.rds |
| ElastiCache | zkprova-redis, Redis 7.0, cache.t3.micro, encryption at rest + in transit, auth token | module.elasticache |
| EKS cluster | zkprova, v1.29, t3.medium nodes (1–4), IRSA enabled | module.eks |
| VPC | Private subnets (10.0.1.0/24, 10.0.2.0/24), public subnets (10.0.101.0/24, 10.0.102.0/24), NAT gateway | module.vpc |
| WAF (Regional) | Common rules, SQLi, bad inputs, IP reputation, 2000 req/5 min rate limit, geo-blocking | module.waf |
| WAF (CloudFront) | Common rules, bad inputs, IP reputation, rate limiting, block_all_traffic toggle | module.waf |
| DNS | Route53 for zkprova.com | module.dns |
| Monitoring | 4 security alarms + 7 operational alarms, 2 SNS topics, security dashboard | module.monitoring + module.observability |

Document Control

| Version | Date | Author | Description |
| --- | --- | --- | --- |
| 1.0 | 2026-02-28 | ZKProva Engineering | Initial business continuity plan |

This document satisfies SOC 2 Trust Service Criteria A1.3 (Recovery from Disruptions). It is reviewed quarterly and updated after each DR drill or significant infrastructure change.