
Business Continuity Plan

Organization: ZKProva
System: ZKP-Powered Portable Credit Union Identity
SOC 2 Criteria: A1.3 (Recovery from Disruptions)
Document Version: 1.0
Effective Date: 2026-02-28
Classification: Confidential
Review Cadence: Quarterly (next review: 2026-05-31)


Table of Contents

  1. Recovery Objectives
  2. System Component Recovery
  3. External Dependency Map
  4. Failover Procedures
  5. Testing Cadence
  6. Production Configuration Reference
  7. Document Control

Recovery Objectives

| Component | RPO (Data Loss Tolerance) | RTO (Recovery Time) | Justification |
| --- | --- | --- | --- |
| PostgreSQL (RDS) | 24 hours | 30 minutes | Automated daily backups at 03:00–04:00 UTC with 7-day retention. Point-in-time recovery available within the backup window. Multi-AZ failover for infrastructure failures. |
| Redis (ElastiCache) | 0 (ephemeral) | 5 minutes | Redis stores only rate-limit counters and session cache. Data loss is non-catastrophic: counters reset and sessions require re-authentication. Rebuilt from scratch on failure. |
| EKS (Application Tier) | 0 (stateless) | 15 minutes | Application pods are stateless; all state lives in RDS and Redis. Recovery is new pod scheduling via Kubernetes. PDB ensures minimum availability during disruptions. |
| AWS Secrets Manager | 0 (managed) | 10 minutes | AWS-managed durability. Secrets cached at application startup via @lru_cache. Recovery requires kubectl rollout restart to refresh the cache if secrets are rotated. |
| TLS Certificates (ACM) | 0 (managed) | 0 (auto-renewal) | AWS Certificate Manager handles renewal automatically. Certificate covers *.zkprova.com with DNS validation. |

Recovery Priority Order

In a multi-component failure, recover in this order:

  1. RDS PostgreSQL — All business data, credentials, and audit logs depend on it
  2. AWS Secrets Manager — Required for application startup (JWT key, encryption key, Ed25519 key)
  3. EKS Application Pods — Require DB and secrets to be available
  4. ElastiCache Redis — Rate limiting and session cache; non-critical for core functionality
  5. Frontend — Can operate in degraded mode while backend recovers

System Component Recovery

PostgreSQL (RDS)

Instance: zkprova-db | Engine: PostgreSQL 16 | Storage: gp3, encrypted | Multi-AZ: Enabled | Backup Window: 03:00–04:00 UTC | Retention: 7 days

Automatic Failover (Infrastructure Failure)

RDS Multi-AZ provides automatic failover to the standby replica:

  • Trigger: Primary instance failure, AZ outage, or instance type change
  • Duration: 60–120 seconds
  • Action required: None — DNS endpoint automatically re-routes
  • Verification: Check /health endpoint (includes DB connectivity check). Monitor zkprova-rds-free-storage-low alarm.
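The verification step above can be scripted so that recovery confirmation does not depend on an engineer refreshing a browser. A minimal sketch, assuming bash; the `wait_healthy` helper and the `api.zkprova.com` URL in the comment are illustrative, not existing tooling:

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll a health command until it succeeds or a timeout expires.
# Usage: wait_healthy <timeout_seconds> <command...>
wait_healthy() {
  local timeout=$1; shift
  local start=$SECONDS
  until "$@"; do
    if (( SECONDS - start >= timeout )); then
      echo "health check did not pass within ${timeout}s" >&2
      return 1
    fi
    sleep 1
  done
}

# During an RDS failover this might be invoked as:
#   wait_healthy 300 curl -fsS https://api.zkprova.com/health
```

Polling with a hard timeout also yields a concrete failover-duration measurement for comparison against the 60–120 second expectation.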

Manual Point-in-Time Recovery (Data Corruption / Breach)

# 1. Identify target recovery time
aws rds describe-db-instances --db-instance-identifier zkprova-db \
  --query 'DBInstances[0].LatestRestorableTime'

# 2. Restore to new instance
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier zkprova-db \
  --target-db-instance-identifier zkprova-db-recovery \
  --restore-time "2026-02-28T02:00:00Z" \
  --db-subnet-group-name zkprova-db-subnet \
  --vpc-security-group-ids <sg-id>

# 3. Verify restored data integrity
psql -h zkprova-db-recovery.<region>.rds.amazonaws.com \
  -U zkprova -d zkprova \
  -c "SELECT COUNT(*) FROM members; SELECT COUNT(*) FROM credentials; SELECT COUNT(*) FROM audit_logs;"

# 4. Swap DNS or update application config to point to recovered instance
# 5. Delete original compromised instance after verification

Existing automation: Weekly restore test via ./scripts/test-rds-restore.sh — restores latest snapshot to a temporary instance, runs smoke tests, and deletes the temporary instance.

ElastiCache (Redis)

Replication Group: zkprova-redis | Engine: Redis 7.0 | Node: cache.t3.micro | Encryption: At-rest and in-transit enabled | Auth: Token required

Recovery Procedure

Redis stores only ephemeral data (rate-limit sliding windows, session tokens). On failure:

  1. ElastiCache automatically replaces failed nodes.
  2. If the full replication group fails, Terraform re-applies the ElastiCache module.
  3. Impact of Redis loss:
     • Rate limiting temporarily disabled (falls back to application-level limiting)
     • Active sessions invalidated (members must re-authenticate)
     • No business data loss

EKS (Application Tier)

Cluster: zkprova | Version: 1.29 | Nodes: t3.medium, min 1 / max 4

Self-Healing Mechanisms

| Mechanism | Configuration | Effect |
| --- | --- | --- |
| Liveness probes | HTTP check on /health | Restarts unresponsive pods |
| Readiness probes | HTTP check on /health (includes DB check) | Removes unhealthy pods from service |
| HPA (backend) | min 3, max 20, target 60% CPU | Scales capacity with load |
| HPA (frontend) | min 3, max 10 | Scales frontend independently |
| PDB (backend) | minAvailable: 2 | Prevents disruption below 2 healthy pods |
| PDB (frontend) | minAvailable: 2 | Prevents disruption below 2 healthy pods |
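These mechanisms map to standard Kubernetes manifest fields. A hedged sketch of what the backend probe and PDB configuration might look like (the port, resource names, and labels are assumptions; the actual Helm templates may differ):

```yaml
# Assumed deployment fragment: liveness/readiness probes on /health
livenessProbe:
  httpGet:
    path: /health
    port: 8000        # assumed container port
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health     # readiness variant includes the DB connectivity check
    port: 8000
  periodSeconds: 10
---
# PodDisruptionBudget keeping at least 2 backend pods during voluntary disruptions
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zkprova-backend-pdb   # assumed name
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: zkprova-backend    # assumed label
```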

Pod Recovery

Pods are stateless. Recovery is automatic via Kubernetes scheduler:

  • Single pod failure: Replaced within seconds by ReplicaSet controller
  • Node failure: Pods rescheduled to healthy nodes (EKS auto-scaling group provisions replacement node)
  • Full cluster failure: Re-provision via Terraform (module.eks), redeploy via Helm

Application Rollback

# List release history
helm history zkprova -n default

# Rollback to previous release
helm rollback zkprova <revision> -n default

# Verify rollback
kubectl rollout status deployment/zkprova-backend
kubectl rollout status deployment/zkprova-frontend
./scripts/smoke-test.sh

WAF Emergency Controls

The CloudFront WAF includes a block_all_traffic toggle for emergency maintenance mode:

# In terraform/modules/waf/main.tf
# Set block_all_traffic = true to block ALL incoming requests
# Use only during active P0 incidents requiring full service isolation

Apply via Terraform:

terraform apply -var="block_all_traffic=true" -target=module.waf
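One plausible wiring for the toggle inside the WAF module is to flip the web ACL's default action. This is a sketch under assumptions: the variable exists per the text above, but the resource name and pattern below are illustrative and may differ from the actual terraform/modules/waf/main.tf:

```hcl
variable "block_all_traffic" {
  type        = bool
  default     = false
  description = "Emergency switch: block all incoming requests at the CloudFront WAF"
}

# Assumed pattern: default_action becomes block when the toggle is set
resource "aws_wafv2_web_acl" "cloudfront" {
  name  = "zkprova-cloudfront-acl" # assumed name
  scope = "CLOUDFRONT"

  default_action {
    dynamic "allow" {
      for_each = var.block_all_traffic ? [] : [1]
      content {}
    }
    dynamic "block" {
      for_each = var.block_all_traffic ? [1] : []
      content {}
    }
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "zkprova-cloudfront-acl"
    sampled_requests_enabled   = true
  }
}
```

Remember to run a second `terraform apply -var="block_all_traffic=false" -target=module.waf` to restore service after the incident.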


External Dependency Map

| Dependency | Purpose | Failure Impact | Mitigation |
| --- | --- | --- | --- |
| AWS RDS | Primary database | Complete service outage | Multi-AZ failover (60–120s), PITR backups |
| AWS ElastiCache | Rate limiting, session cache | Degraded rate limiting, session loss | Ephemeral data only; auto-replacement |
| AWS Secrets Manager | JWT key, AES key, Ed25519 key, DB password | Pods cannot start without secrets | Secrets cached via @lru_cache at startup; running pods unaffected by brief outages |
| AWS ACM | TLS certificates (*.zkprova.com) | TLS termination failure | Auto-renewal; long validity period |
| AWS WAF | Request filtering, rate limiting, geo-blocking | Reduced security posture | Application-level rate limiting as fallback |
| AWS SES | Email delivery (verification, notifications) | Email features unavailable | zkprova-prod-email-failures alarm (>3 failures/10min). Non-critical path: core ZKP operations unaffected. |
| GitHub | Source control, CI/CD | Cannot deploy new versions | Running production unaffected. Manual deploy possible via local kubectl. |
| GitHub Container Registry | Docker image storage | Cannot pull new images | Existing images cached on EKS nodes. Running pods unaffected. |
| Expo | Mobile app OTA updates | Cannot push mobile updates | Mobile app functions offline with cached credentials. |
| NCUA API | Credit union charter validation | Cannot validate new issuers | Fail-closed: new issuer registration blocked. Existing issuers unaffected. Grace period for transient failures. |
| snarkjs | ZKP proof generation/verification | Proof operations fail | Bundled as an npm dependency; no external call. Failure = application bug, not dependency outage. |
| Ed25519 keys | Credential signing (DID:key) | Cannot issue new credentials | Keys in Secrets Manager. Existing credentials remain valid. See Key Rotation Procedures. |

Single Points of Failure

| Component | Risk | Mitigation Status |
| --- | --- | --- |
| RDS Primary | AZ-level failure | Mitigated: Multi-AZ enabled |
| EKS Control Plane | Control plane outage | Mitigated: AWS manages a multi-AZ, highly available control plane |
| Secrets Manager | Regional failure | Partial: No cross-region replication configured |
| Ed25519 Signing Key | Key compromise | Partial: No HSM (Gap #6, target Q3 2026). Key in Secrets Manager with IAM access control. |

Failover Procedures

Scenario: RDS Primary Failure

  1. Automatic: RDS promotes standby to primary (60–120 seconds)
  2. Verify: /health endpoint returns 200 with db: connected
  3. Monitor: Check zkprova-rds-free-storage-low alarm. Verify replication lag is zero.
  4. No action required unless failover does not complete within 5 minutes — then engage AWS Support.

Scenario: Complete AZ Failure

  1. Automatic: EKS reschedules pods to nodes in surviving AZ
  2. Automatic: RDS fails over to standby in surviving AZ
  3. Verify: kubectl get pods -o wide — confirm pods running in healthy AZ
  4. Monitor: HPA may scale up to compensate for reduced node capacity

Scenario: EKS Node Exhaustion

  1. Automatic: Cluster Autoscaler provisions new nodes (up to max 4)
  2. If max reached: Evaluate increasing max_size in Terraform module.eks
  3. Emergency: Manually scale node group via AWS Console or aws eks update-nodegroup-config

Scenario: Secrets Manager Outage

  1. Impact: Running pods are unaffected (secrets cached via @lru_cache)
  2. Risk: New pods cannot start; kubectl rollout restart will fail
  3. Mitigation: Do not restart pods during Secrets Manager outage
  4. Recovery: Once Secrets Manager recovers, verify pods can fetch secrets by triggering a controlled restart of a single pod

Testing Cadence

| Test | Frequency | Procedure | Last Tested | Next Test |
| --- | --- | --- | --- | --- |
| RDS snapshot restore | Weekly (automated) | ./scripts/test-rds-restore.sh: restores latest snapshot to a temp instance, runs smoke tests, deletes the temp instance | Ongoing | Continuous |
| Helm rollback | Per deployment | Post-deploy verification includes rollback readiness check | Per deploy | Per deploy |
| Full DR drill | Quarterly | Simulate RDS failure plus full application recovery. Measure actual RTO/RPO against targets. Document gaps. | Not yet conducted | Q2 2026 |
| BCP tabletop exercise | Annually | Walk through complete AZ failure scenario with all stakeholders. Review dependency map and failover procedures. | Not yet conducted | Q3 2026 |
| Secrets rotation | Per rotation event | Verify application recovers after kubectl rollout restart following secret update | Per rotation | Per rotation |

DR Drill Procedure (Quarterly)

  1. Pre-drill: Notify team. Ensure staging environment mirrors production config.
  2. Execute on staging (api-staging.zkprova.com):
     • Terminate RDS primary (simulate failure)
     • Verify automatic Multi-AZ failover
     • Measure actual failover duration
     • Run smoke tests against recovered environment
     • Perform PITR to verify backup integrity
  3. Document: Actual RTO/RPO achieved, any deviations from expected behavior
  4. Remediate: Create GitHub Issues for any gaps discovered
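The drill's "measure actual failover duration" step reduces to timestamp arithmetic once the first failed and first recovered health checks are recorded. A small sketch; the timestamps are placeholders and GNU date is assumed:

```shell
#!/usr/bin/env bash
# Compute failover duration from two ISO-8601 timestamps (GNU date assumed).
failover_start="2026-02-28T02:00:10Z"   # placeholder: first failed health check
failover_end="2026-02-28T02:01:45Z"     # placeholder: first successful check after failover
duration=$(( $(date -u -d "$failover_end" +%s) - $(date -u -d "$failover_start" +%s) ))
echo "Measured RTO: ${duration}s"
```

Record the computed value in the drill report alongside the 30-minute RTO target for PostgreSQL.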

Production Configuration Reference

Helm Production Values (values-prod.yaml)

| Parameter | Value | Purpose |
| --- | --- | --- |
| Backend replicas | 3 | Base capacity for production load |
| Backend HPA min/max | 3 / 20 | Auto-scaling range |
| Backend HPA CPU target | 60% | Scale-up trigger |
| Backend PDB minAvailable | 2 | Disruption budget: always at least 2 pods serving |
| Frontend replicas | 3 | Base capacity |
| Frontend HPA min/max | 3 / 10 | Auto-scaling range |
| Frontend PDB minAvailable | 2 | Disruption budget |
| DB pool size | 50 | Connection pool per pod |
| DB max overflow | 30 | Burst connections per pod |
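Expressed as Helm values, the table above might correspond to a fragment like the following. The key names are assumptions; the real values-prod.yaml schema may differ:

```yaml
# Assumed values-prod.yaml fragment mirroring the table above
backend:
  replicaCount: 3
  autoscaling:
    minReplicas: 3
    maxReplicas: 20
    targetCPUUtilizationPercentage: 60
  pdb:
    minAvailable: 2
  db:
    poolSize: 50
    maxOverflow: 30
frontend:
  replicaCount: 3
  autoscaling:
    minReplicas: 3
    maxReplicas: 10
  pdb:
    minAvailable: 2
```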

Terraform Infrastructure

| Resource | Configuration | Module |
| --- | --- | --- |
| RDS instance | zkprova-db, PostgreSQL 16, gp3 encrypted, Multi-AZ, 7-day backup retention | module.rds |
| ElastiCache | zkprova-redis, Redis 7.0, cache.t3.micro, encryption at rest + in transit, auth token | module.elasticache |
| EKS cluster | zkprova, v1.29, t3.medium nodes (1–4), IRSA enabled | module.eks |
| VPC | Private subnets (10.0.1.0/24, 10.0.2.0/24), public subnets (10.0.101.0/24, 10.0.102.0/24), NAT gateway | module.vpc |
| WAF (Regional) | Common rules, SQLi, bad inputs, IP reputation, 2000 req/5 min rate limit, geo-blocking | module.waf |
| WAF (CloudFront) | Common rules, bad inputs, IP reputation, rate limiting, block_all_traffic toggle | module.waf |
| DNS | Route53 for zkprova.com | module.dns |
| Monitoring | 4 security alarms + 7 operational alarms, 2 SNS topics, security dashboard | module.monitoring + module.observability |

Document Control

| Version | Date | Author | Description |
| --- | --- | --- | --- |
| 1.0 | 2026-02-28 | ZKProva Engineering | Initial business continuity plan |

This document satisfies SOC 2 Trust Service Criteria A1.3 (Recovery from Disruptions). It is reviewed quarterly and updated after each DR drill or significant infrastructure change.