Incident Response Plan¶
Organization: ZKProva
System: ZKP-Powered Portable Credit Union Identity
SOC 2 Criteria: CC7.2 (Incident Management), CC7.4 (Incident Communication)
Document Version: 1.0
Effective Date: 2026-02-28
Classification: Confidential
Review Cadence: Quarterly (next review: 2026-05-31)
Table of Contents¶
- Severity Classification
- Roles and Responsibilities
- Escalation Path
- Detection Sources
- Response Procedures
- Communication Templates
- Tabletop Scenarios
- Post-Incident Review
- Document Control
Severity Classification¶
| Level | Name | Definition | Response Time SLA | Examples |
|---|---|---|---|---|
| P0 | Critical | Confirmed data breach, complete service outage, or cryptographic key compromise | 15 minutes | Credential DB exfiltration, Ed25519 signing key leak, ZKP proof forgery, total API unavailability |
| P1 | High | Significant service degradation, potential security event, or partial outage | 1 hour | Sustained 5xx error rate >5%, WAF blocked request spike (>100/5min), authentication failure anomaly, RDS failover triggered |
| P2 | Medium | Limited impact, no data exposure, isolated component failure | 4 hours | Single endpoint returning errors, ElastiCache connection issues, elevated proof generation latency (p99 >5s), email delivery failures |
| P3 | Low | Cosmetic issues, minor anomalies, no user impact | Next business day | Non-critical log anomalies, UI rendering issues, Dependabot advisory on low-severity CVE |
Severity Determination Criteria¶
When classifying an incident, consider:
- Data exposure — Is member PII, credential data, or key material at risk?
- Service availability — Is the API serving requests? Are proofs verifiable?
- Blast radius — How many members/lenders are affected?
- Recoverability — Can the impact be reversed (e.g., key rotation, credential revocation)?
If uncertain, classify one level higher and downgrade after triage.
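The four criteria above can be captured in a minimal triage helper. This is a sketch only — the `Impact` fields and `classify` function are hypothetical names paraphrasing the criteria, not part of the ZKProva codebase — but it makes the "classify one level higher when uncertain" rule mechanical:

```python
from dataclasses import dataclass

@dataclass
class Impact:
    data_exposure: bool   # member PII, credential data, or key material at risk
    full_outage: bool     # API not serving requests at all
    degraded: bool        # partial outage or significant degradation
    user_facing: bool     # any member/lender impact at all

def classify(impact: Impact, uncertain: bool = False) -> str:
    """Map triage answers to a P-level; bump one level more severe when uncertain."""
    if impact.data_exposure or impact.full_outage:
        level = 0
    elif impact.degraded:
        level = 1
    elif impact.user_facing:
        level = 2
    else:
        level = 3
    if uncertain:
        level = max(level - 1, 0)  # "classify one level higher" = more severe
    return ["P0", "P1", "P2", "P3"][level]
```

For example, an isolated component failure with no user impact triages as P3, but the same signal with unresolved questions about data exposure would be held at P2 until triage completes.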
Roles and Responsibilities¶
Incident Commander (IC)¶
- Owns the incident lifecycle from detection to post-mortem closure
- Makes severity classification and escalation decisions
- Coordinates across engineering, communications, and leadership
- Ensures all actions are documented in the incident timeline
- Primary: CTO / Engineering Lead
- Backup: Senior Engineer on rotation
Engineering Lead¶
- Performs technical investigation using correlation IDs and structured logs
- Implements containment measures (API key revocation, rate limit adjustment, pod isolation)
- Executes recovery procedures (database restore, key rotation, Helm rollback)
- Produces root cause analysis for post-mortem
- Primary: On-call engineer
- Backup: Any backend engineer with production access
Communications Lead¶
- Drafts and sends internal status updates to #incidents channel
- Prepares external notifications for affected credit unions and lenders
- Updates status page with incident timeline and resolution ETA
- Manages regulatory notification obligations (72-hour window for personal data incidents)
- Primary: CEO / Head of Operations
- Backup: CTO
Escalation Path¶
Detection Source
|
v
CloudWatch Alarm / Sentry Alert / WAF Log / Audit Log Anomaly
|
v
PagerDuty (on-call engineer paged)
|
v
On-call engineer acknowledges (15 min SLA for P0)
|
v
Incident Commander assigned, severity confirmed
|
v
Slack #incidents channel created/updated
|
v
[If P0/P1] Leadership notified within 30 minutes
|
v
[If data breach] External notification within 72 hours
|
v
StatusPage updated (public-facing status)
Escalation Timing¶
| Action | P0 | P1 | P2 | P3 |
|---|---|---|---|---|
| On-call acknowledgment | 15 min | 30 min | 2 hours | Next business day |
| IC assigned | 15 min | 1 hour | 4 hours | N/A |
| Leadership notification | 30 min | 2 hours | Daily standup | N/A |
| External communication | 1 hour | 4 hours | As needed | N/A |
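The escalation timing table translates directly into absolute deadlines once an incident is detected. A sketch (the `deadlines` helper is hypothetical; P3 is omitted because its actions are next-business-day or N/A):

```python
from datetime import datetime, timedelta

# SLA minutes transcribed from the escalation timing table (None = N/A).
ESCALATION_SLA_MIN = {
    "P0": {"ack": 15,  "ic": 15,  "leadership": 30,   "external": 60},
    "P1": {"ack": 30,  "ic": 60,  "leadership": 120,  "external": 240},
    "P2": {"ack": 120, "ic": 240, "leadership": None, "external": None},
}

def deadlines(severity: str, detected_at: datetime) -> dict:
    """Compute the absolute deadline for each escalation action."""
    return {
        action: detected_at + timedelta(minutes=m) if m is not None else None
        for action, m in ESCALATION_SLA_MIN[severity].items()
    }
```

For a P0 detected at 12:00 UTC, leadership notification is due by 12:30 and external communication by 13:00.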
Detection Sources¶
CloudWatch Security Alarms¶
These alarms fire to the `zkprova-security-alerts-prod` SNS topic:
| Alarm | Metric | Threshold | Implied Severity |
|---|---|---|---|
| `zkprova-prod-auth-failure-spike` | `zkprova_auth_operations_total{result=failure}` anomaly detection (2 std dev) | Anomaly band breach | P1 |
| `zkprova-prod-auth-failure-absolute` | `zkprova_auth_operations_total{result=failure}` | >50 failures / 5 min | P1 |
| `zkprova-prod-security-high-5xx-rate` | 5xx / total request % | >1% over 5 min | P1 |
| `zkprova-prod-waf-blocked-spike` | `AWS/WAFV2 BlockedRequests` | >100 requests / 5 min | P1 (investigate for P0) |
CloudWatch Operational Alarms¶
These alarms fire to the `zkprova-alerts-prod` SNS topic:
| Alarm | Metric | Threshold |
|---|---|---|
| `zkprova-prod-high-5xx-rate` | 5xx % | >5% / 5 min |
| `zkprova-prod-high-p99-latency` | HTTP request duration p99 | >5 seconds |
| `zkprova-prod-email-failures` | Email operation failures | >3 / 10 min |
| `zkprova-prod-db-pool-wait-high` | DB pool checkout wait p95 | >5 seconds |
| `zkprova-prod-pod-restarts` | Container restart count | >3 / 10 min |
| `zkprova-prod-zkp-proof-p99-high` | ZKP proof duration p99 | >5 seconds |
| `zkprova-rds-free-storage-low` | RDS free storage | <2 GB |
Additional Detection Sources¶
| Source | What It Detects | Alert Channel |
|---|---|---|
| Sentry | Application exceptions, unhandled errors | Email + Slack integration |
| WAF logs | SQLi, XSS, SSRF, bad bot, known-bad-input attempts | CloudWatch aws-waf-logs-zkprova-regional-prod (90-day retention) |
| Audit logs | Unusual access patterns, privilege escalation attempts, failed authorization | PostgreSQL audit_logs table (structured JSON, 7-year retention) |
| Dependabot | Dependency CVEs in Python (pip) and JavaScript (npm) | GitHub notification + PR |
| Trivy (CI) | Container image vulnerabilities (critical/high = build failure) | GitHub Actions check |
| Gitleaks (CI) | Leaked secrets in source code | GitHub Actions check |
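When triaging the WAF source above, the retained CloudWatch logs can be bucketed to reproduce the >100-blocked-requests-per-5-minutes spike condition offline. A sketch, assuming the standard AWS WAF log shape (JSON lines with an epoch-millisecond `timestamp` and an `action` field):

```python
import json
from collections import Counter

def blocked_spikes(waf_log_lines, threshold=100):
    """Count BLOCK actions per 5-minute bucket and flag buckets over threshold."""
    buckets = Counter()
    for line in waf_log_lines:
        entry = json.loads(line)
        if entry.get("action") == "BLOCK":
            # WAF timestamps are epoch milliseconds; 300_000 ms = 5 minutes.
            buckets[entry["timestamp"] // 300_000] += 1
    return {bucket: n for bucket, n in buckets.items() if n > threshold}
```

Running this over the `aws-waf-logs-zkprova-regional-prod` export for the incident window shows exactly which 5-minute buckets breached the spike threshold.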
CloudWatch Security Dashboard¶
The `zkprova-prod-security` dashboard provides real-time visibility into:
- WAF allowed vs. blocked request rates
- Authentication failure trends
- 5xx error rate
Response Procedures¶
Phase 1: Detection and Acknowledgment¶
- On-call engineer receives alert via PagerDuty
- Acknowledge the alert within the severity SLA
- Open a thread in Slack `#incidents` with initial details:
  - Alert name and source
  - Timestamp (UTC)
  - Preliminary severity assessment
Phase 2: Triage and Classification¶
- Confirm severity using the classification criteria above
- Assign Incident Commander (self for P2/P3; escalate for P0/P1)
- Check the CloudWatch security dashboard for correlated signals
- Pull relevant audit logs, filtering by `correlation_id` for request tracing
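The audit-log pull can be sketched as a single query against the `audit_logs` table. The column names here (`created_at`, `event_type`, `outcome`, `detail`) are assumptions based on the structured-JSON schema described under Detection Sources, and `sqlite3` stands in for the production PostgreSQL connection to keep the sketch self-contained:

```python
import sqlite3  # production is PostgreSQL; sqlite3 keeps this sketch runnable

def trace_request(conn, correlation_id):
    """Pull all audit-log rows for one request, oldest first."""
    cur = conn.execute(
        "SELECT created_at, event_type, outcome, detail "
        "FROM audit_logs WHERE correlation_id = ? "
        "ORDER BY created_at",
        (correlation_id,),
    )
    return cur.fetchall()
```

Given the `correlation_id` from a Sentry event or structured log line, this returns the full request trail in chronological order for the incident timeline.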
Phase 3: Containment¶
Available containment actions (in order of increasing impact):
| Action | Command / Procedure | Impact |
|---|---|---|
| Revoke compromised API key | Admin API endpoint or direct DB update | Immediate; affected lender loses access |
| Tighten rate limits | Update Redis rate limit configuration | Reduces throughput for affected tier |
| Block IPs via WAF | Add IP to WAF block rule or enable geo-blocking | Immediate; blocks all traffic from source |
| Enable maintenance mode | Set WAF `block_all_traffic` flag via Terraform | Full service outage (use only for P0) |
| Rotate database credentials | Update via AWS Secrets Manager + `kubectl rollout restart` | Brief pod restart cycle |
| Rotate JWT signing key | Update via AWS Secrets Manager + `kubectl rollout restart` (note: `@lru_cache` in `secrets.py` requires restart) | All existing JWTs invalidated; members must re-authenticate |
| Isolate pods | `kubectl cordon <node>` / `kubectl drain <node>` | Reduced capacity on affected node |
| Helm rollback | `helm rollback zkprova <revision>` | Reverts to previous deployment |
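The JWT-rotation caveat above (the `@lru_cache` in `secrets.py`) is worth seeing in isolation, because it is why rotating the secret in AWS Secrets Manager alone is not containment. This is an illustrative sketch — `_SECRETS` and `get_jwt_signing_key` are hypothetical stand-ins, not the actual `secrets.py` code:

```python
from functools import lru_cache

_SECRETS = {"jwt-signing-key": "old-key"}  # stand-in for AWS Secrets Manager

@lru_cache(maxsize=None)
def get_jwt_signing_key() -> str:
    """Fetch-once pattern: the first call is cached for the process lifetime."""
    return _SECRETS["jwt-signing-key"]

get_jwt_signing_key()                        # first call caches "old-key"
_SECRETS["jwt-signing-key"] = "new-key"      # secret rotated upstream...
stale = get_jwt_signing_key()                # ...but running pods still see "old-key"
```

Only replacing the process (here, `kubectl rollout restart`) clears the cache, which is why the containment table pairs every secret rotation with a restart.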
Phase 4: Eradication¶
- Identify root cause using structured logs (correlation IDs), Sentry stack traces, and the audit trail
- Develop and test the fix on staging (`api-staging.zkprova.com`)
- Deploy the fix through the standard CI/CD pipeline (PR → review → CI → merge → manual approval → deploy)
- For emergency changes: document justification, get verbal IC approval, deploy with expedited review
Phase 5: Recovery¶
- Verify service health: `kubectl rollout status deployment/zkprova-backend`
- Run smoke tests: `./scripts/smoke-test.sh`
- Check that the `/health` endpoint confirms DB connectivity
- Verify CloudWatch alarms return to OK state
- If a database restore is needed:
  - RDS point-in-time recovery (RPO: 24 hours, RTO: 30 minutes)
  - Restore script: `./scripts/test-rds-restore.sh` (adapted for production)
- Re-enable any disabled services or relaxed security controls
- Monitor for recurrence over the next 24 hours
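The `/health` check in the steps above can be automated in the recovery runbook. The response shape assumed here (`{"status": "ok", "database": "connected"}`) is an illustration, not the documented contract of the ZKProva endpoint:

```python
import json

def healthy(body: str) -> bool:
    """Validate a /health response body, including the DB-connectivity signal.

    Assumed response shape: {"status": "ok", "database": "connected"}.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return payload.get("status") == "ok" and payload.get("database") == "connected"
```

A recovery is only declared once this check passes and the CloudWatch alarms have returned to OK.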
Phase 6: Post-Incident Review¶
See Post-Incident Review section below.
Communication Templates¶
Template 1: Internal Escalation¶
Channel: Slack #incidents
When: Immediately upon severity confirmation
INCIDENT DECLARED — [P0/P1/P2]
Time: [YYYY-MM-DD HH:MM UTC]
Incident Commander: [Name]
Summary: [One-line description of the incident]
Detection: [Alert name / source that triggered detection]
Impact:
- Services affected: [API / Frontend / Proofs / Webhooks]
- Users affected: [Estimated count or "all" / "none confirmed"]
- Data exposure: [Yes — describe / No / Under investigation]
Current status: [Investigating / Contained / Mitigated]
Next update: [Time of next scheduled update]
Template 2: Customer Notification (Credit Unions / Lenders)¶
Channel: Email to affected partners
When: Within 4 hours for P0/P1; as needed for P2
Subject: ZKProva Service Incident — [Date]
Dear [Partner Name],
We are writing to inform you of a service incident affecting ZKProva.
What happened:
[Brief, non-technical description of the incident]
Impact to your organization:
[Specific impact — e.g., "Verification requests may have
experienced elevated latency between HH:MM and HH:MM UTC"]
What we are doing:
[Actions taken to resolve and prevent recurrence]
Data impact:
[Confirm whether any member data was accessed or exposed.
If yes, describe scope and remediation steps.]
Current status: [Resolved / Ongoing with ETA]
Next steps:
[Any action required from the partner, or "No action required"]
If you have questions, please contact us at security@zkprova.com
(4-hour response SLA).
Sincerely,
[Communications Lead Name]
ZKProva Security Team
Template 3: Post-Mortem Report¶
Channel: Shared document (linked from Slack #incidents)
When: Within 5 business days of incident resolution
POST-MORTEM: [Incident Title]
Date: [YYYY-MM-DD]
Severity: [P0/P1/P2]
Duration: [Start time — End time UTC]
Incident Commander: [Name]
Author: [Name]
TIMELINE (all times UTC)
HH:MM — [Event]
HH:MM — [Event]
...
ROOT CAUSE
[Technical description of what caused the incident]
IMPACT
- Duration: [X hours Y minutes]
- Users affected: [Count or description]
- Data exposure: [None / Description]
- Revenue impact: [None / Description]
DETECTION
- How was the incident detected? [Alert name / manual report]
- Time to detect: [Duration from incident start to first alert]
RESPONSE
- Time to acknowledge: [Duration]
- Time to contain: [Duration]
- Time to resolve: [Duration]
CONTRIBUTING FACTORS
1. [Factor]
2. [Factor]
CORRECTIVE ACTIONS
| Action | Owner | Due Date | Status |
|---|---|---|---|
| [Action item] | [Name] | [Date] | [Open/Done] |
LESSONS LEARNED
- What went well: [Description]
- What could be improved: [Description]
Tabletop Scenarios¶
These scenarios are exercised quarterly. Each exercise is documented with participants, decisions made, and improvements identified.
Scenario 1: Credential Database Breach¶
Setup: An attacker exploits a SQL injection vulnerability (bypassing Pydantic validation on a new, unvalidated endpoint) and exfiltrates the credentials table containing AES-256-GCM encrypted credential data and the members table containing email addresses and bcrypt-hashed passwords.
Discussion points:
- Detection: Which alarm fires first? (Likely `auth-failure-spike` if the attacker tests credentials, or the WAF `AWSManagedRulesSQLiRuleSet` if the SQLi pattern is caught.) What if the exfiltration is slow and stays under rate limits?
- Severity: P0 — confirmed data breach with member PII (emails) exposed even though credential data is encrypted.
- Containment: Block the attacker IP via WAF. Rotate database credentials via Secrets Manager + `kubectl rollout restart`. Enable the `block_all_traffic` WAF flag if active exploitation continues.
- Eradication: Identify and patch the vulnerable endpoint. Deploy via the emergency change process.
- Recovery: Are encrypted credentials safe? AES-256-GCM is secure if the encryption key was not also exfiltrated. Verify encryption key isolation (stored in AWS Secrets Manager, not in DB). Force password reset for all members. Revoke and re-issue all active credentials.
- Communication: Notify affected credit unions within 4 hours. Notify members whose emails were exposed. File regulatory notifications within 72 hours if GDPR/CCPA applies.
- Post-mortem: Add SAST/DAST to CI pipeline (Gap #8). Review all endpoints for input validation coverage.
Scenario 2: Complete Service Outage¶
Setup: A misconfigured Helm release pushes a broken backend image to production. All 3 backend pods fail readiness probes and are taken out of service. The API returns 503 for all requests. Frontend loads but cannot communicate with the backend.
Discussion points:
- Detection: The `zkprova-prod-high-5xx-rate` alarm fires within 5 minutes (threshold: >5%). Sentry floods with connection errors. Kubernetes events show `CrashLoopBackOff`.
- Severity: P0 — complete service outage.
- Containment: Immediate Helm rollback: `helm rollback zkprova <previous-revision>`. The PDB (`minAvailable=2`) should have prevented this — investigate why it didn't catch the bad rollout.
- Recovery: Verify rollback success with `kubectl rollout status`. Run smoke tests (`./scripts/smoke-test.sh`). Check that the `/health` endpoint confirms DB connectivity. Monitor for 1 hour.
- Prevention: Add post-deploy smoke tests to CI/CD pipeline (already implemented in Issue #101). Add canary deployment strategy. Review PDB configuration.
Scenario 3: ZKP Proof Forgery Attempt¶
Setup: A malicious lender submits a crafted verification request with a manipulated Groth16 proof that attempts to bypass the verification circuit. The proof claims a member has a credit score of 800 when the actual credential contains 620.
Discussion points:
- Detection: The Groth16 verifier rejects the proof mathematically — this is a cryptographic guarantee, not an application-level check. The audit log records `proof.verified` with `outcome=failure`. If attempts are repeated, `auth-failure-spike` or rate limiting may trigger.
- Severity: P2 initially (proof correctly rejected, no data exposure). Escalate to P1 if the pattern suggests a sophisticated attack targeting the circuit or proving system.
- Investigation: Review the verification key and circuit. Is the attacker using a different circuit (trusted setup mismatch)? Are they attempting to exploit a known Groth16 vulnerability? Check if snarkjs version has known CVEs.
- Containment: Rate-limit or block the lender's API key. Review all recent verifications from this lender for anomalies.
- Escalation: If the proof passes verification with incorrect claims, this is P0 — the circuit or trusted setup is compromised. Immediately halt all verifications (`block_all_traffic`). Engage cryptography experts (Trail of Bits, preferred pentest firm).
- Communication: Notify all credit unions that verification integrity may be compromised. Suspend credential acceptance until the circuit audit is complete.
Post-Incident Review¶
Process¶
- Schedule: Post-mortem meeting held within 3 business days of incident resolution
- Participants: IC, Engineering Lead, Communications Lead, and any engineers involved in response
- Format: Blameless — focus on systems and processes, not individuals
- Output: Post-mortem document (Template 3 above) with corrective action items
- Tracking: Action items created as GitHub Issues and tracked to completion
- Review: Corrective actions reviewed in next quarterly security review
Metrics Tracked¶
| Metric | Target | Measurement |
|---|---|---|
| Mean Time to Detect (MTTD) | <5 minutes for P0/P1 | Time from incident start to first alert |
| Mean Time to Acknowledge (MTTA) | <15 minutes for P0 | Time from alert to human acknowledgment |
| Mean Time to Contain (MTTC) | <1 hour for P0 | Time from acknowledgment to containment |
| Mean Time to Resolve (MTTR) | <4 hours for P0 | Time from detection to full resolution |
| Post-mortem completion rate | 100% for P0/P1 | Post-mortems completed within 5 business days |
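The tracked durations follow mechanically from the incident timeline. A sketch (the `response_metrics` helper is hypothetical), using the same definitions as the table above — note in particular that MTTR runs from detection to full resolution, not from acknowledgment:

```python
def response_metrics(start, detected, acked, contained, resolved):
    """Derive the tracked durations (in minutes) from incident timestamps."""
    mins = lambda a, b: (b - a).total_seconds() / 60
    return {
        "MTTD": mins(start, detected),      # incident start -> first alert
        "MTTA": mins(detected, acked),      # alert -> human acknowledgment
        "MTTC": mins(acked, contained),     # acknowledgment -> containment
        "MTTR": mins(detected, resolved),   # detection -> full resolution
    }
```

Computing these per incident and averaging across the quarter gives the values reviewed against the P0/P1 targets.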
Document Control¶
| Version | Date | Author | Description |
|---|---|---|---|
| 1.0 | 2026-02-28 | ZKProva Engineering | Initial incident response plan |
This document satisfies SOC 2 Trust Service Criteria CC7.2 (Incident Management) and CC7.4 (Incident Communication). It is reviewed quarterly and updated after each P0/P1 incident post-mortem.