Incident Response Plan¶
Organization: ZKProva
System: ZKP-Powered Portable Credit Union Identity
SOC 2 Criteria: CC7.2 (Incident Management), CC7.4 (Incident Communication)
Document Version: 1.0
Effective Date: 2026-02-28
Classification: Confidential
Review Cadence: Quarterly (next review: 2026-05-31)
Table of Contents¶
- Severity Classification
- Roles and Responsibilities
- Escalation Path
- Detection Sources
- Response Procedures
- Communication Templates
- Tabletop Scenarios
- Post-Incident Review
- Document Control
Severity Classification¶
| Level | Name | Definition | Response Time SLA | Examples |
|---|---|---|---|---|
| P0 | Critical | Confirmed data breach, complete service outage, or cryptographic key compromise | 15 minutes | Credential DB exfiltration, Ed25519 signing key leak, ZKP proof forgery, total API unavailability |
| P1 | High | Significant service degradation, potential security event, or partial outage | 1 hour | Sustained 5xx error rate >5%, WAF blocked request spike (>100/5min), authentication failure anomaly, RDS failover triggered |
| P2 | Medium | Limited impact, no data exposure, isolated component failure | 4 hours | Single endpoint returning errors, ElastiCache connection issues, elevated proof generation latency (p99 >5s), email delivery failures |
| P3 | Low | Cosmetic issues, minor anomalies, no user impact | Next business day | Non-critical log anomalies, UI rendering issues, Dependabot advisory on low-severity CVE |
Severity Determination Criteria¶
When classifying an incident, consider:
- Data exposure — Is member PII, credential data, or key material at risk?
- Service availability — Is the API serving requests? Are proofs verifiable?
- Blast radius — How many members/lenders are affected?
- Recoverability — Can the impact be reversed (e.g., key rotation, credential revocation)?
If uncertain, classify one level higher and downgrade after triage.
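The four criteria above can be captured in a minimal triage helper. This is a sketch only — the `Impact` fields and `classify` function are hypothetical names paraphrasing the criteria, not part of the ZKProva codebase — but it makes the "classify one level higher when uncertain" rule mechanical:

```python
from dataclasses import dataclass

@dataclass
class Impact:
    data_exposure: bool   # member PII, credential data, or key material at risk
    full_outage: bool     # API not serving requests at all
    degraded: bool        # partial outage or significant degradation
    user_facing: bool     # any member/lender impact at all

def classify(impact: Impact, uncertain: bool = False) -> str:
    """Map triage answers to a P-level; bump one level more severe when uncertain."""
    if impact.data_exposure or impact.full_outage:
        level = 0
    elif impact.degraded:
        level = 1
    elif impact.user_facing:
        level = 2
    else:
        level = 3
    if uncertain:
        level = max(level - 1, 0)  # "classify one level higher" = more severe
    return ["P0", "P1", "P2", "P3"][level]
```

For example, an isolated component failure with no user impact triages as P3, but the same signal with unresolved questions about data exposure would be held at P2 until triage completes.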
Roles and Responsibilities¶
Incident Commander (IC)¶
- Owns the incident lifecycle from detection to post-mortem closure
- Makes severity classification and escalation decisions
- Coordinates across engineering, communications, and leadership
- Ensures all actions are documented in the incident timeline
- Primary: CTO / Engineering Lead
- Backup: Senior Engineer on rotation
Engineering Lead¶
- Performs technical investigation using correlation IDs and structured logs
- Implements containment measures (API key revocation, rate limit adjustment, pod isolation)
- Executes recovery procedures (database restore, key rotation, Helm rollback)
- Produces root cause analysis for post-mortem
- Primary: On-call engineer
- Backup: Any backend engineer with production access
Communications Lead¶
- Drafts and sends internal status updates to #incidents channel
- Prepares external notifications for affected credit unions and lenders
- Updates status page with incident timeline and resolution ETA
- Manages regulatory notification obligations (72-hour window for personal data incidents)
- Primary: CEO / Head of Operations
- Backup: CTO
Escalation Path¶
Detection Source
|
v
CloudWatch Alarm / Sentry Alert / WAF Log / Audit Log Anomaly
|
v
PagerDuty (on-call engineer paged)
|
v
On-call engineer acknowledges (15 min SLA for P0)
|
v
Incident Commander assigned, severity confirmed
|
v
Slack #incidents channel created/updated
|
v
[If P0/P1] Leadership notified within 30 minutes
|
v
[If data breach] External notification within 72 hours
|
v
StatusPage updated (public-facing status)
Escalation Timing¶
| Action | P0 | P1 | P2 | P3 |
|---|---|---|---|---|
| On-call acknowledgment | 15 min | 30 min | 2 hours | Next business day |
| IC assigned | 15 min | 1 hour | 4 hours | N/A |
| Leadership notification | 30 min | 2 hours | Daily standup | N/A |
| External communication | 1 hour | 4 hours | As needed | N/A |
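The escalation timing table translates directly into absolute deadlines once an incident is detected. A sketch (the `deadlines` helper is hypothetical; P3 is omitted because its actions are next-business-day or N/A):

```python
from datetime import datetime, timedelta

# SLA minutes transcribed from the escalation timing table (None = N/A).
ESCALATION_SLA_MIN = {
    "P0": {"ack": 15,  "ic": 15,  "leadership": 30,   "external": 60},
    "P1": {"ack": 30,  "ic": 60,  "leadership": 120,  "external": 240},
    "P2": {"ack": 120, "ic": 240, "leadership": None, "external": None},
}

def deadlines(severity: str, detected_at: datetime) -> dict:
    """Compute the absolute deadline for each escalation action."""
    return {
        action: detected_at + timedelta(minutes=m) if m is not None else None
        for action, m in ESCALATION_SLA_MIN[severity].items()
    }
```

For a P0 detected at 12:00 UTC, leadership notification is due by 12:30 and external communication by 13:00.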
Detection Sources¶
CloudWatch Security Alarms¶
These alarms fire to the `zkprova-security-alerts-prod` SNS topic:
| Alarm | Metric | Threshold | Implied Severity |
|---|---|---|---|
| `zkprova-prod-auth-failure-spike` | `zkprova_auth_operations_total{result=failure}` anomaly detection (2 std dev) | Anomaly band breach | P1 |
| `zkprova-prod-auth-failure-absolute` | `zkprova_auth_operations_total{result=failure}` | >50 failures / 5 min | P1 |
| `zkprova-prod-security-high-5xx-rate` | 5xx / total request % | >1% over 5 min | P1 |
| `zkprova-prod-waf-blocked-spike` | `AWS/WAFV2 BlockedRequests` | >100 requests / 5 min | P1 (investigate for P0) |
CloudWatch Operational Alarms¶
These alarms fire to the `zkprova-alerts-prod` SNS topic:
| Alarm | Metric | Threshold |
|---|---|---|
| `zkprova-prod-high-5xx-rate` | 5xx % | >5% / 5 min |
| `zkprova-prod-high-p99-latency` | HTTP request duration p99 | >5 seconds |
| `zkprova-prod-email-failures` | Email operation failures | >3 / 10 min |
| `zkprova-prod-db-pool-wait-high` | DB pool checkout wait p95 | >5 seconds |
| `zkprova-prod-pod-restarts` | Container restart count | >3 / 10 min |
| `zkprova-prod-zkp-proof-p99-high` | ZKP proof duration p99 | >5 seconds |
| `zkprova-rds-free-storage-low` | RDS free storage | <2 GB |
Additional Detection Sources¶
| Source | What It Detects | Alert Channel |
|---|---|---|
| Sentry | Application exceptions, unhandled errors | Email + Slack integration |
| WAF logs | SQLi, XSS, SSRF, bad bot, known-bad-input attempts | CloudWatch aws-waf-logs-zkprova-regional-prod (90-day retention) |
| Audit logs | Unusual access patterns, privilege escalation attempts, failed authorization | PostgreSQL audit_logs table (structured JSON, 7-year retention) |
| Dependabot | Dependency CVEs in Python (pip) and JavaScript (npm) | GitHub notification + PR |
| Trivy (CI) | Container image vulnerabilities (critical/high = build failure) | GitHub Actions check |
| Gitleaks (CI) | Leaked secrets in source code | GitHub Actions check |
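When triaging the WAF source above, the retained CloudWatch logs can be bucketed to reproduce the >100-blocked-requests-per-5-minutes spike condition offline. A sketch, assuming the standard AWS WAF log shape (JSON lines with an epoch-millisecond `timestamp` and an `action` field):

```python
import json
from collections import Counter

def blocked_spikes(waf_log_lines, threshold=100):
    """Count BLOCK actions per 5-minute bucket and flag buckets over threshold."""
    buckets = Counter()
    for line in waf_log_lines:
        entry = json.loads(line)
        if entry.get("action") == "BLOCK":
            # WAF timestamps are epoch milliseconds; 300_000 ms = 5 minutes.
            buckets[entry["timestamp"] // 300_000] += 1
    return {bucket: n for bucket, n in buckets.items() if n > threshold}
```

Running this over the `aws-waf-logs-zkprova-regional-prod` export for the incident window shows exactly which 5-minute buckets breached the spike threshold.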
CloudWatch Security Dashboard¶
The `zkprova-prod-security` dashboard provides real-time visibility into:
- WAF allowed vs. blocked request rates
- Authentication failure trends
- 5xx error rate
Response Procedures¶
Phase 1: Detection and Acknowledgment¶
- On-call engineer receives alert via PagerDuty
- Acknowledge the alert within the severity SLA
- Open a thread in Slack `#incidents` with initial details:
  - Alert name and source
  - Timestamp (UTC)
  - Preliminary severity assessment
Phase 2: Triage and Classification¶
- Confirm severity using the classification criteria above
- Assign Incident Commander (self for P2/P3; escalate for P0/P1)
- Check the CloudWatch security dashboard for correlated signals
- Pull relevant audit logs, filtering by `correlation_id` for request tracing
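The audit-log pull can be sketched as a single query against the `audit_logs` table. The column names here (`created_at`, `event_type`, `outcome`, `detail`) are assumptions based on the structured-JSON schema described under Detection Sources, and `sqlite3` stands in for the production PostgreSQL connection to keep the sketch self-contained:

```python
import sqlite3  # production is PostgreSQL; sqlite3 keeps this sketch runnable

def trace_request(conn, correlation_id):
    """Pull all audit-log rows for one request, oldest first."""
    cur = conn.execute(
        "SELECT created_at, event_type, outcome, detail "
        "FROM audit_logs WHERE correlation_id = ? "
        "ORDER BY created_at",
        (correlation_id,),
    )
    return cur.fetchall()
```

Given the `correlation_id` from a Sentry event or structured log line, this returns the full request trail in chronological order for the incident timeline.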
Phase 3: Containment¶
Available containment actions (in order of increasing impact):
| Action | Command / Procedure | Impact |
|---|---|---|
| Revoke compromised API key | Admin API endpoint or direct DB update | Immediate; affected lender loses access |
| Tighten rate limits | Update Redis rate limit configuration | Reduces throughput for affected tier |
| Block IPs via WAF | Add IP to WAF block rule or enable geo-blocking | Immediate; blocks all traffic from source |
| Enable maintenance mode | Set WAF `block_all_traffic` flag via Terraform | Full service outage (use only for P0) |
| Rotate database credentials | Update via AWS Secrets Manager + `kubectl rollout restart` | Brief pod restart cycle |
| Rotate JWT signing key | Update via AWS Secrets Manager + `kubectl rollout restart` (note: `@lru_cache` in `secrets.py` requires restart) | All existing JWTs invalidated; members must re-authenticate |
| Isolate pods | `kubectl cordon <node>` / `kubectl drain <node>` | Reduced capacity on affected node |
| Helm rollback | `helm rollback zkprova <revision>` | Reverts to previous deployment |
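The JWT-rotation caveat above (the `@lru_cache` in `secrets.py`) is worth seeing in isolation, because it is why rotating the secret in AWS Secrets Manager alone is not containment. This is an illustrative sketch — `_SECRETS` and `get_jwt_signing_key` are hypothetical stand-ins, not the actual `secrets.py` code:

```python
from functools import lru_cache

_SECRETS = {"jwt-signing-key": "old-key"}  # stand-in for AWS Secrets Manager

@lru_cache(maxsize=None)
def get_jwt_signing_key() -> str:
    """Fetch-once pattern: the first call is cached for the process lifetime."""
    return _SECRETS["jwt-signing-key"]

get_jwt_signing_key()                        # first call caches "old-key"
_SECRETS["jwt-signing-key"] = "new-key"      # secret rotated upstream...
stale = get_jwt_signing_key()                # ...but running pods still see "old-key"
```

Only replacing the process (here, `kubectl rollout restart`) clears the cache, which is why the containment table pairs every secret rotation with a restart.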
Phase 4: Eradication¶
- Identify root cause using structured logs (correlation IDs), Sentry stack traces, and the audit trail
- Develop and test the fix on staging (`api-staging.zkprova.com`)
- Deploy the fix through the standard CI/CD pipeline (PR → review → CI → merge → manual approval → deploy)
- For emergency changes: document justification, get verbal IC approval, deploy with expedited review
Phase 5: Recovery¶
- Verify service health: `kubectl rollout status deployment/zkprova-backend`
- Run smoke tests: `./scripts/smoke-test.sh`
- Check that the `/health` endpoint confirms DB connectivity
- Verify CloudWatch alarms return to OK state
- If a database restore is needed:
  - RDS point-in-time recovery (RPO: 24 hours, RTO: 30 minutes)
  - Restore script: `./scripts/test-rds-restore.sh` (adapted for production)
- Re-enable any disabled services or relaxed security controls
- Monitor for recurrence over the next 24 hours
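The `/health` check in the steps above can be automated in the recovery runbook. The response shape assumed here (`{"status": "ok", "database": "connected"}`) is an illustration, not the documented contract of the ZKProva endpoint:

```python
import json

def healthy(body: str) -> bool:
    """Validate a /health response body, including the DB-connectivity signal.

    Assumed response shape: {"status": "ok", "database": "connected"}.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return payload.get("status") == "ok" and payload.get("database") == "connected"
```

A recovery is only declared once this check passes and the CloudWatch alarms have returned to OK.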
Phase 6: Post-Incident Review¶
See Post-Incident Review section below.
Communication Templates¶
Template 1: Internal Escalation¶
Channel: Slack #incidents
When: Immediately upon severity confirmation
INCIDENT DECLARED — [P0/P1/P2]
Time: [YYYY-MM-DD HH:MM UTC]
Incident Commander: [Name]
Summary: [One-line description of the incident]
Detection: [Alert name / source that triggered detection]
Impact:
- Services affected: [API / Frontend / Proofs / Webhooks]
- Users affected: [Estimated count or "all" / "none confirmed"]
- Data exposure: [Yes — describe / No / Under investigation]
Current status: [Investigating / Contained / Mitigated]
Next update: [Time of next scheduled update]
Template 2: Customer Notification (Credit Unions / Lenders)¶
Channel: Email to affected partners
When: Within 4 hours for P0/P1; as needed for P2
Subject: ZKProva Service Incident — [Date]
Dear [Partner Name],
We are writing to inform you of a service incident affecting ZKProva.
What happened:
[Brief, non-technical description of the incident]
Impact to your organization:
[Specific impact — e.g., "Verification requests may have
experienced elevated latency between HH:MM and HH:MM UTC"]
What we are doing:
[Actions taken to resolve and prevent recurrence]
Data impact:
[Confirm whether any member data was accessed or exposed.
If yes, describe scope and remediation steps.]
Current status: [Resolved / Ongoing with ETA]
Next steps:
[Any action required from the partner, or "No action required"]
If you have questions, please contact us at security@zkprova.com
(4-hour response SLA).
Sincerely,
[Communications Lead Name]
ZKProva Security Team
Template 3: Post-Mortem Report¶
Channel: Shared document (linked from Slack #incidents)
When: Within 5 business days of incident resolution
POST-MORTEM: [Incident Title]
Date: [YYYY-MM-DD]
Severity: [P0/P1/P2]
Duration: [Start time — End time UTC]
Incident Commander: [Name]
Author: [Name]
TIMELINE (all times UTC)
HH:MM — [Event]
HH:MM — [Event]
...
ROOT CAUSE
[Technical description of what caused the incident]
IMPACT
- Duration: [X hours Y minutes]
- Users affected: [Count or description]
- Data exposure: [None / Description]
- Revenue impact: [None / Description]
DETECTION
- How was the incident detected? [Alert name / manual report]
- Time to detect: [Duration from incident start to first alert]
RESPONSE
- Time to acknowledge: [Duration]
- Time to contain: [Duration]
- Time to resolve: [Duration]
CONTRIBUTING FACTORS
1. [Factor]
2. [Factor]
CORRECTIVE ACTIONS
| Action | Owner | Due Date | Status |
|---|---|---|---|
| [Action item] | [Name] | [Date] | [Open/Done] |
LESSONS LEARNED
- What went well: [Description]
- What could be improved: [Description]
Tabletop Scenarios¶
These scenarios are exercised quarterly. Each exercise is documented with participants, decisions made, and improvements identified.
Scenario 1: Credential Database Breach¶
Setup: An attacker exploits a SQL injection vulnerability (bypassing Pydantic validation on a new, unvalidated endpoint) and exfiltrates the credentials table containing AES-256-GCM encrypted credential data and the members table containing email addresses and bcrypt-hashed passwords.
Discussion points:
- Detection: Which alarm fires first? (Likely `auth-failure-spike` if the attacker tests credentials, or the WAF `AWSManagedRulesSQLiRuleSet` if the SQLi pattern is caught.) What if the exfiltration is slow and stays under rate limits?
- Severity: P0 — confirmed data breach with member PII (emails) exposed even though credential data is encrypted.
- Containment: Block the attacker IP via WAF. Rotate database credentials via Secrets Manager + `kubectl rollout restart`. Enable the `block_all_traffic` WAF flag if active exploitation continues.
- Eradication: Identify and patch the vulnerable endpoint. Deploy via the emergency change process.
- Recovery: Are encrypted credentials safe? AES-256-GCM is secure if the encryption key was not also exfiltrated. Verify encryption key isolation (stored in AWS Secrets Manager, not in DB). Force password reset for all members. Revoke and re-issue all active credentials.
- Communication: Notify affected credit unions within 4 hours. Notify members whose emails were exposed. File regulatory notifications within 72 hours if GDPR/CCPA applies.
- Post-mortem: Add SAST/DAST to CI pipeline (Gap #8). Review all endpoints for input validation coverage.
Scenario 2: Complete Service Outage¶
Setup: A misconfigured Helm release pushes a broken backend image to production. All 3 backend pods fail readiness probes and are taken out of service. The API returns 503 for all requests. Frontend loads but cannot communicate with the backend.
Discussion points:
- Detection: The `zkprova-prod-high-5xx-rate` alarm fires within 5 minutes (threshold: >5%). Sentry floods with connection errors. Kubernetes events show `CrashLoopBackOff`.
- Severity: P0 — complete service outage.
- Containment: Immediate Helm rollback: `helm rollback zkprova <previous-revision>`. The PDB (`minAvailable=2`) should have prevented this — investigate why it didn't catch the bad rollout.
- Recovery: Verify rollback success with `kubectl rollout status`. Run smoke tests (`./scripts/smoke-test.sh`). Check that the `/health` endpoint confirms DB connectivity. Monitor for 1 hour.
- Prevention: Add post-deploy smoke tests to CI/CD pipeline (already implemented in Issue #101). Add canary deployment strategy. Review PDB configuration.
Scenario 3: ZKP Proof Forgery Attempt¶
Setup: A malicious lender submits a crafted verification request with a manipulated Groth16 proof that attempts to bypass the verification circuit. The proof claims a member has a credit score of 800 when the actual credential contains 620.
Discussion points:
- Detection: The Groth16 verifier rejects the proof mathematically — this is a cryptographic guarantee, not an application-level check. The audit log records `proof.verified` with `outcome=failure`. If attempts are repeated, `auth-failure-spike` or rate limiting may trigger.
- Severity: P2 initially (proof correctly rejected, no data exposure). Escalate to P1 if the pattern suggests a sophisticated attack targeting the circuit or proving system.
- Investigation: Review the verification key and circuit. Is the attacker using a different circuit (trusted setup mismatch)? Are they attempting to exploit a known Groth16 vulnerability? Check if snarkjs version has known CVEs.
- Containment: Rate-limit or block the lender's API key. Review all recent verifications from this lender for anomalies.
- Escalation: If the proof passes verification with incorrect claims, this is P0 — the circuit or trusted setup is compromised. Immediately halt all verifications (`block_all_traffic`). Engage cryptography experts (Trail of Bits, preferred pentest firm).
- Communication: Notify all credit unions that verification integrity may be compromised. Suspend credential acceptance until the circuit audit is complete.
Post-Incident Review¶
Process¶
- Schedule: Post-mortem meeting held within 3 business days of incident resolution
- Participants: IC, Engineering Lead, Communications Lead, and any engineers involved in response
- Format: Blameless — focus on systems and processes, not individuals
- Output: Post-mortem document (Template 3 above) with corrective action items
- Tracking: Action items created as GitHub Issues and tracked to completion
- Review: Corrective actions reviewed in next quarterly security review
Metrics Tracked¶
| Metric | Target | Measurement |
|---|---|---|
| Mean Time to Detect (MTTD) | <5 minutes for P0/P1 | Time from incident start to first alert |
| Mean Time to Acknowledge (MTTA) | <15 minutes for P0 | Time from alert to human acknowledgment |
| Mean Time to Contain (MTTC) | <1 hour for P0 | Time from acknowledgment to containment |
| Mean Time to Resolve (MTTR) | <4 hours for P0 | Time from detection to full resolution |
| Post-mortem completion rate | 100% for P0/P1 | Post-mortems completed within 5 business days |
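The tracked durations follow mechanically from the incident timeline. A sketch (the `response_metrics` helper is hypothetical), using the same definitions as the table above — note in particular that MTTR runs from detection to full resolution, not from acknowledgment:

```python
def response_metrics(start, detected, acked, contained, resolved):
    """Derive the tracked durations (in minutes) from incident timestamps."""
    mins = lambda a, b: (b - a).total_seconds() / 60
    return {
        "MTTD": mins(start, detected),      # incident start -> first alert
        "MTTA": mins(detected, acked),      # alert -> human acknowledgment
        "MTTC": mins(acked, contained),     # acknowledgment -> containment
        "MTTR": mins(detected, resolved),   # detection -> full resolution
    }
```

Computing these per incident and averaging across the quarter gives the values reviewed against the P0/P1 targets.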
Document Control¶
| Version | Date | Author | Description |
|---|---|---|---|
| 1.0 | 2026-02-28 | ZKProva Engineering | Initial incident response plan |
This document satisfies SOC 2 Trust Service Criteria CC7.2 (Incident Management) and CC7.4 (Incident Communication). It is reviewed quarterly and updated after each P0/P1 incident post-mortem.