PayCraft Incident-Response Simulation Runbook
Phase 4 of paycraft-v2-production-readiness — quarterly tabletop exercise that proves the on-call (currently single-person, founder-led) can actually detect, mitigate, and post-mortem a real-world incident before customers demand it.
Cadence: Quarterly, first Thursday. Owner: Founder (no rotation today; promote to oncall.com rotation when ARR > $50K). Last exercised: (none — bootstrap scheduled at P4 sign-off)
Three rehearsable scenarios
Run one per quarter in rotation. Each scenario is timed; targets follow
the published SLA in docs/SLA_DASHBOARD.md.
Scenario 1 — Stripe webhook ingress goes 5xx
Trigger (simulated):
- In a staging environment, force the `stripe-webhook` edge function to `throw new Error("simulated")` on every event.
- Send a Stripe-CLI replay of `payment_intent.succeeded` from `stripe trigger`.
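One way to force the failure without editing the handler body is a wrapper controlled by a staging-only switch. This is a minimal sketch, not the actual `stripe-webhook` implementation: `withSimulatedFailure`, `respond`, `WebhookEvent`, and `HandlerResult` are hypothetical names introduced for illustration.

```typescript
// Hypothetical shapes for the drill; the real edge function's types may differ.
interface WebhookEvent {
  id: string;
  type: string; // e.g. "payment_intent.succeeded"
}

interface HandlerResult {
  status: number;
  body: string;
}

// Wraps a handler so that, when simulateFailure is true, every event throws.
function withSimulatedFailure(
  handler: (event: WebhookEvent) => HandlerResult,
  simulateFailure: boolean
): (event: WebhookEvent) => HandlerResult {
  return function (event: WebhookEvent): HandlerResult {
    if (simulateFailure) {
      throw new Error("simulated");
    }
    return handler(event);
  };
}

// Converts a thrown error into the 5xx response the webhook sender would see.
function respond(
  handler: (event: WebhookEvent) => HandlerResult,
  event: WebhookEvent
): HandlerResult {
  try {
    return handler(event);
  } catch (err) {
    const message = err instanceof Error ? err.message : "unknown error";
    return { status: 500, body: message };
  }
}
```

Keeping the switch outside the handler means the drill can be ended by flipping one value rather than reverting code.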
Expected detection (target ≤ 5 min):
- upptime probe to `/api/webhooks/stripe/__ping` flips RED.
- Sentry shows a spike of `webhook_retry` events tagged `provider=stripe`.
- Status page (manual update) goes to "Partial outage".
Expected mitigation (target ≤ 15 min):
- Identify offending code path (last deploy SHA in Vercel dashboard).
- Rollback via `/paycraft-deploy ship` from the previous good SHA, OR hotfix the bug and redeploy.
- Verify upptime probe returns GREEN.
- Verify Sentry retry rate drops.
Expected post-mortem (target ≤ 24 h):
- Post-mortem written in `docs/reports/postmortem-YYYYMMDD-webhook-5xx.md`.
- Linked from public status page.
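The file-naming convention above can be generated rather than typed by hand. The helper below is a small sketch; `postmortemPath` and `twoDigits` are hypothetical names, and using the UTC date of the incident is an assumption the runbook does not state.

```typescript
// Left-pads a number to two digits.
function twoDigits(n: number): string {
  return n < 10 ? "0" + n : String(n);
}

// Builds docs/reports/postmortem-YYYYMMDD-<slug>.md from the incident date.
// Assumption: the date is taken in UTC.
function postmortemPath(incidentDate: Date, slug: string): string {
  const yyyy = String(incidentDate.getUTCFullYear());
  const mm = twoDigits(incidentDate.getUTCMonth() + 1); // months are 0-based
  const dd = twoDigits(incidentDate.getUTCDate());
  return "docs/reports/postmortem-" + yyyy + mm + dd + "-" + slug + ".md";
}
```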
Scenario 2 — Framework-supabase database is down (or wiped)
Trigger (simulated):
- Use a fresh local Supabase: `supabase stop && supabase start`.
- Point `dashboard/.env.local` at the empty local instance.
- Open `http://localhost:3000/dashboard` and observe RLS-empty / 500s.
Expected detection (target ≤ 5 min):
- `/api/health` returns `status: error` (Postgres `connection refused` or empty schema).
- upptime flips RED on the "Health endpoint" probe.
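The detection rule above can be sketched as a small classifier that maps a health response to a probe color. The response shape and the healthy value `"ok"` are assumptions (the runbook only shows `status: error`), and `classifyHealth` is a hypothetical helper, not part of upptime.

```typescript
// Assumed response shape for /api/health; only `status: error` is confirmed.
interface HealthResponse {
  status: string;
  detail?: string; // e.g. "connection refused"
}

type ProbeColor = "GREEN" | "RED";

// Treats a non-200 HTTP code, a missing body, or any status other than "ok" as RED.
function classifyHealth(httpStatus: number, body: HealthResponse | null): ProbeColor {
  if (httpStatus !== 200 || body === null) {
    return "RED";
  }
  return body.status === "ok" ? "GREEN" : "RED";
}
```

Failing closed (anything unexpected is RED) suits a drill, since an empty schema may still return HTTP 200 with an error status in the body.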
Expected mitigation (target ≤ 4 h per DR_RUNBOOK):
- Follow `docs/DR_RUNBOOK.md` Steps 2-4.
- Pull latest dump from R2.
- Restore into a fresh Supabase project.
- Repoint Vercel env to new project.
Expected post-mortem:
- Verify the daily backup that the restore was taken from.
- Note actual RTO observed vs target (4 h).
- Append a drill row in `docs/DR_RUNBOOK.md#drill-log`.
Scenario 3 — Suspected tenant data leak (cross-tenant access)
Trigger (simulated):
- In staging, deliberately add a row in `tenant_products` with the wrong `tenant_id` foreign key (bypassing the FK constraint via SQL editor).
- Sign in as tenant A; navigate to `/products`.
- Observe whether RLS hides the row (it should, given the `tenant_products.select` policy on `tenant_id = current_setting('app.tenant_id')`).
Expected detection (target: immediate):
- Tenant A's `/products` page renders correctly and does NOT include tenant B's products.
- `dashboard/__tests__/api/rls-isolation.test.ts` PASSES (CI gate).
Expected mitigation:
- If RLS did leak: file a SEV-1, freeze deploys, audit recent migrations touching `tenant_products` RLS, revoke any `BYPASSRLS` grants.
- Email all tenants within 72 h per GDPR / DPA Section 7.
- Rotate every tenant's API keys (forced via `rotate_api_key` for all).
Expected post-mortem:
- Public post-mortem (no PII, just timeline + root cause + fix).
- Add a regression test to `dashboard/__tests__/api/rls-isolation.test.ts`.
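The core assertion of such a regression test might look like the sketch below. It covers only the check itself: fetching rows as tenant A (for example through the Supabase client) is omitted, and `ProductRow` and `leakedRows` are hypothetical names, not taken from the existing test file.

```typescript
// Minimal row shape for the check; the real table has more columns.
interface ProductRow {
  id: string;
  tenant_id: string;
}

// Returns every row that belongs to a tenant other than the signed-in one.
// An empty result means RLS isolated the query as expected.
function leakedRows(rows: ProductRow[], signedInTenantId: string): ProductRow[] {
  return rows.filter(function (row: ProductRow): boolean {
    return row.tenant_id !== signedInTenantId;
  });
}
```

A test would then fail whenever `leakedRows(rowsSeenByTenantA, tenantAId).length` is greater than zero, and could print the offending row ids for the post-mortem timeline.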
Exercise checklist (per drill)
- Pick scenario from rotation
- Schedule 1-hour block on first Thursday of quarter
- Walk through Trigger / Detection / Mitigation / Post-Mortem above
- Time each phase against the target
- Identify ONE gap (broken alert, missing runbook entry, unclear ownership)
- Open a Linear ticket to close the gap before next quarter
- Append a row to the drill log below
Drill log
| Date | Operator | Scenario | Detect | Mitigate | Post-mortem | Gap identified |
|---|---|---|---|---|---|---|
| bootstrap | claude | (placeholder — first real drill scheduled at P4 sign-off) | TBD | TBD | TBD | none yet |
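Timing each phase against its target and appending a row to the table above can be partly mechanized. The sketch below uses the Scenario 1 targets (5 min, 15 min, 24 h) and emits a row in the drill-log column order; `scoreDrill`, `drillLogRow`, and the "met"/"missed" cell wording are illustrative assumptions, not an existing PayCraft tool.

```typescript
type Phase = "detect" | "mitigate" | "postmortem";

// Scenario 1 targets, in minutes: detect 5 min, mitigate 15 min, post-mortem 24 h.
const SCENARIO_1_TARGETS: Record<Phase, number> = {
  detect: 5,
  mitigate: 15,
  postmortem: 24 * 60
};

interface PhaseResult {
  phase: Phase;
  actualMinutes: number;
  targetMinutes: number;
  met: boolean;
}

// Compares the observed duration of each phase with its target.
function scoreDrill(
  actual: Record<Phase, number>,
  targets: Record<Phase, number>
): PhaseResult[] {
  const phases: Phase[] = ["detect", "mitigate", "postmortem"];
  return phases.map(function (phase: Phase): PhaseResult {
    return {
      phase: phase,
      actualMinutes: actual[phase],
      targetMinutes: targets[phase],
      met: actual[phase] <= targets[phase]
    };
  });
}

// Formats one table row: Date, Operator, Scenario, Detect, Mitigate, Post-mortem, Gap.
function drillLogRow(
  date: string,
  operator: string,
  scenario: string,
  results: PhaseResult[],
  gap: string
): string {
  const cells = results.map(function (r: PhaseResult): string {
    return r.actualMinutes + " min (" + (r.met ? "met" : "missed") + ")";
  });
  return "| " + [date, operator, scenario].concat(cells).concat([gap]).join(" | ") + " |";
}
```

Scenario 2 would need its own target map (4 h for mitigation per the DR runbook); the detection and post-mortem targets for the other scenarios should be filled in from the scenario text rather than assumed.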
Related
- `docs/SLA_DASHBOARD.md`: public SLA targets
- `docs/DR_RUNBOOK.md`: Scenario 2 mitigation procedure
- `dashboard/__tests__/api/rls-isolation.test.ts`: Scenario 3 CI gate
- `dashboard/lib/sentry-events.ts`: failure-mode capture surface
- GOAL.md AC44-AC48: Phase 4 acceptance criteria covering this runbook