PayCraft Incident-Response Simulation Runbook

This runbook covers Phase 4 of paycraft-v2-production-readiness: a quarterly tabletop exercise that proves the on-call (currently a single person, the founder) can detect, mitigate, and write a post-mortem for a real-world incident before customers demand it.

Cadence: Quarterly, first Thursday.
Owner: Founder (no rotation today; promote to oncall.com rotation when ARR > $50K).
Last exercised: none (bootstrap scheduled at P4 sign-off).

Three rehearsable scenarios

Run one per quarter in rotation. Each scenario is timed; targets follow the published SLA in docs/SLA_DASHBOARD.md.

Scenario 1 — Stripe webhook ingress goes 5xx

Trigger (simulated):

In a staging environment, force the stripe-webhook edge function to throw new Error("simulated") on every event.
Fire a payment_intent.succeeded test event at the endpoint with the Stripe CLI: stripe trigger payment_intent.succeeded.
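The fault injection above can be sketched as a handler guarded by a failure switch. This is a minimal illustration, not the actual stripe-webhook edge function: the `handleStripeEvent` name, the `WebhookResponse` type, and the `simulateFailure` flag are assumptions made for the example.

```typescript
// Hypothetical response shape; the real edge function's types may differ.
type WebhookResponse = { status: number; body: string };

// Hypothetical handler. `simulateFailure` stands in for whatever staging-only
// switch (env var, feature flag) forces the error during the drill.
function handleStripeEvent(eventType: string, simulateFailure: boolean): WebhookResponse {
  try {
    if (simulateFailure) {
      throw new Error("simulated");
    }
    return { status: 200, body: `processed ${eventType}` };
  } catch (err) {
    // A non-2xx response makes Stripe retry the delivery later.
    const message = err instanceof Error ? err.message : String(err);
    return { status: 500, body: message };
  }
}
```

Keeping the switch as an explicit parameter makes it hard to leave the failure path enabled by accident once the drill ends.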

Expected detection (target ≤ 5 min):

upptime probe to /api/webhooks/stripe/__ping flips RED.
Sentry shows a spike of webhook_retry events tagged provider=stripe.
Status page (manual update) goes to "Partial outage".

Expected mitigation (target ≤ 15 min):

Identify the offending code path (check the last deploy SHA in the Vercel dashboard).
Roll back via /paycraft-deploy ship from the previous good SHA, OR hotfix the bug and redeploy.
Verify upptime probe returns GREEN.
Verify Sentry retry rate drops.
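The two verification steps above can be combined into one recovery check. The sketch below assumes one sample per minute holding the probe colour and the count of webhook_retry events; the `MinuteSample` shape, the three-sample window, and the retry threshold are illustrative choices, not values taken from the SLA.

```typescript
type ProbeStatus = "GREEN" | "RED";

// Hypothetical sample: probe result plus webhook_retry count for one minute.
interface MinuteSample {
  probe: ProbeStatus;
  retryCount: number;
}

// Treat the incident as mitigated when the last `windowSize` samples are all
// GREEN and their average retry count is below `maxAvgRetries`.
function isMitigated(samples: MinuteSample[], windowSize = 3, maxAvgRetries = 1): boolean {
  if (samples.length < windowSize) {
    return false;
  }
  const recent = samples.slice(-windowSize);
  const allGreen = recent.every((s) => s.probe === "GREEN");
  const avgRetries = recent.reduce((sum, s) => sum + s.retryCount, 0) / windowSize;
  return allGreen && avgRetries < maxAvgRetries;
}
```

Requiring several consecutive GREEN samples avoids declaring recovery on a single lucky probe.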

Expected post-mortem (target ≤ 24 h):

Post-mortem written in docs/reports/postmortem-YYYYMMDD-webhook-5xx.md.
Linked from public status page.
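The post-mortem file name follows the postmortem-YYYYMMDD-slug pattern shown above. A small helper can build it from the incident date; the `postmortemPath` name is hypothetical, and using the UTC calendar date is an assumption, since the runbook does not say which time zone the stamp uses.

```typescript
// Builds docs/reports/postmortem-YYYYMMDD-<slug>.md from a date.
// Assumption: the date stamp is the UTC calendar date of the incident.
function postmortemPath(incidentDate: Date, slug: string): string {
  // toISOString() yields "YYYY-MM-DDTHH:mm:ss.sssZ"; keep the date part only.
  const stamp = incidentDate.toISOString().slice(0, 10).replace(/-/g, "");
  return `docs/reports/postmortem-${stamp}-${slug}.md`;
}
```

For example, `postmortemPath(new Date(Date.UTC(2025, 0, 2)), "webhook-5xx")` returns `docs/reports/postmortem-20250102-webhook-5xx.md`.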

Scenario 2 — Framework-supabase database is down (or wiped)

Trigger (simulated):

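Start from an empty local Supabase: supabase stop --no-backup && supabase start (a plain supabase stop keeps the local data volumes, so the restarted instance would not be empty).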
Point dashboard/.env.local at the empty local instance.
Open http://localhost:3000/dashboard and observe empty (RLS-filtered) results or 500 errors.

Expected detection (target ≤ 5 min):

/api/health returns status: error (Postgres connection refused or empty schema).
upptime flips RED on the "Health endpoint" probe.
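The two failure modes named for /api/health can be sketched as a pure decision function. This is an illustration under assumptions, not the real endpoint: the `HealthReport` shape, the "ok" value for the healthy case, and the idea of counting application tables to detect an empty schema are all hypothetical.

```typescript
type HealthStatus = "ok" | "error";

interface HealthReport {
  status: HealthStatus;
  reason?: string;
}

// Hypothetical inputs: whether a Postgres connection succeeded, and how many
// application tables were found once connected.
function buildHealthReport(dbReachable: boolean, tableCount: number): HealthReport {
  if (!dbReachable) {
    return { status: "error", reason: "postgres connection refused" };
  }
  if (tableCount === 0) {
    return { status: "error", reason: "empty schema" };
  }
  return { status: "ok" };
}
```

Reporting the reason alongside the status lets the operator tell "database down" from "database wiped" without opening logs.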

Expected mitigation (target ≤ 4 h per DR_RUNBOOK):

Follow docs/DR_RUNBOOK.md Steps 2-4.
Pull latest dump from R2.
Restore into a fresh Supabase project.
Repoint Vercel env to new project.

Expected post-mortem:

Verify the integrity of the daily backup that the restore used.
Note actual RTO observed vs target (4 h).
Append a drill row in docs/DR_RUNBOOK.md#drill-log.
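Comparing the observed RTO with the 4 h target is simple arithmetic on two timestamps. The sketch below assumes the operator records when the outage began and when service was restored; the function names and the summary wording are hypothetical.

```typescript
// Target taken from the 4 h RTO stated in this scenario.
const RTO_TARGET_HOURS = 4;

// Elapsed hours between outage start and restored service.
function observedRtoHours(outageStart: Date, serviceRestored: Date): number {
  const elapsedMs = serviceRestored.getTime() - outageStart.getTime();
  if (elapsedMs < 0) {
    throw new Error("serviceRestored precedes outageStart");
  }
  return elapsedMs / (60 * 60 * 1000);
}

// One-line summary suitable for pasting into the drill log.
function rtoSummary(outageStart: Date, serviceRestored: Date): string {
  const hours = observedRtoHours(outageStart, serviceRestored);
  const verdict = hours <= RTO_TARGET_HOURS ? "within target" : "missed target";
  return `RTO observed ${hours.toFixed(2)} h vs target ${RTO_TARGET_HOURS} h (${verdict})`;
}
```

A drill that starts at 09:00 and restores service at 12:30 yields `RTO observed 3.50 h vs target 4 h (within target)`.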

Scenario 3 — Suspected tenant data leak (cross-tenant access)

Trigger (simulated):

In staging, use the SQL editor to deliberately insert a row into tenant_products with the wrong tenant_id (tenant B's rather than tenant A's), bypassing the application's normal write path.
Sign in as tenant A; navigate to /products.
Observe whether RLS hides the row. It should: the tenant_products select policy filters on tenant_id = current_setting('app.tenant_id').
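The behaviour the drill checks can be modelled in memory. The sketch below is not the Postgres policy itself, only a TypeScript stand-in for the predicate tenant_id = current_setting('app.tenant_id'); the `TenantProduct` shape and the tenant identifiers are hypothetical.

```typescript
interface TenantProduct {
  id: number;
  tenantId: string;
  name: string;
}

// In-memory model of the select policy: a row is visible only when its
// tenantId matches the tenant bound to the current session.
function selectVisibleProducts(rows: TenantProduct[], currentTenantId: string): TenantProduct[] {
  return rows.filter((row) => row.tenantId === currentTenantId);
}

// Illustrative data: the third row is the deliberately misplaced one.
const sampleProducts: TenantProduct[] = [
  { id: 1, tenantId: "tenant-a", name: "Starter plan" },
  { id: 2, tenantId: "tenant-a", name: "Pro plan" },
  { id: 3, tenantId: "tenant-b", name: "Injected row" },
];

const visibleToTenantA = selectVisibleProducts(sampleProducts, "tenant-a");
```

If the real policy behaves like this model, tenant A's /products page lists rows 1 and 2 and never row 3.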

Expected detection (target: immediate):

Tenant A's /products page renders correctly — does NOT include tenant B's products.
dashboard/__tests__/api/rls-isolation.test.ts PASSES (CI gate).

Expected mitigation:

If RLS did leak: file a SEV-1, freeze deploys, audit recent migrations touching tenant_products RLS, revoke any BYPASSRLS grants.
Email all tenants within 72 h per GDPR / DPA Section 7.
Rotate every tenant's API keys (forced via rotate_api_key for all).
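A forced rotation across all tenants should record which tenants failed, so that none is silently skipped. In the sketch below the `rotateKey` callback is a stand-in for rotate_api_key, whose real signature is not shown in this runbook; it is modelled as synchronous for brevity, and the real call is likely asynchronous.

```typescript
// Stand-in for the real rotate_api_key call (signature assumed).
type RotateKey = (tenantId: string) => boolean;

interface RotationReport {
  rotated: string[];
  failed: string[];
}

// Attempts rotation for every tenant and never stops at the first failure.
function rotateAllTenantKeys(tenantIds: string[], rotateKey: RotateKey): RotationReport {
  const report: RotationReport = { rotated: [], failed: [] };
  for (const tenantId of tenantIds) {
    let ok = false;
    try {
      ok = rotateKey(tenantId);
    } catch {
      ok = false; // a thrown error counts as a failed rotation
    }
    if (ok) {
      report.rotated.push(tenantId);
    } else {
      report.failed.push(tenantId);
    }
  }
  return report;
}
```

The `failed` list gives the operator a concrete retry queue to clear before closing the SEV-1.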

Expected post-mortem:

Public post-mortem (no PII, just timeline + root cause + fix).
Add a regression test to dashboard/__tests__/api/rls-isolation.test.ts.

Exercise checklist (per drill)

Pick scenario from rotation
Schedule 1-hour block on first Thursday of quarter
Walk through Trigger / Detection / Mitigation / Post-Mortem above
Time each phase against the target
Identify ONE gap (broken alert, missing runbook entry, unclear ownership)
Open a Linear ticket to close the gap before next quarter
Append a row to the drill log below
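Timing each phase against its target can be reduced to a small comparison. The sketch below uses the Scenario 1 targets from this runbook (5 min, 15 min, 24 h) expressed in minutes; the `Phase` type and function name are hypothetical, and other scenarios would need their own target table.

```typescript
type Phase = "detect" | "mitigate" | "postmortem";

// Scenario 1 targets from this runbook, in minutes.
const SCENARIO_1_TARGETS: Record<Phase, number> = {
  detect: 5,
  mitigate: 15,
  postmortem: 24 * 60,
};

// Returns the phases whose observed duration exceeded the target.
function phasesOverTarget(
  observedMinutes: Record<Phase, number>,
  targetMinutes: Record<Phase, number>,
): Phase[] {
  const phases: Phase[] = ["detect", "mitigate", "postmortem"];
  return phases.filter((phase) => observedMinutes[phase] > targetMinutes[phase]);
}
```

Any phase returned here is a natural candidate for the ONE gap the checklist asks the operator to identify.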

Drill log

Date	Operator	Scenario	Detect	Mitigate	Post-mortem	Gap identified
bootstrap	claude	(placeholder — first real drill scheduled at P4 sign-off)	TBD	TBD	TBD	none yet
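A new log row can be generated rather than typed by hand. The sketch below assumes the log stays tab-separated with the seven columns shown in the header above; the `DrillEntry` type is hypothetical, and the values in any example call are placeholders, not real drill results.

```typescript
// One drill-log row, in the column order of the header above.
interface DrillEntry {
  date: string;
  operator: string;
  scenario: string;
  detect: string;
  mitigate: string;
  postmortem: string;
  gap: string;
}

// Joins the fields with tabs so the row lines up with the existing table.
function formatDrillRow(entry: DrillEntry): string {
  return [
    entry.date,
    entry.operator,
    entry.scenario,
    entry.detect,
    entry.mitigate,
    entry.postmortem,
    entry.gap,
  ].join("\t");
}
```

Keeping the field order in one function means a future column change only has to be made in one place.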

Related

docs/SLA_DASHBOARD.md: public SLA targets
docs/DR_RUNBOOK.md: Scenario 2 mitigation procedure
dashboard/__tests__/api/rls-isolation.test.ts: Scenario 3 CI gate
dashboard/lib/sentry-events.ts: failure-mode capture surface
GOAL.md AC44-AC48: Phase 4 acceptance criteria covering this runbook
