Production Launch Readiness

The operational gate that flips the platform from "F11 hardening complete" to "real clinic, real patients, real revenue." This is distinct from F11: F11 is the technical/regulatory feature set (features.md → F11); this document is the operational checklist that sequences the cutover and confirms every external dependency, every runbook, and every sign-off is in place.

This is not a feature spec. It's a gate document — a living checklist that fills in over the months between foundation closing and the first paying clinic going live.

The three gates, in order:
1. Foundation gate (1E.3): staging deployed, foundation acceptance test passes against AWS staging. No real patients. (foundation.md)
2. Layer 2 / F-tier feature build-out: F1–F12 features built on top of the substrate. (features.md)
3. Production launch readiness: this doc. The operational gate before real clinics see real production.

See decisions.md for the architectural decisions feeding this gate.

Acceptance criteria

The platform is ready for the first paying clinic when all of the following are true:

Foundation 1E.3 closed (staging gate passed)
All F-tier features required at launch are merged, tested, deployed to staging, and validated end-to-end
F11 hardening complete (technical + regulatory features shipped, production environment provisioned and load-tested)
All pre-launch wiring tasks below complete
Romanian regulatory counsel has signed off on F11.0.5 findings
Legacy-data migration runbook executed in dry-run against a copy of the legacy database
Incident response on-call rotation exists and has been tested
Sign-off list below is fully checked

If any one is missing, the platform is not launching today.

Pre-cutover gates

Grouped by area. Each must be green before the cutover runbook executes.

Foundation + features

[ ] Foundation 1E.3 closed and locked (apps/docs/implementation-plan/foundation.md shows all checkboxes checked)
[ ] F1–F8 features that are part of launch scope are merged, tested, and deployed to staging
[ ] F9 telerehab is in launch scope OR explicitly deferred — confirm decision
[ ] F10 Telemetry service is deployed if F9 ships at launch (telerehab depends on telemetry; see telemetry/index.md)
[ ] F11 technical hardening complete: GDPR DSAR endpoint works end-to-end, prod KMS rotation tested, security scan passes, performance benchmarks recorded against staging-shape production environment

Infrastructure

[ ] Production AWS environment provisioned via Terraform (infra/envs/production) and idempotent
[ ] RDS Multi-AZ working; failover tested with a forced failover (for example, a reboot with failover) rather than a plain restart
[ ] pgbouncer Multi-AZ working; one task can drain without disruption
[ ] ALB + Cloudflare end-to-end (HTTPS, WAF active, custom-hostname provisioning verified)
[ ] All scheduled tasks (audit-partition-roll, usage-quota-reset, usage-summary-rollup, check-providers) have run successfully at least once in production
[ ] CloudWatch alarms configured per monitoring.md, routing to SNS → Slack
[ ] AWS Budgets alarms set at 50% / 80% / 100% of monthly budget

External services + sub-processors

[ ] Sentry — org created, projects per service (Core API + 3 Next.js apps + Telemetry if F10 launches), source-map upload step in CI working, release tracking by image SHA, alert routing to Slack tested
[ ] Cloudflare — Pro plan active, Cloudflare for SaaS configured, custom-hostnames API token in Secrets Manager, end-to-end test of clinic registering a custom domain
[ ] Bunny CDN — account active in EU region, DPA reviewed and signed, Bunny Stream library set up, admin upload workflow tested with reference exercise videos, signed playback URLs working from Patient Portal
[ ] Clerk — production-mode keys (not test keys), HIPAA-eligible plan if applicable, BAA signed, webhook endpoint working
[ ] Daily.co — production keys, HIPAA-eligible plan if F5 telerehab video is in scope, BAA signed
[ ] Anthropic — API keys provisioned for AI agent capabilities (foundation 1C.8); per-org budget controls active via 1C.7 metering; AI agent service shape decided and documented (foundation memory: AI agent runtime is still an open scoping question — close before agents go live)
[ ] AWS SES (infra) — production identity verified, DKIM + SPF + DMARC configured, sandbox exit confirmed, account-level suppression list active, SES configuration set with bounce + complaint event destinations pointing at an SNS topic
[ ] AWS SES (app-layer code, gap as of 2026-05-10) — the Core API has no bounce/complaint webhook handler today and no suppression table. Build before launch:
- SNS topic restartix-prod-ses-feedback subscribed to SES bounce + complaint events; HTTPS subscription points at a new public Core API endpoint with SNS signature verification
- Migration adding a notification_suppression table (recipient, reason: hard_bounce / complaint / manual, suppressed_at, source event ID) — RLS restricts reads to platform-admin (suppression is a platform concern, not per-org)
- Webhook handler at POST /v1/internal/ses-feedback that verifies the SNS signature, parses the SES event payload, inserts into notification_suppression
- EmailChannel precheck in internal/core/notify/: before dispatch, query suppression by recipient address; if present, mark the notification dead-lettered with dead_letter_reason='suppressed' rather than calling SES
- migrations/core/000010_notifications.up.sql deferred-list line about "Bounce / complaint webhooks + suppression list automation" gets ticked here
[ ] All sub-processors disclosed in the platform DPA template (1B.10) and the Romanian-localized version
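
The SES app-layer gap above (feedback handler plus EmailChannel precheck) can be sketched as follows. This is a minimal, language-agnostic illustration written in Python, not the Core API implementation. It assumes the SNS signature has already been verified, that the SES payload follows the standard bounce/complaint JSON shape, that only permanent bounces count as hard_bounce, and that addresses are normalised to lowercase. The names `Suppression`, `suppressions_from_sns`, and `precheck` are hypothetical, as is the choice of the SES `mail.messageId` as the source event ID.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Suppression:
    """One row destined for notification_suppression (hypothetical in-memory shape)."""
    recipient: str
    reason: str  # "hard_bounce" | "complaint" | "manual"
    suppressed_at: datetime
    source_event_id: str


def suppressions_from_sns(envelope: dict, now: Optional[datetime] = None) -> list[Suppression]:
    """Turn one SNS notification (signature already verified) into suppression rows."""
    now = now or datetime.now(timezone.utc)
    # SNS delivers the SES payload as a JSON string in the "Message" field.
    event = json.loads(envelope["Message"])
    # Configuration-set event publishing uses "eventType";
    # identity-level notifications use "notificationType".
    kind = event.get("eventType") or event.get("notificationType")
    # Assumption: the SES message ID serves as the source event ID.
    event_id = event.get("mail", {}).get("messageId", "")

    if kind == "Bounce":
        bounce = event["bounce"]
        if bounce.get("bounceType") != "Permanent":
            return []  # assumption: transient (soft) bounces are not suppressed
        recipients, reason = bounce.get("bouncedRecipients", []), "hard_bounce"
    elif kind == "Complaint":
        recipients = event["complaint"].get("complainedRecipients", [])
        reason = "complaint"
    else:
        return []  # deliveries, opens, etc. are ignored

    return [
        Suppression(r["emailAddress"].strip().lower(), reason, now, event_id)
        for r in recipients
    ]


def precheck(recipient: str, suppressed: set[str]) -> Optional[str]:
    """Return the dead-letter reason if the address is suppressed, else None."""
    return "suppressed" if recipient.strip().lower() in suppressed else None
```

In the real handler, `suppressed` would be a lookup against the notification_suppression table rather than an in-memory set, and a non-None result from `precheck` would mark the notification dead-lettered instead of calling SES.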

Compliance + legal

[ ] F11.0.5 Romanian compliance pass — counsel engaged, full findings documented, privacy notice template (1B.10) and DPA revised, ANSPDCP enforcement scan complete
[ ] Sub-processor list published to a public-facing page on the platform website per GDPR Art. 28 transparency
[ ] DPA template ready for clinic onboarding — countersigned versions stored per clinic
[ ] MDR Class I posture confirmed by regulatory counsel (or upgraded to Class IIa with appropriate process changes — see CLAUDE.md → Medical Device Readiness)
[ ] AWS BAA accepted via AWS Artifact (free; HIPAA-eligibility on the AWS account)
[ ] Legacy product DPA termination plan: define the data handover and the termination notice sent to current users when the legacy product shuts down

Data migration

[ ] Legacy migration runbook in deployment.md → Runbook: launch-day legacy-data migration executed end-to-end as a dry run against a copy of the legacy database
[ ] Row-count and integrity validation queries documented and known to pass against the dry-run output
[ ] Legacy passwords (if migrating user accounts) — confirm Clerk's password import path or force-reset-on-first-login flow
[ ] Patient consent re-acquisition flow ready — legacy consents may not satisfy the new consent ledger schema; per-purpose re-consent on first login if needed
[ ] Rollback plan validated: restore from the "pre-launch" RDS snapshot (or PITR to the pre-launch timestamp) completes in <1h
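
The row-count validation above can be reduced to a small comparison that blocks on any mismatch. The sketch below is illustrative only: the function names are hypothetical, and it assumes the expected targets and the post-import counts have already been collected into dictionaries keyed by table name (the real validation queries live in the runbook).

```python
from __future__ import annotations


def count_mismatches(
    expected: dict[str, int], actual: dict[str, int]
) -> dict[str, tuple[int, int | None]]:
    """Return {table: (expected, actual)} for every table whose count differs.

    `actual` is None for a table that is missing from the import output.
    """
    return {
        table: (want, actual.get(table))
        for table, want in expected.items()
        if actual.get(table) != want
    }


def assert_counts_match(expected: dict[str, int], actual: dict[str, int]) -> None:
    """Raise if any table's row count differs from its expected target."""
    mismatches = count_mismatches(expected, actual)
    if mismatches:
        details = ", ".join(
            f"{table}: expected {want}, got {got}"
            for table, (want, got) in sorted(mismatches.items())
        )
        raise RuntimeError(f"row-count validation failed: {details}")
```

Expected targets need not equal the raw legacy counts: if the transform pipeline deduplicates or drops rows, the target is the post-transform number, which is why the two dictionaries are kept separate.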

Operational readiness

[ ] On-call rotation documented and committed to (PagerDuty / Slack alerts / phone tree)
[ ] Incident response playbook in monitoring.md → Incident Response Procedures reviewed by all on-call engineers
[ ] Synthetic incident drill — chaos test from monitoring.md executed in staging; alerts fired correctly; on-call responded within target time
[ ] Status page: public status page configured (statuspage.io / similar), updated automatically from CloudWatch alarms or manually
[ ] Support escalation path — first-line clinic support routes to a human; engineering escalation path defined; severity levels with response-time targets
[ ] [email protected] mailbox provisioned — referenced in the break-glass email template (break_glass_opened.email.{en,ro}.tmpl). The template tells the recipient to contact support if anything looks wrong; without a working mailbox at that address, the email is misleading. Set up via Google Workspace / Microsoft 365 / Fastmail / similar with MX records pointing at the chosen provider. Same provisioning thread can land [email protected] (send-only, no inbox needed) and any other addresses the platform uses.
[ ] Cloudflare WAF coverage decision for clinic custom domains — by default, Cloudflare WAF rules on the restartix.pro zone do NOT extend to clinic custom hostnames (e.g. physio-bucharest.ro) via Cloudflare for SaaS. DDoS protection is always-on; L7 WAF is not. Three options at launch: (a) subscribe to Cloudflare's "WAF for SaaS" add-on so zone WAF rules extend to all custom hostnames, (b) reintroduce AWS WAF on the ALB to cover L7 attacks at the origin (reverses the "Cloudflare-only WAF" decision in decisions.md), or (c) accept the gap and rely on application-layer defenses. Decision needs to land before the first clinic with a custom domain handles real patient traffic.
[ ] Customer success runbook — what does first-clinic onboarding look like? Manual handholding for the first ~5 clinics, then standardized
[ ] Documentation portal for clinics — admin-facing how-to guides for setup-a-clinic, manage-staff, configure-billing, etc.

Sales + commercial readiness

[ ] First-clinic contract signed and includes the platform's standard MSA + DPA
[ ] Pricing locked for the launch tier (shared-mode default per tenant-isolation.md)
[ ] Billing flow tested end-to-end — clinic onboarded, subscription created, first invoice generated (manual via FGO at launch is fine; F12 engine ships later)
[ ] Patient consent flow at launch tested — patient signs up, accepts platform + clinic consents, onboards into the clinic, can withdraw consents granularly

Cutover runbook

Day-of sequence. Estimate: 2–6h depending on legacy-data migration size.

T-7 days

[ ] Communicate cutover schedule to first clinic (start time, expected duration, what they need to do)
[ ] Take a manual RDS snapshot of the (empty) production DB as a known-clean starting point
[ ] Confirm all sub-processor health (Cloudflare, Sentry, Bunny, Clerk, Daily.co, SES, Anthropic)
[ ] Final dry-run of the legacy migration against a copy

T-1 day

[ ] Final go/no-go review with everyone on the sign-off list below
[ ] Confirm on-call rotation knows it's "live tomorrow"
[ ] Pre-warm Cloudflare cache rules; verify edge cert + custom-hostname for the first clinic

T-0 (cutover day)

1. Maintenance mode on legacy: return 503 with a branded maintenance page; communicate to legacy users
2. Snapshot legacy database: full pg_dump from the legacy host, verified
3. Pull legacy dump to a workstation with the operations IAM role
4. Run data-transform pipeline (services/migration-tools/legacy-import per the runbook in deployment.md)
5. Validate row counts against expected targets; block on any mismatch
6. Spot-check a few real legacy users in the new system: sign in, verify profile, see expected appointments
7. Take a "post-import" RDS snapshot as a known-good launch state
8. Remove maintenance mode, route DNS to production
9. Synthetic acceptance test: run a small canary script against production (sign in, list orgs, list specialists, create a test appointment, confirm it persists)
10. Notify first clinic: they can start onboarding their staff and patients
11. Watch monitoring for 24h: on-call active, alarm channels open, dashboards visible
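
The synthetic acceptance test in the sequence above can be structured as an ordered list of steps that stops at the first failure. The following is a sketch under stated assumptions, not the actual canary script: `api` stands for a hypothetical client object, and its methods (`sign_in`, `list_orgs`, `list_specialists`, `create_appointment`, `get_appointment`) are placeholders for whatever the Core API client really exposes.

```python
from __future__ import annotations

from typing import Any, Callable


def require(condition: Any, message: str) -> None:
    """Fail the current step with a readable message."""
    if not condition:
        raise RuntimeError(message)


def canary_steps(api: Any) -> list[tuple[str, Callable[[dict], None]]]:
    """Build the canary sequence against a hypothetical API client."""

    def sign_in(ctx: dict) -> None:
        ctx["token"] = api.sign_in()
        require(ctx["token"], "no session token returned")

    def list_orgs(ctx: dict) -> None:
        orgs = api.list_orgs(ctx["token"])
        require(orgs, "no organisations visible")
        ctx["org"] = orgs[0]

    def list_specialists(ctx: dict) -> None:
        specialists = api.list_specialists(ctx["token"], ctx["org"])
        require(specialists, "no specialists visible")
        ctx["specialist"] = specialists[0]

    def create_appointment(ctx: dict) -> None:
        ctx["appt"] = api.create_appointment(ctx["token"], ctx["org"], ctx["specialist"])
        require(ctx["appt"], "no appointment ID returned")

    def confirm_persisted(ctx: dict) -> None:
        found = api.get_appointment(ctx["token"], ctx["appt"])
        require(found is not None, "appointment not found after create")

    return [
        ("sign in", sign_in),
        ("list orgs", list_orgs),
        ("list specialists", list_specialists),
        ("create test appointment", create_appointment),
        ("confirm appointment persists", confirm_persisted),
    ]


def run_canary(steps: list[tuple[str, Callable[[dict], None]]]) -> list[tuple[str, str]]:
    """Run steps in order with a shared context; stop at the first failure."""
    ctx: dict = {}
    results: list[tuple[str, str]] = []
    for name, step in steps:
        try:
            step(ctx)
        except Exception as exc:  # any failure fails the canary
            results.append((name, f"FAIL: {exc}"))
            break
        results.append((name, "ok"))
    return results
```

Stopping at the first failure keeps the output unambiguous on cutover day: the last entry in the result list is either the final step passing or the exact step that failed. The test appointment should be created in a way that can be cleaned up afterwards; that cleanup is left out of the sketch.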

T+24h, T+1 week, T+1 month

[ ] T+24h: review error rate, p99 latency, failed-deploy count, and alarm-noise volume; if any of these is concerning, pause new clinic onboarding until resolved
[ ] T+1 week: post-launch retrospective with the team; what surprised us, what we'd do differently, what runbooks need updating
[ ] T+1 month: first paid invoice cycle complete (clinic charged successfully, AI cost roll-up correct, no dunning surprises)

Post-cutover monitoring (first 30 days)

What to watch, who watches, what triggers action.

| Signal | Where | Threshold | Action |
| --- | --- | --- | --- |
| 5xx error rate | CloudWatch alarm `restartix-production-core-api-5xx` | > 0.5% sustained 10m | On-call investigation |
| p99 latency | CloudWatch (Core API target group) | > 2s sustained 10m | On-call investigation |
| RDS connection saturation | CloudWatch | > 80% of `max_connections` | Capacity review |
| RDS replica lag (when read replicas exist) | CloudWatch | > 5s sustained | Replica health check |
| Clerk auth failures | Clerk dashboard + Sentry | Spike vs. baseline | Possible auth incident |
| Sentry new-error rate | Sentry | New issue class with high volume | Triage same-day |
| Bunny CDN delivery errors | Bunny dashboard | > 1% sustained | CDN health check |
| AWS spend trajectory | AWS Cost Explorer + Budgets | > forecast for the month | Cost review |
| Daily backup status | CloudWatch | Missed backup or checksum mismatch | Critical: investigate immediately |
| Audit-partition-roll cron | CloudWatch | Failure | Critical: ensure next-month partitions exist |

The full alarm catalogue lives in monitoring.md. This table is the post-launch focused subset.
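
To make "sustained 10m" concrete: a threshold counts as breached only when every datapoint in the most recent window exceeds it. The sketch below illustrates that rule in isolation. It assumes one datapoint per minute (so 10 minutes is 10 periods) and uses a hypothetical function name; in production this evaluation is done by the CloudWatch alarm itself, not by application code.

```python
from __future__ import annotations


def sustained_breach(datapoints: list[float], threshold: float, periods: int) -> bool:
    """True when the most recent `periods` datapoints all exceed `threshold`.

    `datapoints` is ordered oldest to newest. Too few datapoints means the
    window cannot be evaluated, which is treated here as "not breached".
    """
    if periods <= 0:
        raise ValueError("periods must be positive")
    if len(datapoints) < periods:
        return False
    return all(value > threshold for value in datapoints[-periods:])
```

For the 5xx row, `sustained_breach(error_rates_pct, 0.5, 10)` fires only after ten consecutive one-minute datapoints above 0.5%; a single spike followed by recovery does not page anyone.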

Sign-off list

Before the cutover runbook executes, every signature here is required:

| Sign-off | Owner | Confirms |
| --- | --- | --- |
| Engineering | Tech lead | F11 hardening complete; production environment validated; backup posture verified; incident response playbook reviewed |
| Regulatory / Compliance | Romanian counsel (F11.0.5) | Privacy notice + DPA templates approved; MDR class confirmed; data residency confirmed; sub-processor list published |
| Operations | On-call lead | On-call rotation in place; status page live; support escalation defined; chaos drill complete |
| Customer Success | First-clinic onboarding lead | First clinic ready; onboarding runbook tested; documentation portal usable |
| Commercial | Founder / business owner | First-clinic contract signed; pricing locked; billing flow tested |

What's deliberately not in this gate

To keep the gate honest about what blocks launch vs. what's nice-to-have:

F12 Billing engine — not a launch blocker. First clinic gets manually-cut FGO invoices until the engine ships (features.md → F12).
F13 Dedicated tenancy mode — deferred until first paying dedicated-mode contract.
Multi-region / data residency per-tenant — out of scope per CLAUDE.md.
Mobile apps — open decision in features.md; web-only is acceptable for launch.
Customer-managed KMS — Phase 1 uses AWS-managed; CMK migration triggers documented in aws-infrastructure.md → Customer-managed KMS migration path.
Datadog APM — CloudWatch + Sentry covers the launch; Datadog deferred until traffic + team scale justify it.
Cross-region S3 backup replication — Layer 3 of backup-disaster-recovery.md deferred to within first quarter post-launch.
Phase 2 read replicas — added when triggers in scaling-architecture.md → Lever 5 fire.

Related documentation

implementation-plan.md — master plan
foundation.md — foundation 1A–1E
features.md — F1–F12 + F13
aws-infrastructure.md — full topology and cost
iac-layout.md — Terraform module structure
deployment.md — CI/CD pipeline + runbooks (including legacy-data migration)
scaling-architecture.md — connection math + scaling levers
monitoring.md — alarms + incident response
backup-disaster-recovery.md — RPO/RTO + DR drills
decisions.md — architectural rationale
external-providers.md — sub-processor list
