Production Launch Readiness
The operational gate that flips the platform from "F11 hardening complete" to "real clinic, real patients, real revenue." This is distinct from F11: F11 is the technical/regulatory feature set (features.md → F11); this document is the operational checklist that sequences the cutover and confirms every external dependency, every runbook, and every sign-off is in place.
This is not a feature spec. It's a gate document — a living checklist that fills in over the months between foundation closing and the first paying clinic going live.
The three gates, in order:
- Foundation gate (1E.3) — staging deployed, foundation acceptance test passes against AWS staging. No real patients. (foundation.md)
- Layer 2 / F-tier feature build-out — F1–F12 features built on top of the substrate. (features.md)
- Production launch readiness — this doc. The operational gate before real clinics see real production.
See decisions.md for the architectural decisions feeding this gate.
Acceptance criteria
The platform is ready for first paying clinic when all of the following are true:
- Foundation 1E.3 closed (staging gate passed)
- All F-tier features required at launch are merged, tested, deployed to staging, and validated end-to-end
- F11 hardening complete (technical + regulatory features shipped, production environment provisioned and load-tested)
- All pre-launch wiring tasks below complete
- Romanian regulatory counsel has signed off on F11.0.5 findings
- Legacy-data migration runbook executed in dry-run against a copy of the legacy database
- Incident response on-call rotation exists and has been tested
- Sign-off list below is fully checked
If any one is missing, the platform is not launching today.
Pre-cutover gates
Grouped by area. Each must be green before the cutover runbook executes.
Foundation + features
- [ ] Foundation 1E.3 closed and locked (`apps/docs/implementation-plan/foundation.md` shows all checkboxes checked)
- [ ] F1–F8 features that are part of launch scope are merged, tested, and deployed to staging
- [ ] F9 telerehab is in launch scope OR explicitly deferred — confirm decision
- [ ] F10 Telemetry service is deployed if F9 ships at launch (telerehab depends on telemetry; see telemetry/index.md)
- [ ] F11 technical hardening complete: GDPR DSAR endpoint works end-to-end, prod KMS rotation tested, security scan passes, performance benchmarks recorded against staging-shape production environment
Infrastructure
- [ ] Production AWS environment provisioned via Terraform (`infra/envs/production`) and idempotent
- [ ] RDS Multi-AZ working; failover tested with a synthetic restart
- [ ] pgbouncer Multi-AZ working; one task can drain without disruption
- [ ] ALB + Cloudflare end-to-end (HTTPS, WAF active, custom-hostname provisioning verified)
- [ ] All scheduled tasks (`audit-partition-roll`, `usage-quota-reset`, `usage-summary-rollup`, `check-providers`) have run successfully at least once in production
- [ ] CloudWatch alarms configured per monitoring.md, routing to SNS → Slack
- [ ] AWS Budgets alarms set at 50% / 80% / 100% of monthly budget
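The 50% / 80% / 100% budget thresholds above amount to simple ratio checks. The sketch below is only an illustration of that arithmetic, not how AWS Budgets evaluates alerts internally; the function name and the budget figures in the usage note are hypothetical.

```python
from decimal import Decimal

# Thresholds from the checklist: 50% / 80% / 100% of the monthly budget.
THRESHOLDS = (Decimal("0.50"), Decimal("0.80"), Decimal("1.00"))


def crossed_thresholds(monthly_budget: Decimal, spend_to_date: Decimal) -> list[str]:
    """Return the labels of every threshold the current spend has reached."""
    if monthly_budget <= 0:
        raise ValueError("monthly_budget must be positive")
    ratio = spend_to_date / monthly_budget
    return [f"{int(t * 100)}%" for t in THRESHOLDS if ratio >= t]
```

For example, with a hypothetical budget of 1000 and spend of 812.40, `crossed_thresholds(Decimal("1000"), Decimal("812.40"))` reports the 50% and 80% thresholds as crossed and the 100% threshold as not yet reached.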
External services + sub-processors
- [ ] Sentry — org created, projects per service (Core API + 3 Next.js apps + Telemetry if F10 launches), source-map upload step in CI working, release tracking by image SHA, alert routing to Slack tested
- [ ] Cloudflare — Pro plan active, Cloudflare for SaaS configured, custom-hostnames API token in Secrets Manager, end-to-end test of clinic registering a custom domain
- [ ] Bunny CDN — account active in EU region, DPA reviewed and signed, Bunny Stream library set up, admin upload workflow tested with reference exercise videos, signed playback URLs working from Patient Portal
- [ ] Clerk — production-mode keys (not test keys), HIPAA-eligible plan if applicable, BAA signed, webhook endpoint working
- [ ] Daily.co — production keys, HIPAA-eligible plan if F5 telerehab video is in scope, BAA signed
- [ ] Anthropic — API keys provisioned for AI agent capabilities (foundation 1C.8); per-org budget controls active via 1C.7 metering; AI agent service shape decided and documented (foundation memory: AI agent runtime is still an open scoping question — close before agents go live)
- [ ] AWS SES (infra) — production identity verified, DKIM + SPF + DMARC configured, sandbox exit confirmed, account-level suppression list active, SES configuration set with bounce + complaint event destinations pointing at an SNS topic
- [ ] AWS SES (app-layer code, gap as of 2026-05-10): the Core API has no bounce/complaint webhook handler today and no suppression table. Build before launch:
  - SNS topic `restartix-prod-ses-feedback` subscribed to SES bounce + complaint events; HTTPS subscription points at a new public Core API endpoint with SNS signature verification
  - Migration adding a `notification_suppression` table (recipient, reason: `hard_bounce` / `complaint` / `manual`, `suppressed_at`, source event ID); RLS restricts reads to platform-admin (suppression is a platform concern, not per-org)
  - Webhook handler at `POST /v1/internal/ses-feedback` that verifies the SNS signature, parses the SES event payload, inserts into `notification_suppression`
  - `EmailChannel` precheck in `internal/core/notify/`: before dispatch, query suppression by recipient address; if present, mark the notification dead-lettered with `dead_letter_reason='suppressed'` rather than calling SES
  - `migrations/core/000010_notifications.up.sql` deferred-list line about "Bounce / complaint webhooks + suppression list automation" gets ticked here
- [ ] All sub-processors disclosed in the platform DPA template (1B.10) and the Romanian-localized version
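The SES feedback items above (parse the event, record a suppression, precheck before dispatch) can be sketched as follows. This is a minimal illustration in Python rather than the Core API's own code. It assumes the SNS envelope carries the SES event as a JSON string in `Message`, and it reads the event kind from either `eventType` or `notificationType`. SNS signature verification is deliberately omitted and must happen before any of this runs. The function names, the `source_event_id` key, the lower-casing of addresses, and the choice to ignore non-permanent bounces are all assumptions made for the sketch.

```python
import json
from typing import Optional


def suppression_rows(sns_body: str) -> list[dict]:
    """Turn one SNS-delivered SES event into rows for notification_suppression.

    Assumes the SNS signature was already verified by the caller.
    """
    envelope = json.loads(sns_body)
    if envelope.get("Type") != "Notification":
        return []  # e.g. SubscriptionConfirmation is handled elsewhere
    event = json.loads(envelope["Message"])
    kind = event.get("eventType") or event.get("notificationType")

    if kind == "Bounce":
        bounce = event["bounce"]
        if bounce.get("bounceType") != "Permanent":
            return []  # assumption: only hard bounces are suppressed
        recipients = [r["emailAddress"] for r in bounce["bouncedRecipients"]]
        reason, event_id = "hard_bounce", bounce.get("feedbackId")
    elif kind == "Complaint":
        complaint = event["complaint"]
        recipients = [r["emailAddress"] for r in complaint["complainedRecipients"]]
        reason, event_id = "complaint", complaint.get("feedbackId")
    else:
        return []

    return [
        {"recipient": addr.strip().lower(), "reason": reason, "source_event_id": event_id}
        for addr in recipients
    ]


def precheck(recipient: str, suppressed: set[str]) -> Optional[str]:
    """Return the dead-letter reason if the recipient is suppressed, else None."""
    return "suppressed" if recipient.strip().lower() in suppressed else None
```

In the real `EmailChannel`, the `suppressed` set would be a lookup against `notification_suppression`; a non-`None` result means the notification is dead-lettered instead of being handed to SES.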
Compliance + legal
- [ ] F11.0.5 Romanian compliance pass — counsel engaged, full findings documented, privacy notice template (1B.10) and DPA revised, ANSPDCP enforcement scan complete
- [ ] Sub-processor list published to a public-facing page on the platform website per GDPR Art. 28 transparency
- [ ] DPA template ready for clinic onboarding — countersigned versions stored per clinic
- [ ] MDR Class I posture confirmed by regulatory counsel (or upgraded to Class IIa with appropriate process changes — see CLAUDE.md → Medical Device Readiness)
- [ ] AWS BAA accepted via AWS Artifact (free; HIPAA-eligibility on the AWS account)
- [ ] Legacy product DPA termination plan: when the legacy product shuts down, define the data-handover process and the termination notice sent to its current users
Data migration
- [ ] Legacy migration runbook in deployment.md → Runbook: launch-day legacy-data migration executed end-to-end as a dry run against a copy of the legacy database
- [ ] Row-count and integrity validation queries documented and known to pass against the dry-run output
- [ ] Legacy passwords (if migrating user accounts) — confirm Clerk's password import path or force-reset-on-first-login flow
- [ ] Patient consent re-acquisition flow ready — legacy consents may not satisfy the new consent ledger schema; per-purpose re-consent on first login if needed
- [ ] Rollback plan validated: a restore from the "pre-launch" RDS snapshot (or PITR to the pre-launch timestamp) completes in <1h
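The row-count validation in the list above reduces to comparing an expected count per table against what landed after import. The sketch below is a hedged illustration of that comparison only; the real queries live in the runbook, and the table names in the usage note are hypothetical.

```python
from typing import Optional


def row_count_mismatches(
    expected: dict[str, int], actual: dict[str, int]
) -> dict[str, tuple[Optional[int], Optional[int]]]:
    """Return {table: (expected, actual)} for every table whose counts differ.

    A table missing on either side is reported with None for that side.
    """
    mismatches = {}
    for table in sorted(set(expected) | set(actual)):
        want, got = expected.get(table), actual.get(table)
        if want != got:
            mismatches[table] = (want, got)
    return mismatches
```

An empty result means the dry run may proceed. With hypothetical counts such as `{"patients": 1200, "appointments": 5400}` expected and `{"patients": 1200, "appointments": 5398}` actual, the function flags `appointments` and the migration should stop until the gap is explained.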
Operational readiness
- [ ] On-call rotation documented and committed to (PagerDuty / Slack alerts / phone tree)
- [ ] Incident response playbook in monitoring.md → Incident Response Procedures reviewed by all on-call engineers
- [ ] Synthetic incident drill — chaos test from monitoring.md executed in staging; alerts fired correctly; on-call responded within target time
- [ ] Status page — public status page configured (statuspage.io / similar), automated by CloudWatch alarms or manual updates
- [ ] Support escalation path — first-line clinic support routes to a human; engineering escalation path defined; severity levels with response-time targets
- [ ] `[email protected]` mailbox provisioned: referenced in the break-glass email template (`break_glass_opened.email.{en,ro}.tmpl`). The template tells the recipient to contact support if anything looks wrong; without a working mailbox at that address, the email is misleading. Set up via Google Workspace / Microsoft 365 / Fastmail / similar with MX records pointing at the chosen provider. Same provisioning thread can land `[email protected]` (send-only, no inbox needed) and any other addresses the platform uses.
- [ ] Cloudflare WAF coverage decision for clinic custom domains: by default, Cloudflare WAF rules on the `restartix.pro` zone do NOT extend to clinic custom hostnames (e.g. `physio-bucharest.ro`) via Cloudflare for SaaS. DDoS protection is always-on; L7 WAF is not. Three options at launch: (a) subscribe to Cloudflare's "WAF for SaaS" add-on so zone WAF rules extend to all custom hostnames, (b) reintroduce AWS WAF on the ALB to cover L7 attacks at the origin (reverses the "Cloudflare-only WAF" decision in `decisions.md`), or (c) accept the gap and rely on application-layer defenses. Decision needs to land before the first clinic with a custom domain handles real patient traffic.
- [ ] Customer success runbook: what does first-clinic onboarding look like? Manual handholding for the first ~5 clinics, then standardized
- [ ] Documentation portal for clinics — admin-facing how-to guides for setup-a-clinic, manage-staff, configure-billing, etc.
Sales + commercial readiness
- [ ] First-clinic contract signed and includes the platform's standard MSA + DPA
- [ ] Pricing locked for the launch tier (shared-mode default per tenant-isolation.md)
- [ ] Billing flow tested end-to-end — clinic onboarded, subscription created, first invoice generated (manual via FGO at launch is fine; F12 engine ships later)
- [ ] Patient consent flow at launch tested — patient signs up, accepts platform + clinic consents, onboards into the clinic, can withdraw consents granularly
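Because every pre-cutover gate above is a Markdown checkbox, "all green" can be checked mechanically. The following is a rough sketch, assuming the document is kept as Markdown with `- [ ]` / `- [x]` items; the function names are hypothetical and no such script is known to exist in the repository.

```python
import re

# Matches "- [ ] text" and "- [x] text" (any indentation, upper or lower case x).
CHECKBOX = re.compile(r"^\s*- \[([ xX])\]\s*(.*)$")


def unchecked_items(markdown: str) -> list[str]:
    """Return the text of every checkbox item that is still open."""
    open_items = []
    for line in markdown.splitlines():
        match = CHECKBOX.match(line)
        if match and match.group(1) == " ":
            open_items.append(match.group(2).strip())
    return open_items


def gate_is_green(markdown: str) -> bool:
    """The gate is green only when no checkbox remains unchecked."""
    return not unchecked_items(markdown)
```

Running `unchecked_items` over this file before the T-1 go/no-go review gives the list of items that still block the cutover.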
Cutover runbook
Day-of sequence. Estimate: 2–6h depending on legacy-data migration size.
T-7 days
- [ ] Communicate cutover schedule to first clinic (start time, expected duration, what they need to do)
- [ ] Take a manual RDS snapshot of the (empty) production DB as a known-clean starting point
- [ ] Confirm all sub-processor health (Cloudflare, Sentry, Bunny, Clerk, Daily.co, SES, Anthropic)
- [ ] Final dry-run of the legacy migration against a copy
T-1 day
- [ ] Final go/no-go review with everyone on the sign-off list below
- [ ] Confirm on-call rotation knows it's "live tomorrow"
- [ ] Pre-warm Cloudflare cache rules; verify edge cert + custom-hostname for the first clinic
T-0 (cutover day)
- Maintenance mode on legacy — return 503 with a branded maintenance page; communicate to legacy users
- Snapshot legacy database — full pg_dump from the legacy host, verified
- Pull legacy dump to a workstation with the operations IAM role
- Run data-transform pipeline (`services/migration-tools/legacy-import` per the runbook in deployment.md)
- Validate row counts against expected targets; block on any mismatch
- Spot-check a few real legacy users in the new system — sign in, verify profile, see expected appointments
- Take a "post-import" RDS snapshot as a known-good launch state
- Remove maintenance mode, route DNS to production
- Synthetic acceptance test — run a small canary script against production: sign in, list orgs, list specialists, create a test appointment, confirm it persists
- Notify first clinic — they can start onboarding their staff and patients
- Watch monitoring for 24h — on-call active, alarm channels open, dashboards visible
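The synthetic acceptance test in the sequence above is an ordered set of checks that should stop at the first failure. The sketch below shows only that control flow. Each step is a caller-supplied callable, because the actual endpoints and client are not specified here; `run_canary` and the step callables are hypothetical names.

```python
from typing import Callable


def run_canary(steps: list[tuple[str, Callable[[], bool]]]) -> tuple[bool, list[str]]:
    """Run named checks in order; stop at the first failure or exception."""
    log = []
    for name, check in steps:
        try:
            passed = bool(check())
        except Exception as exc:  # a crashing step counts as a failed step
            log.append(f"FAIL: {name} ({exc})")
            return False, log
        if not passed:
            log.append(f"FAIL: {name}")
            return False, log
        log.append(f"ok: {name}")
    return True, log
```

In practice the steps would be, in order: sign in, list orgs, list specialists, create a test appointment, confirm it persists. A `False` result should hold the "Notify first clinic" step until the failing check is understood.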
T+24h, T+1 week, T+1 month
- [ ] T+24h: review error rate, p99 latency, failed-deploy count, and alarm-noise volume; if any of these is concerning, pause new clinic onboarding until resolved
- [ ] T+1 week: post-launch retrospective with the team; what surprised us, what we'd do differently, what runbooks need updating
- [ ] T+1 month: first paid invoice cycle complete (clinic charged successfully, AI cost roll-up correct, no dunning surprises)
Post-cutover monitoring (first 30 days)
What to watch, who watches, what triggers action.
| Signal | Where | Threshold | Action |
|---|---|---|---|
| 5xx error rate | CloudWatch alarm restartix-production-core-api-5xx | > 0.5% sustained 10m | On-call investigation |
| p99 latency | CloudWatch (Core API target group) | > 2s sustained 10m | On-call investigation |
| RDS connection saturation | CloudWatch | > 80% of max_connections | Capacity review |
| RDS replica lag (when read replicas exist) | CloudWatch | > 5s sustained | Replica health check |
| Clerk auth failures | Clerk dashboard + Sentry | Spike vs. baseline | Possible auth incident |
| Sentry new-error rate | Sentry | New issue class with high volume | Triage same-day |
| Bunny CDN delivery errors | Bunny dashboard | > 1% sustained | CDN health check |
| AWS spend trajectory | AWS Cost Explorer + Budgets | > forecast for the month | Cost review |
| Daily backup status | CloudWatch | Missed backup or checksum mismatch | Critical — investigate immediately |
| Audit-partition-roll cron | CloudWatch | Failure | Critical — ensure next-month partitions exist |
The full alarm catalogue lives in monitoring.md. This table is the post-launch focused subset.
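Several thresholds in the table are phrased as "sustained 10m". One way to read that, assuming one datapoint per minute and requiring every datapoint in the window to breach, is sketched below. This is an interpretation for illustration; the actual alarm evaluation settings belong to the CloudWatch alarm definitions in monitoring.md.

```python
def sustained_breach(samples: list[float], threshold: float, window: int) -> bool:
    """True if `window` consecutive samples all exceed `threshold`.

    Assumes evenly spaced samples, e.g. one 5xx-rate percentage per minute.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= window:
            return True
    return False
```

Under this reading, ten consecutive one-minute 5xx rates above 0.5 trigger the on-call investigation, while a nine-minute spike followed by a clean minute does not.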
Sign-off list
Before the cutover runbook executes, every signature here is required:
| Sign-off | Owner | Confirms |
|---|---|---|
| Engineering | Tech lead | F11 hardening complete; production environment validated; backup posture verified; incident response playbook reviewed |
| Regulatory / Compliance | Romanian counsel (F11.0.5) | Privacy notice + DPA templates approved; MDR class confirmed; data residency confirmed; sub-processor list published |
| Operations | On-call lead | On-call rotation in place; status page live; support escalation defined; chaos drill complete |
| Customer Success | First-clinic onboarding lead | First clinic ready; onboarding runbook tested; documentation portal usable |
| Commercial | Founder / business owner | First-clinic contract signed; pricing locked; billing flow tested |
What's deliberately not in this gate
To keep the gate honest about what blocks launch vs. what's nice-to-have:
- F12 Billing engine — not a launch blocker. First clinic gets manually-cut FGO invoices until the engine ships (features.md → F12).
- F13 Dedicated tenancy mode — deferred until first paying dedicated-mode contract.
- Multi-region / data residency per-tenant — out of scope per CLAUDE.md.
- Mobile apps — open decision in features.md; web-only is acceptable for launch.
- Customer-managed KMS — Phase 1 uses AWS-managed; CMK migration triggers documented in aws-infrastructure.md → Customer-managed KMS migration path.
- Datadog APM — CloudWatch + Sentry covers the launch; Datadog deferred until traffic + team scale justify it.
- Cross-region S3 backup replication — Layer 3 of backup-disaster-recovery.md deferred to within first quarter post-launch.
- Phase 2 read replicas — added when triggers in scaling-architecture.md → Lever 5 fire.
Related documentation
- implementation-plan.md — master plan
- foundation.md — foundation 1A–1E
- features.md — F1–F12 + F13
- aws-infrastructure.md — full topology and cost
- iac-layout.md — Terraform module structure
- deployment.md — CI/CD pipeline + runbooks (including legacy-data migration)
- scaling-architecture.md — connection math + scaling levers
- monitoring.md — alarms + incident response
- backup-disaster-recovery.md — RPO/RTO + DR drills
- decisions.md — architectural rationale
- external-providers.md — sub-processor list