
Production Launch Readiness

The operational gate that flips the platform from "F11 hardening complete" to "real clinic, real patients, real revenue." This is distinct from F11: F11 is the technical/regulatory feature set (features.md → F11); this document is the operational checklist that sequences the cutover and confirms every external dependency, every runbook, and every sign-off is in place.

This is not a feature spec. It's a gate document — a living checklist that fills in over the months between foundation closing and the first paying clinic going live.

The three gates, in order:

  1. Foundation gate (1E.3) — staging deployed, foundation acceptance test passes against AWS staging. No real patients. (foundation.md)
  2. Layer 2 / F-tier feature build-out — F1–F12 features built on top of the substrate. (features.md)
  3. Production launch readiness — this doc. The operational gate before real clinics see real production.

See decisions.md for the architectural decisions feeding this gate.


Acceptance criteria

The platform is ready for the first paying clinic when all of the following are true:

  • Foundation 1E.3 closed (staging gate passed)
  • All F-tier features required at launch are merged, tested, deployed to staging, and validated end-to-end
  • F11 hardening complete (technical + regulatory features shipped, production environment provisioned and load-tested)
  • All pre-launch wiring tasks below complete
  • Romanian regulatory counsel has signed off on F11.0.5 findings
  • Legacy-data migration runbook executed in dry-run against a copy of the legacy database
  • Incident response on-call rotation exists and has been tested
  • Sign-off list below is fully checked

If any one is missing, the platform is not launching today.


Pre-cutover gates

Grouped by area. Each must be green before the cutover runbook executes.

Foundation + features

  • [ ] Foundation 1E.3 closed and locked (apps/docs/implementation-plan/foundation.md shows all checkboxes checked)
  • [ ] F1–F8 features that are part of launch scope are merged, tested, and deployed to staging
  • [ ] F9 telerehab is in launch scope OR explicitly deferred — confirm decision
  • [ ] F10 Telemetry service is deployed if F9 ships at launch (telerehab depends on telemetry; see telemetry/index.md)
  • [ ] F11 technical hardening complete: GDPR DSAR endpoint works end-to-end, prod KMS rotation tested, security scan passes, performance benchmarks recorded against staging-shape production environment
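Several gates in this document reduce to "every checkbox in a file is checked". A minimal sketch of how that could be verified mechanically, assuming the checklists use `[ ]` / `[x]` markers as on this page; the function names are hypothetical and not part of the repository:

```python
import re
from typing import List

# Matches list items such as "- [ ] task", "* [x] task" or "• [ ] task".
CHECKBOX = re.compile(r"^\s*(?:[-*•]\s+)?\[(?P<mark>[ xX])\]\s+(?P<text>.+)$")


def unchecked_items(text: str) -> List[str]:
    """Return the text of every checklist item that is still unchecked."""
    open_items = []
    for line in text.splitlines():
        match = CHECKBOX.match(line)
        if match and match.group("mark") == " ":
            open_items.append(match.group("text").strip())
    return open_items


def gate_is_green(text: str) -> bool:
    """A gate is green only when it has no unchecked items."""
    return not unchecked_items(text)


if __name__ == "__main__":
    sample = """
      • [x] Foundation 1E.3 closed and locked
      • [ ] F9 telerehab is in launch scope OR explicitly deferred
    """
    for item in unchecked_items(sample):
        print("OPEN:", item)
```

Note that a file with no checkboxes at all also counts as green in this sketch; a real check would probably want to require at least one item.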

Infrastructure

  • [ ] Production AWS environment provisioned via Terraform (infra/envs/production) and idempotent
  • [ ] RDS Multi-AZ working; failover tested with a synthetic restart
  • [ ] pgbouncer Multi-AZ working; one task can drain without disruption
  • [ ] ALB + Cloudflare end-to-end (HTTPS, WAF active, custom-hostname provisioning verified)
  • [ ] All scheduled tasks (audit-partition-roll, usage-quota-reset, usage-summary-rollup, check-providers) have run successfully at least once in production
  • [ ] CloudWatch alarms configured per monitoring.md, routing to SNS → Slack
  • [ ] AWS Budgets alarms set at 50% / 80% / 100% of monthly budget

External services + sub-processors

  • [ ] Sentry — org created, projects per service (Core API + 3 Next.js apps + Telemetry if F10 launches), source-map upload step in CI working, release tracking by image SHA, alert routing to Slack tested
  • [ ] Cloudflare — Pro plan active, Cloudflare for SaaS configured, custom-hostnames API token in Secrets Manager, end-to-end test of clinic registering a custom domain
  • [ ] Bunny CDN — account active in EU region, DPA reviewed and signed, Bunny Stream library set up, admin upload workflow tested with reference exercise videos, signed playback URLs working from Patient Portal
  • [ ] Clerk — production-mode keys (not test keys), HIPAA-eligible plan if applicable, BAA signed, webhook endpoint working
  • [ ] Daily.co — production keys, HIPAA-eligible plan if F5 telerehab video is in scope, BAA signed
  • [ ] Anthropic — API keys provisioned for AI agent capabilities (foundation 1C.8); per-org budget controls active via 1C.7 metering; AI agent service shape decided and documented (foundation memory: AI agent runtime is still an open scoping question — close before agents go live)
  • [ ] AWS SES (infra) — production identity verified, DKIM + SPF + DMARC configured, sandbox exit confirmed, account-level suppression list active, SES configuration set with bounce + complaint event destinations pointing at an SNS topic
  • [ ] AWS SES (app-layer code, gap as of 2026-05-10) — the Core API has no bounce/complaint webhook handler today and no suppression table. Build before launch:
    • SNS topic restartix-prod-ses-feedback subscribed to SES bounce + complaint events; HTTPS subscription points at a new public Core API endpoint with SNS signature verification
    • Migration adding a notification_suppression table (recipient, reason: hard_bounce / complaint / manual, suppressed_at, source event ID) — RLS restricts reads to platform-admin (suppression is a platform concern, not per-org)
    • Webhook handler at POST /v1/internal/ses-feedback that verifies the SNS signature, parses the SES event payload, inserts into notification_suppression
    • EmailChannel precheck in internal/core/notify/: before dispatch, query suppression by recipient address; if present, mark the notification dead-lettered with dead_letter_reason='suppressed' rather than calling SES
    • migrations/core/000010_notifications.up.sql deferred-list line about "Bounce / complaint webhooks + suppression list automation" gets ticked here
  • [ ] All sub-processors disclosed in the platform DPA template (1B.10) and the Romanian-localized version
  • [ ] F11.0.5 Romanian compliance pass — counsel engaged, full findings documented, privacy notice template (1B.10) and DPA revised, ANSPDCP enforcement scan complete
  • [ ] Sub-processor list published to a public-facing page on the platform website per GDPR Art. 28 transparency
  • [ ] DPA template ready for clinic onboarding — countersigned versions stored per clinic
  • [ ] MDR Class I posture confirmed by regulatory counsel (or upgraded to Class IIa with appropriate process changes — see CLAUDE.md → Medical Device Readiness)
  • [ ] AWS BAA accepted via AWS Artifact (free; HIPAA-eligibility on the AWS account)
  • [ ] Legacy product DPA termination plan: when the legacy product shuts down, what are the data-handover process and the termination notice to its current users?
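The SES app-layer gap in the list above (feedback handler plus EmailChannel precheck) can be sketched as follows. This is an illustration in Python, not the Core API's implementation: `SuppressionStore` is an in-memory stand-in for the notification_suppression table, the function names are hypothetical, and SNS signature verification is deliberately left out and must run before any parsing. The field names (`Type`, `Message`, `bounce.bounceType`, `bouncedRecipients`, `complainedRecipients`) follow the SNS and SES event formats.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class SuppressionStore:
    """In-memory stand-in for the notification_suppression table."""
    rows: Dict[str, str] = field(default_factory=dict)  # recipient -> reason

    def add(self, recipient: str, reason: str) -> None:
        # Keep the first recorded reason for a recipient.
        self.rows.setdefault(recipient.lower(), reason)

    def reason_for(self, recipient: str) -> Optional[str]:
        return self.rows.get(recipient.lower())


def suppressions_from_sns(body: str) -> List[Tuple[str, str]]:
    """Extract (recipient, reason) pairs from an SNS notification body.

    Assumes the SNS signature has already been verified by the caller.
    """
    envelope = json.loads(body)
    if envelope.get("Type") != "Notification":
        return []  # e.g. SubscriptionConfirmation, handled elsewhere
    event = json.loads(envelope["Message"])
    kind = event.get("eventType") or event.get("notificationType")
    if kind == "Bounce":
        bounce = event["bounce"]
        if bounce.get("bounceType") != "Permanent":
            return []  # only hard bounces are suppressed in this sketch
        return [(r["emailAddress"], "hard_bounce")
                for r in bounce["bouncedRecipients"]]
    if kind == "Complaint":
        return [(r["emailAddress"], "complaint")
                for r in event["complaint"]["complainedRecipients"]]
    return []


def precheck(store: SuppressionStore, recipient: str) -> Optional[str]:
    """Return the dead-letter reason if the recipient is suppressed, else None."""
    return "suppressed" if store.reason_for(recipient) else None
```

A dispatcher would call `precheck` before sending and, on a non-None result, mark the notification dead-lettered with that reason instead of calling SES.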

Data migration

  • [ ] Legacy migration runbook in deployment.md → Runbook: launch-day legacy-data migration executed end-to-end as a dry run against a copy of the legacy database
  • [ ] Row-count and integrity validation queries documented and known to pass against the dry-run output
  • [ ] Legacy passwords (if migrating user accounts) — confirm Clerk's password import path or force-reset-on-first-login flow
  • [ ] Patient consent re-acquisition flow ready — legacy consents may not satisfy the new consent ledger schema; per-purpose re-consent on first login if needed
  • [ ] Rollback plan validated: restoring the "pre-launch" RDS snapshot (or a PITR restore to the same point in time) completes in <1h
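The row-count validation item above could look roughly like this. It is a sketch under assumptions: the expected and actual counts are supplied as plain dictionaries keyed by table name (how they are queried is left to the runbook), and the function names are hypothetical.

```python
from typing import Dict, Optional, Tuple

Mismatch = Tuple[Optional[int], Optional[int]]  # (expected, actual)


def count_mismatches(expected: Dict[str, int],
                     actual: Dict[str, int]) -> Dict[str, Mismatch]:
    """Compare per-table row counts; a table missing on either side is a mismatch."""
    mismatches: Dict[str, Mismatch] = {}
    for table in sorted(set(expected) | set(actual)):
        want, got = expected.get(table), actual.get(table)
        if want != got:
            mismatches[table] = (want, got)
    return mismatches


def assert_counts_match(expected: Dict[str, int], actual: Dict[str, int]) -> None:
    """Raise if any table differs, so the migration blocks on any mismatch."""
    mismatches = count_mismatches(expected, actual)
    if mismatches:
        details = ", ".join(
            f"{table}: expected {want}, got {got}"
            for table, (want, got) in mismatches.items()
        )
        raise RuntimeError("row-count validation failed: " + details)
```

Raising instead of returning a flag means the dry run and the launch-day run both stop at the same point when counts diverge.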

Operational readiness

  • [ ] On-call rotation documented and committed to (PagerDuty / Slack alerts / phone tree)
  • [ ] Incident response playbook in monitoring.md → Incident Response Procedures reviewed by all on-call engineers
  • [ ] Synthetic incident drill — chaos test from monitoring.md executed in staging; alerts fired correctly; on-call responded within target time
  • [ ] Status page: public status page configured (statuspage.io / similar), updated either automatically from CloudWatch alarms or manually
  • [ ] Support escalation path — first-line clinic support routes to a human; engineering escalation path defined; severity levels with response-time targets
  • [ ] [email protected] mailbox provisioned — referenced in the break-glass email template (break_glass_opened.email.{en,ro}.tmpl). The template tells the recipient to contact support if anything looks wrong; without a working mailbox at that address, the email is misleading. Set up via Google Workspace / Microsoft 365 / Fastmail / similar with MX records pointing at the chosen provider. Same provisioning thread can land [email protected] (send-only, no inbox needed) and any other addresses the platform uses.
  • [ ] Cloudflare WAF coverage decision for clinic custom domains — by default, Cloudflare WAF rules on the restartix.pro zone do NOT extend to clinic custom hostnames (e.g. physio-bucharest.ro) via Cloudflare for SaaS. DDoS protection is always-on; L7 WAF is not. Three options at launch: (a) subscribe to Cloudflare's "WAF for SaaS" add-on so zone WAF rules extend to all custom hostnames, (b) reintroduce AWS WAF on the ALB to cover L7 attacks at the origin (reverses the "Cloudflare-only WAF" decision in decisions.md), or (c) accept the gap and rely on application-layer defenses. Decision needs to land before the first clinic with a custom domain handles real patient traffic.
  • [ ] Customer success runbook — what does first-clinic onboarding look like? Manual handholding for the first ~5 clinics, then standardized
  • [ ] Documentation portal for clinics — admin-facing how-to guides for setup-a-clinic, manage-staff, configure-billing, etc.

Sales + commercial readiness

  • [ ] First-clinic contract signed and includes the platform's standard MSA + DPA
  • [ ] Pricing locked for the launch tier (shared-mode default per tenant-isolation.md)
  • [ ] Billing flow tested end-to-end — clinic onboarded, subscription created, first invoice generated (manual via FGO at launch is fine; F12 engine ships later)
  • [ ] Patient consent flow at launch tested — patient signs up, accepts platform + clinic consents, onboards into the clinic, can withdraw consents granularly

Cutover runbook

Day-of sequence. Estimate: 2–6h depending on legacy-data migration size.

T-7 days

  • [ ] Communicate cutover schedule to first clinic (start time, expected duration, what they need to do)
  • [ ] Take a manual RDS snapshot of the (empty) production DB as a known-clean starting point
  • [ ] Confirm all sub-processor health (Cloudflare, Sentry, Bunny, Clerk, Daily.co, SES, Anthropic)
  • [ ] Final dry-run of the legacy migration against a copy

T-1 day

  • [ ] Final go/no-go review with everyone on the sign-off list below
  • [ ] Confirm on-call rotation knows it's "live tomorrow"
  • [ ] Pre-warm Cloudflare cache rules; verify edge cert + custom-hostname for the first clinic

T-0 (cutover day)

  1. Maintenance mode on legacy — return 503 with a branded maintenance page; communicate to legacy users
  2. Snapshot legacy database — full pg_dump from the legacy host, verified
  3. Pull legacy dump to a workstation with the operations IAM role
  4. Run data-transform pipeline (services/migration-tools/legacy-import per the runbook in deployment.md)
  5. Validate row counts against expected targets — block on any mismatch
  6. Spot-check a few real legacy users in the new system — sign in, verify profile, see expected appointments
  7. Take a "post-import" RDS snapshot as a known-good launch state
  8. Remove maintenance mode, route DNS to production
  9. Synthetic acceptance test — run a small canary script against production: sign in, list orgs, list specialists, create a test appointment, confirm it persists
  10. Notify first clinic — they can start onboarding their staff and patients
  11. Watch monitoring for 24h — on-call active, alarm channels open, dashboards visible
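The T-0 sequence is strictly ordered, and a failed step (for example a row-count mismatch) should stop everything after it. A minimal sketch of that control flow, assuming each step can be expressed as a callable that returns True on success; `run_cutover` and the step names are hypothetical:

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_cutover(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order and stop at the first one that fails.

    Returns (names of completed steps, name of the failed step or None).
    """
    completed: List[str] = []
    for name, check in steps:
        try:
            ok = check()
        except Exception:
            ok = False  # an exception counts as a failed step
        if not ok:
            return completed, name
        completed.append(name)
    return completed, None


if __name__ == "__main__":
    done, failed = run_cutover([
        ("snapshot legacy database", lambda: True),
        ("validate row counts", lambda: False),  # simulated mismatch
        ("route DNS to production", lambda: True),
    ])
    print("completed:", done)
    print("blocked at:", failed)
```

In the example, DNS is never routed because validation failed first, which is the behaviour the "block on any mismatch" step asks for.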

T+24h, T+1 week, T+1 month

  • [ ] T+24h: review error rate, p99 latency, failed deploys, and alarm-noise volume; if any of these is concerning, pause new clinic onboarding until resolved
  • [ ] T+1 week: post-launch retrospective with the team; what surprised us, what we'd do differently, what runbooks need updating
  • [ ] T+1 month: first paid invoice cycle complete (clinic charged successfully, AI cost roll-up correct, no dunning surprises)

Post-cutover monitoring (first 30 days)

What to watch, who watches, what triggers action.

| Signal | Where | Threshold | Action |
|---|---|---|---|
| 5xx error rate | CloudWatch alarm restartix-production-core-api-5xx | > 0.5% sustained 10m | On-call investigation |
| p99 latency | CloudWatch (Core API target group) | > 2s sustained 10m | On-call investigation |
| RDS connection saturation | CloudWatch | > 80% of max_connections | Capacity review |
| RDS replica lag (when read replicas exist) | CloudWatch | > 5s sustained | Replica health check |
| Clerk auth failures | Clerk dashboard + Sentry | Spike vs. baseline | Possible auth incident |
| Sentry new-error rate | Sentry | New issue class with high volume | Triage same-day |
| Bunny CDN delivery errors | Bunny dashboard | > 1% sustained | CDN health check |
| AWS spend trajectory | AWS Cost Explorer + Budgets | > forecast for the month | Cost review |
| Daily backup status | CloudWatch | Missed backup or checksum mismatch | Critical: investigate immediately |
| Audit-partition-roll cron | CloudWatch | Failure | Critical: ensure next-month partitions exist |

The full alarm catalogue lives in monitoring.md. This table is the post-launch focused subset.
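Several thresholds in the table are "sustained" conditions rather than single-sample spikes. As an illustration of what "> 0.5% sustained 10m" means, here is a small sketch that assumes one datapoint per minute and requires every datapoint in the window to breach. This is similar in spirit to a CloudWatch alarm that needs all evaluation periods to breach, but the function is hypothetical and not how the alarms in monitoring.md are defined.

```python
from typing import Sequence


def breached_sustained(samples: Sequence[float], threshold: float, periods: int) -> bool:
    """True when the most recent `periods` samples all exceed `threshold`."""
    if periods <= 0:
        raise ValueError("periods must be positive")
    if len(samples) < periods:
        return False  # not enough data to call it sustained
    return all(value > threshold for value in samples[-periods:])


if __name__ == "__main__":
    # 5xx rate in percent, one sample per minute (made-up values).
    five_xx = [0.1] * 5 + [0.6] * 10
    print(breached_sustained(five_xx, threshold=0.5, periods=10))
```

A single dip back under the threshold resets the condition, which keeps short spikes from paging on-call.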


Sign-off list

Before the cutover runbook executes, every signature here is required:

| Sign-off | Owner | Confirms |
|---|---|---|
| Engineering | Tech lead | F11 hardening complete; production environment validated; backup posture verified; incident response playbook reviewed |
| Regulatory / Compliance | Romanian counsel (F11.0.5) | Privacy notice + DPA templates approved; MDR class confirmed; data residency confirmed; sub-processor list published |
| Operations | On-call lead | On-call rotation in place; status page live; support escalation defined; chaos drill complete |
| Customer Success | First-clinic onboarding lead | First clinic ready; onboarding runbook tested; documentation portal usable |
| Commercial | Founder / business owner | First-clinic contract signed; pricing locked; billing flow tested |

What's deliberately not in this gate

To keep the gate honest about what blocks launch vs. what's nice-to-have:

  • F12 Billing engine — not a launch blocker. First clinic gets manually-cut FGO invoices until the engine ships (features.md → F12).
  • F13 Dedicated tenancy mode — deferred until first paying dedicated-mode contract.
  • Multi-region / data residency per-tenant — out of scope per CLAUDE.md.
  • Mobile apps — open decision in features.md; web-only is acceptable for launch.
  • Customer-managed KMS — Phase 1 uses AWS-managed; CMK migration triggers documented in aws-infrastructure.md → Customer-managed KMS migration path.
  • Datadog APM — CloudWatch + Sentry covers the launch; Datadog deferred until traffic + team scale justify it.
  • Cross-region S3 backup replication — Layer 3 of backup-disaster-recovery.md deferred to within first quarter post-launch.
  • Phase 2 read replicas — added when triggers in scaling-architecture.md → Lever 5 fire.