AWS Infrastructure
The RestartiX platform runs on AWS in eu-central-1 (Frankfurt) with Cloudflare at the edge. Compute is ECS Fargate for every long-running process. Data lives in RDS Postgres (production) or Aurora Serverless v2 (staging) plus ElastiCache Redis and S3. Cloudflare handles DNS, CDN, WAF, and per-tenant custom-domain TLS via Cloudflare for SaaS. Infrastructure is managed entirely as code with Terraform.
This document describes the steady-state architecture — services, sizing, networking, costs. Operational concerns live in linked docs:
- Deployment & CI/CD — how code reaches production
- IaC layout — Terraform module structure
- Scaling plan — growth phases and triggers
- Backup & DR — RPO/RTO, lifecycle
- Monitoring — alarms and dashboards
- Decisions — why this shape and not another
Provider stack at a glance
┌────────────────────────────────────────────────────┐
Patients ──────► │ Cloudflare (edge) │
Specialists │ • DNS for restartix.pro │
Admins │ • CDN for /_next/static/* │
│ • WAF + DDoS + bot protection │
│ • Cloudflare for SaaS (per-clinic custom domains) │
└────────────────┬───────────────────────────────────┘
│ HTTPS
▼
┌────────────────────────────────────────────────────┐
│ AWS eu-central-1 (Frankfurt) │
│ │
│ ┌─────────────────────────────────────────────┐ │
│ │ Application Load Balancer │ │
│ │ (TLS via ACM, host-based routing) │ │
│ └────────────────┬────────────────────────────┘ │
│ │ │
│ ┌────────────────▼────────────────────────────┐ │
│ │ ECS Fargate cluster │ │
│ │ ┌────────────┐ ┌────────────┐ ┌──────────┐ │ │
│ │ │ Core API │ │ Telemetry │ │ pgbouncer│ │ │
│ │ │ (Go) │ │ API (Go) │ │ │ │ │
│ │ └────────────┘ └────────────┘ └──────────┘ │ │
│ │ ┌────────────┐ ┌────────────┐ ┌──────────┐ │ │
│ │ │ Clinic │ │ Portal │ │ Console │ │ │
│ │ │ (Next.js) │ │ (Next.js) │ │ (Next.js)│ │ │
│ │ └────────────┘ └────────────┘ └──────────┘ │ │
│ └──────────────┬─────────────┬────────────────┘ │
│ │ │ │
│ ┌──────────────▼──┐ ┌──────▼───────────┐ │
│ │ RDS Postgres 17 │ │ ElastiCache │ │
│ │ (or Aurora SLv2) │ │ Redis │ │
│ └─────────────────┘ └──────────────────┘ │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │
│ │ S3 │ │ KMS │ │ Secrets Manager │ │
│ │ (uploads │ │ (column- │ │ (DB creds, API │ │
│ │ +archive)│ │ keys) │ │ keys, etc.) │ │
│ └──────────┘ └──────────┘ └──────────────────┘ │
│ │
│ ┌──────────┐ ┌──────────────────────────────────┐│
│ │ SES │ │ ECR + CloudWatch Logs/Alarms ││
│ └──────────┘ └──────────────────────────────────┘│
└────────────────────────────────────────────────────┘
External services (sub-processors, disclosed in the DPA):
├── Clerk (auth, US)
├── Daily.co (telerehab video, US)
└── Anthropic (AI agents, US)

Region: eu-central-1 (Frankfurt)
Frankfurt is the chosen region for production and staging. The decision is driven by:
- EU data residency — GDPR Day-1 requirement. All patient data must remain in the EU. Frankfurt sits inside the EU and the EEA.
- Latency to Romania — ~25-35ms RTT to Bucharest, the lowest of any AWS EU region.
- Service availability — every service we use (App Runner-class workloads on Fargate, Aurora Serverless v2, ElastiCache Redis 7, KMS, SES, etc.) is GA in eu-central-1.
- HIPAA BAA — accepted in AWS Artifact at the account level; covers every AWS service we use at no additional cost.
There is no multi-region architecture. Cross-region read replicas, multi-region writes, and per-tenant region selection are all out of scope per CLAUDE.md → Project Overview and scaling.md → Beyond Phase 2.
Services overview
| Service | Purpose | Production | Staging |
|---|---|---|---|
| ECS Fargate | Compute for all long-running processes | On-demand, Multi-AZ tasks | Spot, single-AZ tasks |
| RDS Postgres 17 | Primary database | db.t4g.medium Multi-AZ | — |
| Aurora Serverless v2 | Primary database | — | 0.5–2 ACU, scale-to-zero |
| ElastiCache Redis 7 | Rate limits, hold slots, cache-aside | cache.t4g.small + replica | cache.t4g.micro single node |
| Application Load Balancer | TLS termination, host-based routing | 1 ALB, multi-AZ | 1 ALB, single-AZ |
| S3 | File uploads + audit archives | 2 buckets, versioning, lifecycle | 2 buckets, shorter retention |
| KMS | Column-level encryption keys | 1 customer-managed key | 1 customer-managed key |
| Secrets Manager | DB creds, API keys, signing secrets | ~10 secrets | ~10 secrets |
| SES | Transactional email | Production identity, DKIM, suppression list | Sandbox or low-volume identity |
| ECR | Container registry | Shared between environments, lifecycle policy | Shared |
| CloudWatch | Logs + alarms + dashboards | Full alarm set, 90d log retention | Basic alarms, 30d log retention |
| VPC + NAT | Private networking | NAT Gateway (single AZ to start) | t4g.nano NAT instance |
| GitHub Actions OIDC | Deploy-time AWS auth | OIDC provider + deploy role | Same OIDC provider |
Networking
VPC layout
One VPC per environment (restartix-staging, restartix-production), CIDR 10.0.0.0/16. Each VPC has:
- 2 public subnets (one per AZ) — host the NAT Gateway / NAT instance and the ALB
- 2 private subnets (one per AZ) — host all Fargate tasks, RDS, ElastiCache
- VPC endpoints for S3, ECR, Secrets Manager, KMS, CloudWatch Logs — eliminate NAT egress for AWS-internal traffic and reduce data-processing costs
Production uses both AZs for Multi-AZ. Staging deploys everything to a single AZ for cost; the second AZ exists in the VPC layout but has no resources running in it.
Security groups
| Group | Inbound | Source |
|---|---|---|
| alb | 443 from internet (TCP) | 0.0.0.0/0 |
| fargate-app | App ports (9000, 9100, 9200, 9300, 4000) from ALB | alb SG |
| fargate-pgbouncer | 6432 from app tasks | fargate-app SG |
| rds | 5432 from pgbouncer + migration runner | fargate-pgbouncer SG, migrations-runner SG |
| redis | 6379 from app tasks | fargate-app SG |
Database is never exposed to the internet. Direct psql access from a developer laptop goes through AWS SSM Session Manager port forwarding to a dedicated migrations-runner-style task or the pgbouncer task. No SSH bastion, no RDS public endpoint.
Egress to internet
App tasks reach external services (Clerk, Daily.co, Anthropic, Cloudflare) via:
- Production: NAT Gateway, single AZ to start (~$38/mo + per-GB processing). HA NAT (one per AZ) is a Phase 2 upgrade once traffic justifies it.
- Staging: t4g.nano NAT instance (~$3/mo). Single point of failure is acceptable for a staging environment; the trade-off is documented.
VPC endpoints handle S3, ECR, Secrets Manager, KMS, and CloudWatch Logs without going through NAT, which keeps NAT Gateway processing costs minimal.
Inbound from internet
All HTTPS traffic comes through Cloudflare. The ALB accepts traffic only from Cloudflare's IP ranges (security group rule), which protects the origin from direct DDoS and prevents bypass of WAF rules. Cloudflare passes through:
- Original `Host` header for tenant resolution in `proxy.ts`
- `X-Forwarded-For` and `CF-Connecting-IP` for client IP recording in audit logs
- `X-Forwarded-Host` when traffic arrives via Cloudflare for SaaS custom-domain routing
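To make the contract concrete, here is a minimal Go sketch of how an origin service might consume these headers. This is illustrative only (it is not the actual proxy.ts logic); the context keys and fallback order are assumptions:

```go
package main

import (
	"context"
	"net/http"
)

type tenantHostKey struct{}
type clientIPKey struct{}

// tenantMiddleware extracts the tenant hostname and real client IP from
// the headers Cloudflare forwards, and stashes them on the request context.
func tenantMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Custom-domain traffic (Cloudflare for SaaS) carries the original
		// hostname in X-Forwarded-Host; platform subdomains arrive in Host.
		host := r.Header.Get("X-Forwarded-Host")
		if host == "" {
			host = r.Host
		}
		// CF-Connecting-IP is the single client IP used for audit logging;
		// X-Forwarded-For is the fallback chain.
		ip := r.Header.Get("CF-Connecting-IP")
		if ip == "" {
			ip = r.Header.Get("X-Forwarded-For")
		}
		ctx := context.WithValue(r.Context(), tenantHostKey{}, host)
		ctx = context.WithValue(ctx, clientIPKey{}, ip)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}
```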
Compute: ECS Fargate
A single ECS cluster per environment (restartix-staging, restartix-production) hosts every long-running process. Fargate is used everywhere — there are no EC2 instances managed by the platform.
Service breakdown
| Service | Image | Production | Staging |
|---|---|---|---|
| Core API | services/api/cmd/api | 2× (1 vCPU / 2 GB), scale 2–10 on CPU | 1× (0.5 vCPU / 1 GB), Spot |
| Telemetry API (Layer 2, ships post-foundation) | services/telemetry | 2× (0.5 vCPU / 1 GB) Multi-AZ | 1× (0.5 vCPU / 1 GB) Spot |
| Clinic app | apps/clinic | 2× (0.5 vCPU / 1 GB), scale 2–8 on CPU | 1× (0.25 vCPU / 0.5 GB), Spot |
| Portal app | apps/portal | 2× (0.5 vCPU / 1 GB), scale 2–8 on CPU | 1× (0.25 vCPU / 0.5 GB), Spot |
| Console app | apps/console | 1× (0.25 vCPU / 0.5 GB), fixed | 1× (0.25 vCPU / 0.5 GB), Spot |
| pgbouncer | services/api/deploy/pgbouncer | 2× (0.25 vCPU / 0.5 GB), one per AZ | 1× (0.25 vCPU / 0.5 GB) |
Why ECS Fargate everywhere, not App Runner. App Runner cannot run scheduled tasks (we need them for cmd/audit-partition-roll, cmd/usage-quota-reset, cmd/usage-summary-rollup, cmd/check-providers), cannot run init/migration containers as part of a service deploy, and cannot host TCP services (pgbouncer must be on Fargate either way). Mixing App Runner and Fargate means operating two compute platforms; consolidating on Fargate keeps the IaC single-shaped. See decisions.md → Why ECS Fargate over App Runner.
Auto-scaling
Each service has an Application Auto Scaling target with a target-tracking policy on average CPU utilization:
resource "aws_appautoscaling_policy" "core_api_cpu" {
policy_type = "TargetTrackingScaling"
target_value = 70.0
scale_in_cooldown = 300
scale_out_cooldown = 60
predefined_metric_specification {
predefined_metric_type = "ECSServiceAverageCPUUtilization"
}
}Scale-out is fast (60s cooldown), scale-in is slow (5min cooldown) to avoid flapping. Bounds (min_capacity, max_capacity) live in Terraform — adjusting them is a PR + apply, no service restart.
The Console app does not auto-scale — it serves a small fixed audience (superadmins) and pinning to one task simplifies session-related debugging.
Spot vs on-demand
- Production: on-demand only. The cost premium over Spot is small at this scale and Spot evictions, while rare, are an operational distraction we don't need.
- Staging: Fargate Spot for everything except pgbouncer. Spot is ~70% cheaper. Eviction handling is automatic — ECS replaces the task within ~30 seconds.
pgbouncer stays on-demand even in staging because eviction would briefly drop the entire DB connection layer; the savings don't justify the noise.
Scheduled tasks
cmd/audit-partition-roll, cmd/usage-quota-reset, cmd/usage-summary-rollup, cmd/check-providers, and cmd/expired-sessions-sweep run as EventBridge Scheduler → ECS RunTask jobs:
| Task | Schedule | Purpose |
|---|---|---|
| audit-partition-roll | Day 1 of each month, 02:00 UTC | Provisions next 3 monthly partitions for audit_log, audit_ai_provenance, webhook + inbound webhook tables, usage_records |
| usage-quota-reset | Day 1 of each month, 00:05 UTC | Resets usage_quotas.current_units = 0, advances period_start_at / period_end_at |
| usage-summary-rollup | Day 1 of each month, 03:00 UTC | Closes the prior month's usage_summaries row per (org × capability) |
| check-providers | Every 5 min in staging, every 1 min in prod | Healthchecks every row in platform_service_providers, flips status on failure |
| expired-sessions-sweep | Every 15 min, all envs | Finalizes orphan-expired break_glass_sessions + patient_impersonation_sessions (1B.11). Writes closed_at = expires_at + system-attributed close audit row with action_context = break_glass / impersonation |
Each scheduled task uses the same task definition family as the corresponding service binary (Core API for the audit/quota/rollup/check binaries) — different command, same image, same IAM role.
Connection pooling: pgbouncer on Fargate
The Core API uses pgx with two pools per process (admin + app role per P2). At fleet scale (5–10 Fargate tasks × 2 pools × 25 conns) that's 250–500 connections fanning out from the API tier. RDS's max_connections=200 would either reject connections or burn ~6 GB on idle pool slots without a pooler in front.
pgbouncer in transaction pool mode sits between the application tier and RDS, accepting up to 1000 client connections and multiplexing them onto a small set of backend connections (~25 per pgbouncer task). The application doesn't know there's a pooler — it connects to a different DSN.
Why pgbouncer, not RDS Proxy
RDS Proxy is the AWS-native alternative and would slot into Fargate cleanly. We don't use it because it pins client-to-backend connections when it sees prepared statements, and pgx uses named prepared statements by default (QueryExecModeCacheStatement). Pinning eliminates the multiplexing benefit — the entire reason for adding a pooler. The two ways to make RDS Proxy work would be (1) switching pgx to QueryExecModeCacheDescribe or QueryExecModeExec (loses ~10–30% query throughput on repeated queries) or (2) waiting for RDS Proxy to add protocol-level prepared-statement support (no published timeline).
Self-hosted pgbouncer 1.25 supports protocol-level prepared statements transparently with max_prepared_statements=200. The pgx default works, plan caching benefits intact. Trade-off: one extra Fargate service. pgbouncer is a single static binary with a single config file — minimal operational footprint.
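For concreteness, a sketch of the application-side wiring under these constraints. The DSN, pool size, and helper name are illustrative; the exec mode is the pgx default this section refers to:

```go
package main

import (
	"context"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
)

// newPool builds one of the two per-process pools, pointed at pgbouncer
// rather than the database directly.
func newPool(ctx context.Context, dsn string) (*pgxpool.Pool, error) {
	// e.g. postgres://app@pgbouncer:6432/restartix — the real DSNs live
	// in the restartix/{env}/database secret.
	cfg, err := pgxpool.ParseConfig(dsn)
	if err != nil {
		return nil, err
	}
	// The pgx default: named prepared statements, cached per connection.
	// Safe through pgbouncer thanks to max_prepared_statements; this is
	// exactly what RDS Proxy would pin on.
	cfg.ConnConfig.DefaultQueryExecMode = pgx.QueryExecModeCacheStatement
	cfg.MaxConns = 25 // matches the fan-out math in the sizing note above
	return pgxpool.NewWithConfig(ctx, cfg)
}
```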
Local vs AWS config
The pgbouncer.ini shipped at services/api/deploy/pgbouncer/ is identical to what the Fargate task uses except for the auth method:
| Setting | Local (docker-compose) | AWS (ECS Fargate) |
|---|---|---|
| Image | edoburu/pgbouncer:v1.25.1-p0 | Same image, mirrored to ECR |
| auth_type | plain (passwords in userlist.txt) | scram-sha-256 |
| Auth source | userlist.txt (committed) | auth_query against a SECURITY DEFINER Postgres function; pgbouncer's own credential from Secrets Manager |
| Backend host | postgres | RDS writer endpoint or Aurora Serverless v2 cluster endpoint |
| TLS to backend | disable | require |
| Replicas | 1 | 2 (one per AZ) behind ALB |
| pool_mode | transaction | transaction |
| max_prepared_statements | 200 | 200 |
| default_pool_size | 25 | 25 |
| max_client_conn | 1000 | 1000 |
Migrations bypass pgbouncer
golang-migrate uses session-scoped pg_advisory_lock to serialize migration runs across deploying instances. Advisory locks are session features — pgbouncer in transaction mode would release them mid-migration. Migrations run as a one-shot ECS task using DATABASE_DIRECT_URL, which points directly at the RDS or Aurora cluster endpoint. The migration task's security group is allowed direct port 5432 to the database for that task only. See deployment.md for the deploy-pipeline mechanics.
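A minimal sketch of the one-shot migration entrypoint, assuming golang-migrate's Go API with file-based migrations (the actual binary may differ):

```go
package main

import (
	"errors"
	"log"
	"os"

	"github.com/golang-migrate/migrate/v4"
	_ "github.com/golang-migrate/migrate/v4/database/postgres"
	_ "github.com/golang-migrate/migrate/v4/source/file"
)

func main() {
	// DATABASE_DIRECT_URL bypasses pgbouncer so the session-scoped
	// advisory lock survives the whole run.
	m, err := migrate.New("file://migrations", os.Getenv("DATABASE_DIRECT_URL"))
	if err != nil {
		log.Fatal(err)
	}
	// Up() holds the advisory lock for the session; a second deploying
	// instance blocks here instead of racing.
	if err := m.Up(); err != nil && !errors.Is(err, migrate.ErrNoChange) {
		log.Fatal(err)
	}
}
```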
No session-mode Postgres features in runtime paths
Per P44: no advisory locks, no LISTEN/NOTIFY, no SET (use set_config(..., true) for transaction-scoped state), no temp tables. Anything session-scoped breaks under transaction-mode pooling.
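As an illustration of the transaction-scoped alternative, a hedged sketch (the setting name app.current_org_id and the helper shape are assumptions):

```go
package main

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
)

// withOrgContext shows the pattern: set_config(..., true) pins the value
// to this transaction only, so nothing leaks when pgbouncer hands the
// backend connection to another client.
func withOrgContext(ctx context.Context, pool *pgxpool.Pool, orgID string) error {
	tx, err := pool.Begin(ctx)
	if err != nil {
		return err
	}
	defer tx.Rollback(ctx) // no-op after a successful Commit

	// is_local = true scopes the setting to this transaction; plain SET
	// would persist on the pooled backend session.
	if _, err := tx.Exec(ctx,
		`SELECT set_config('app.current_org_id', $1, true)`, orgID); err != nil {
		return err
	}
	// ... RLS-guarded queries run on tx here ...
	return tx.Commit(ctx)
}
```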
Database
Production: RDS Postgres 17, Multi-AZ
Engine: PostgreSQL 17
Instance: db.t4g.medium (2 vCPU, 4 GB RAM)
Storage: 50 GB gp3 (3000 IOPS baseline), auto-scaling enabled (max 200 GB)
Multi-AZ: Enabled (synchronous standby in second AZ, automatic failover)
Encryption at rest: Enabled (AWS-managed key in Phase 1; CMK migration trigger documented below)
Encryption in transit: Required (rds.force_ssl = 1)
Public access: Disabled
Backup retention: 7 days (continuous WAL → PITR to any second within window)
Performance Insights: Enabled (free tier, 7-day retention)
Enhanced Monitoring: Enabled (1-minute granularity)

Parameter group (custom):
shared_preload_libraries: pg_stat_statements
rds.force_ssl: 1
max_connections: 200
shared_buffers: 1GB
effective_cache_size: 3GB
work_mem: 32MB
maintenance_work_mem: 256MB

Extensions (created at migration time, available in the engine):
- `pgcrypto`, `uuid-ossp` — general
- `unaccent`, `pg_trgm` — diacritic-folded picker search per 1A.16
- `vector` — pre-loaded for AI features
- `pg_stat_statements` — slow-query observability
Staging: Aurora Serverless v2, single-AZ, scale-to-zero
Engine: aurora-postgresql 17
Instances: 1 writer (db.serverless), single-AZ
Capacity: 0.5–2 ACU, scale-to-zero enabled (idle compute = $0/hr)
Storage: Aurora-managed, auto-scales
Encryption at rest: Enabled (AWS-managed key)
Encryption in transit: Required
Backup retention: 1 day (staging — production-grade backups not needed)

Same Postgres wire protocol, same extensions, same parameter shape. The application connects via DATABASE_URL / DATABASE_APP_URL exactly as it does to RDS. The cluster scales down to 0 ACU when idle (~5 minutes of inactivity) and wakes in 5–15 seconds when traffic resumes — an acceptable cold start for a staging environment used by the dev team.
Why Aurora Serverless v2 for staging, RDS for production
- Staging is mostly idle. Multi-AZ RDS for staging would burn ~$110/mo on a database that nobody is hitting most of the day. ASv2 scale-to-zero is the meaningful win.
- Production wants predictable costs and predictable performance. RDS at a fixed instance size is easier to capacity-plan. ASv2 charges per ACU-hour; under sustained load it can exceed the equivalent fixed instance.
- Same operational plane. Both are managed RDS-family services — same console, same Terraform provider, same Secrets Manager pattern. Switching staging to RDS later (or production to ASv2) is a parameter change, not a re-architecture.
See decisions.md → Why Aurora Serverless v2 for staging only for the full rationale.
Direct-KMS keyring + BYOK (Phase 2+)
Phase 1 (current, includes 1E.3) — column-encryption keys are KMS-rooted via Secrets Manager envelope. A customer-managed KMS key is provisioned per environment ($1/mo per CMK, see cost shapes) and used as the envelope key for the restartix/{env}/encryption Secrets Manager secret containing the column-encryption keyring. At Core API boot, SecretsManager.GetSecretValue transparently decrypts via that CMK once; the resulting plaintext keyring is held in process memory; AES-256-GCM column operations are local from there. RDS at-rest encryption uses AWS-managed KMS in Phase 1 (HIPAA-eligible, free, auditable via CloudTrail) — the customer-managed CMK only fronts the column-encryption keys + the backup-envelope key in Phase 1. The internal/core/crypto/kmsKeyring stub is reserved for Phase 2 — direct per-data-key KMS calls — and is not in the Phase 1 hot path.
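A sketch of that boot path under the stated assumptions (the JSON keyring layout and helper names are illustrative; the single kms:Decrypt happens transparently inside GetSecretValue):

```go
package main

import (
	"context"
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/json"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/secretsmanager"
)

// Keyring holds versioned 32-byte AES keys; layout is an assumption.
type Keyring struct {
	Keys map[string][]byte `json:"keys"`
}

// loadKeyring runs once at Core API boot; this is the only KMS-backed call.
func loadKeyring(ctx context.Context, secretID string) (*Keyring, error) {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return nil, err
	}
	out, err := secretsmanager.NewFromConfig(cfg).GetSecretValue(ctx,
		&secretsmanager.GetSecretValueInput{SecretId: &secretID})
	if err != nil {
		return nil, err
	}
	var kr Keyring
	return &kr, json.Unmarshal([]byte(*out.SecretString), &kr)
}

// encryptColumn seals one column value locally; KMS is not involved per row.
func (k *Keyring) encryptColumn(version string, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(k.Keys[version])
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}
```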
Phase 2 (deferred) — direct-KMS keyring + per-tenant key custody. Triggered by (a) the first US-based clinic signing and HIPAA BAA scope expanding beyond AWS-managed defaults, (b) the first paying dedicated-mode clinic contract requiring customer-managed key custody (BYOK) as a procurement gate, or (c) an external compliance audit (SOC 2 Type II, ISO 27001, HDS) flagging the SM-envelope shape as an exception to remediate. Until one of those fires, Phase 1 stays as-is.
What Phase 2 changes: RDS gets re-encrypted under a customer-managed CMK (snapshot → restore with new key, ~1h downtime window or live with read-replica promotion). The internal/core/crypto/kmsKeyring stub is implemented — at startup, fetch a KMS-encrypted DEK blob (from Secrets Manager or directly from the kms:GenerateDataKey API), decrypt via kms:Decrypt, hold plaintext in memory. Per-tenant key custody (BYOK) layers in: each tenant's encrypted columns can be sealed under a tenant-specific CMK, with the column code routing decrypts via the tenant ID. The USE_KMS_ENCRYPTION flag flips to true per environment as the rollout proceeds. The migration is not designed in detail here — that's the Phase 2 ADR's job. This paragraph exists to make the trajectory visible so Phase 1 code stays forward-compatible.
Cache: ElastiCache Redis
Production
Engine: Redis 7
Node type: cache.t4g.small (2 vCPU, 1.37 GB)
Replicas: 1 (Multi-AZ failover)
Encryption in transit: Enabled
Encryption at rest: Enabled
VPC: restartix-production, private subnet

Staging
Engine: Redis 7
Node type: cache.t4g.micro (2 vCPU, 0.5 GB)
Replicas: 0 (single node, single-AZ)
Encryption in transit: Enabled
Encryption at rest: Enabled

What it stores
- Rate-limit counters — sliding-window counters keyed by principal + endpoint
- Activity throttle — per-principal `last_activity` write debounce
- Hold slots (P30) — appointment slot holds with TTL during form fill (F4.4)
- Cache-aside reads (P45) — hot read paths via `internal/core/cache.Aside`
Failure mode
Redis is gracefully degradable. If the node fails:
- Rate limiting stops working (all requests allowed) — operational alert, not user-visible
- Activity throttle stops debouncing (`humans.last_activity` updates fire on every request) — slightly more DB writes, no user impact
- Hold slots fail closed (slot reservation falls back to synchronous DB-backed reservation)
- Cache-aside reads miss → DB hit, slower but correct
Production Multi-AZ ensures automatic failover within ~30 seconds. Staging accepts node restarts as routine.
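The fail-open contract in code, assuming go-redis; this is a fixed-window simplification of the sliding-window counters described above, with illustrative key shape and limits:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/redis/go-redis/v9"
)

// allow returns true when the request may proceed. Redis being down is
// an operational alert, never a user-facing 429 or 500.
func allow(ctx context.Context, rdb *redis.Client, principal, endpoint string, limit int64) bool {
	key := "rl:" + principal + ":" + endpoint
	n, err := rdb.Incr(ctx, key).Result()
	if err != nil {
		log.Printf("rate limiter degraded, allowing request: %v", err)
		return true // fail open: rate limiting stops, traffic continues
	}
	if n == 1 {
		rdb.Expire(ctx, key, time.Minute) // first hit opens the window
	}
	return n <= limit
}
```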
Object storage: S3
Two buckets per environment, both encrypted (SSE-S3), versioning enabled, public access blocked at the account and bucket level.
Uploads bucket
restartix-uploads-{env} — patient files, signatures, exercise videos, generated PDFs, document attachments.
- Org-scoped key prefixes (`{org_id}/{surface}/...`); cross-org keys rejected at the application layer
- Signed URLs for upload (5-min TTL) and download (15-min TTL) — see the sketch below
- MIME validation and size caps applied before signing — see internal/integration/s3/
- No lifecycle transitions (uploads are accessed throughout their lifetime)
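A sketch of the signing side under stated assumptions (AWS SDK v2 presigner; bucket and key names are illustrative). MIME and size checks run before this point, per internal/integration/s3/:

```go
package main

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// downloadURL returns a time-limited GET URL for an org-scoped key.
func downloadURL(ctx context.Context, bucket, key string) (string, error) {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return "", err
	}
	presigner := s3.NewPresignClient(s3.NewFromConfig(cfg))
	// 15-minute TTL for downloads (uploads use 5 minutes).
	req, err := presigner.PresignGetObject(ctx, &s3.GetObjectInput{
		Bucket: &bucket,
		Key:    &key, // org-scoped: {org_id}/{surface}/...
	}, s3.WithPresignExpires(15*time.Minute))
	if err != nil {
		return "", err
	}
	return req.URL, nil
}
```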
Audit-archive bucket
restartix-audit-archive-{env} — long-term audit-log retention beyond the hot 12-month window in Postgres.
Lifecycle policy:
| Age | Storage class | Access pattern |
|---|---|---|
| 0–90 days | S3 Standard | Spot-check replays, occasional reads |
| 90–365 days | S3 Glacier Instant Retrieval | Audit replay on demand, 100ms retrieval |
| 365+ days | S3 Glacier Deep Archive | Compliance retention, ~12h restore |
At Glacier Deep Archive ($0.00099/GB-month), 6 years of audit logs cost effectively nothing. Without lifecycle transitions, Standard storage of multi-year audit data is ~25× more expensive.
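As a rough check at list prices: 100 GB of archived audit data costs about $2.30/month in S3 Standard ($0.023/GB-month) versus about $0.10/month in Deep Archive, which is the ~25× gap above.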
Encryption keys: KMS
One customer-managed KMS key per environment, used as the Secrets Manager envelope key for the restartix/{env}/encryption secret that holds the column-encryption keyring. The same CMK also envelopes restartix/{env}/encryption.BACKUP_ENCRYPTION_KEY (the pg_dump envelope key for Layer 2 backups). Column-encryption keys (pii_regulated, auth_secret columns per data-classification.md) are loaded from that SM secret into Core API memory at startup; AES-256-GCM operations are local per row from there.
Envelope encryption keeps KMS API costs negligible. Each Core API task triggers exactly one kms:Decrypt call at startup (transparent inside SecretsManager.GetSecretValue); column operations after that don't touch KMS. CloudTrail logs the per-task SM/KMS access, which is the audit signal customer-managed CMK exists to provide.
The internal/core/crypto/kmsKeyring stub is Phase 2 work (see Direct-KMS keyring + BYOK above) — Phase 1 uses the in-memory keyring loaded from the KMS-protected SM secret, not direct per-data-key KMS calls.
KMS key policy is restricted to the application IAM role (Fargate task role) for the kms:Decrypt action against the restartix/{env}/encryption SM secret context, and to the operations IAM role for Encrypt / Decrypt / GenerateDataKey (key rotation, manual decryption for incident response, future Phase 2 direct-KMS use).
Secrets management: Secrets Manager
Two classes of secrets sit in Secrets Manager. They look the same at provisioning time but behave differently at runtime, and the IaC + rotation runbooks need to treat them distinctly.
Canonical runtime secrets
Read on every process boot; Secrets Manager is the source of truth. Updating the value requires a task restart for the change to take effect.
restartix/{env}/database
├── DATABASE_URL (pgbouncer DSN, app role)
├── DATABASE_APP_URL (pgbouncer DSN, restricted app role for RLS)
├── DATABASE_DIRECT_URL (RDS/Aurora cluster endpoint, owner role, used by migrations)
└── DATABASE_PGBOUNCER_AUTH (pgbouncer's own credential for auth_query)
restartix/{env}/redis
└── REDIS_URL (rediss://... ElastiCache primary endpoint)
restartix/{env}/encryption
├── ENCRYPTION_KEYS (versioned column-encryption keys — see below)
└── BACKUP_ENCRYPTION_KEY (pg_dump envelope key)
restartix/{env}/clerk
└── CLERK_WEBHOOK_SECRET (Svix signing secret for inbound Clerk webhook events)
restartix/{env}/cloudflare
└── CLOUDFLARE_SAAS_API_TOKEN (Cloudflare for SaaS Custom Hostnames API token; scoped to the zone serving custom-domain registrations; consumed by `internal/integration/cloudflare-saas/`)

ENCRYPTION_KEYS is the crown jewel. It encrypts and decrypts every pii_regulated / auth_secret column and the platform_service_providers.credentials_encrypted column that holds every Cat A provider credential. Loss = every Cat A provider locked out and every encrypted PII column unreadable. Rotation procedure lives in the credential-rotation runbook; the IaC ensures the secret is KMS-encrypted under the customer-managed CMK and that only the Fargate task role + the operations role can read it.
Cat A provider bootstrap-seed secrets
Read once, on the first boot of a fresh environment, by bootstrapProviderDefaults in services/api/cmd/api/main.go. The function inserts one row per Cat A capability into platform_service_providers (ON CONFLICT DO NOTHING). Once the row exists the env values are no longer load-bearing — the resolver reads from the DB row on every call (cached per task with a 5-minute TTL), and operators rotate via Console superadmin endpoints (PATCH /v1/admin/platform-service-providers/...), not by updating Secrets Manager + redeploying.
Re-seeding from Secrets Manager is an emergency recovery path: delete the row (DELETE FROM platform_service_providers WHERE capability = $1 AND organization_id IS NULL) and restart the binary. Normal rotation never touches Secrets Manager.
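The seed shape, sketched. The column list and conflict handling are assumptions; the authoritative SQL lives in bootstrapProviderDefaults:

```go
package main

import (
	"context"

	"github.com/jackc/pgx/v5"
)

// seedProvider inserts the platform-level (organization_id IS NULL) row
// for one Cat A capability. ON CONFLICT DO NOTHING makes every boot
// after the first a no-op.
func seedProvider(ctx context.Context, tx pgx.Tx, capability string, encryptedCreds []byte) error {
	_, err := tx.Exec(ctx, `
		INSERT INTO platform_service_providers (capability, organization_id, credentials_encrypted)
		VALUES ($1, NULL, $2)
		ON CONFLICT DO NOTHING`,
		capability, encryptedCreds)
	return err
}
```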
Currently bootstrapped (1C.2 shipped 2026-05-06 — three Cat A capabilities live):
restartix/{env}/email-bootstrap (capability: email — SES)
├── SES_FROM_ADDRESS (gates the bootstrap; empty = skip)
├── SES_CONFIGURATION_SET
└── SES_ENDPOINT_URL (LocalStack only; empty in real envs)
restartix/{env}/storage-bootstrap (capability: objectstore — S3)
├── AWS_BUCKET_NAME (gates the bootstrap; empty = skip)
├── AWS_S3_ENDPOINT_URL (LocalStack only)
└── AWS_S3_USE_PATH_STYLE (LocalStack only)
restartix/{env}/clerk-bootstrap (capability: auth — Clerk)
└── CLERK_SECRET_KEY (gates the bootstrap; empty = skip)

Future Cat A capabilities add their own bootstrap secrets when their providers.Bootstrap calls land — listed here so the secret namespace is reserved up front:
(F9 telerehab — Cat A video) restartix/{env}/daily-bootstrap
└── DAILY_API_KEY
(AI agents — Cat A inference) restartix/{env}/anthropic-bootstrap
└── ANTHROPIC_API_KEY

AWS credentials for SES + S3. The bootstrap marshals AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY into the seed payload. In Fargate the task IAM role is the credential — those env values are empty and the SES / S3 factories fall back to the AWS SDK default credential chain, which picks up the task role. The bootstrap row's empty access-key fields are intentional, not a bug to fix.
Injection
Fargate task definitions reference secrets via secrets: blocks; the values are injected into the task environment at startup, not stored in the task definition itself. The IaC has two distinct rotation contracts to implement:
- Runtime secret rotation — Secrets Manager update → ECS service restart (rolling task replacement) so tasks pick up the new value. No automatic in-task reload.
- Cat A provider rotation — Console superadmin `PATCH` against `platform_service_providers` → `Resolver.Invalidate` clears the local cache → cross-fleet propagation completes within the 5-minute TTL. Secrets Manager is not updated; no redeploy. (A sketch of the cache contract follows this list.)
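A sketch of the per-task resolver cache contract (5-minute TTL plus explicit invalidation). The names mirror the text above; the internals are assumptions, not the real implementation:

```go
package main

import (
	"sync"
	"time"
)

const resolverTTL = 5 * time.Minute

type entry struct {
	value   any
	fetched time.Time
}

// Resolver caches provider rows per task.
type Resolver struct {
	mu     sync.Mutex
	cached map[string]entry
}

func NewResolver() *Resolver {
	return &Resolver{cached: make(map[string]entry)}
}

// Resolve returns the cached row when fresh, otherwise re-reads
// platform_service_providers via fetch.
func (r *Resolver) Resolve(capability string, fetch func() (any, error)) (any, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if e, ok := r.cached[capability]; ok && time.Since(e.fetched) < resolverTTL {
		return e.value, nil
	}
	v, err := fetch()
	if err != nil {
		return nil, err
	}
	r.cached[capability] = entry{value: v, fetched: time.Now()}
	return v, nil
}

// Invalidate runs on the task that served the Console PATCH; the rest of
// the fleet converges as entries age out within resolverTTL.
func (r *Resolver) Invalidate(capability string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.cached, capability)
}
```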
Container registry: ECR
One repository per service:
- restartix-core-api
- restartix-telemetry-api
- restartix-clinic
- restartix-portal
- restartix-console
- restartix-pgbouncer (ECR-mirrored copy of edoburu/pgbouncer)
Lifecycle policy on every repo:
Rule 1: Delete untagged images older than 7 days
Rule 2: Keep only the last 20 tagged images

Without this, ECR storage at $0.10/GB-month grows unbounded. With four services × frequent deploys × 200–500 MB Next.js images, this matters.
ECR is shared between staging and production — the same image SHA promoted from staging to production is the same byte-identical artifact, which is the point.
Edge: Cloudflare
Cloudflare sits in front of the AWS ALB and handles four jobs.
CDN: static assets via Cloudflare
Cloudflare caches static assets at the edge. The cache rules:
| Path | Cloudflare cache | Why |
|---|---|---|
| /_next/static/* | Forever (1 year, immutable) | Next.js stamps content hashes into filenames; new build → new URLs, no cache bust needed |
| /_next/image/* | Short TTL (1 hour) by URL | Server-rendered image transforms; URL parameters define the variant |
| HTML routes (everything else) | Never cached | Per-tenant, per-user; cookies + auth state make every response unique |
Patient health data never reaches the CDN. HTML responses bypass Cloudflare's cache by default because Next.js sets cache-control: private, no-store. Static assets (JS bundles, CSS, fonts, icons) are public by design.
WAF + DDoS
Cloudflare's WAF runs on every request — managed rule sets (OWASP Top 10), bot fight mode, rate-limiting at the edge. WAF rules live in Cloudflare configuration, not Terraform's AWS provider.
We do not use AWS WAF — Cloudflare handles this at the edge before traffic reaches AWS. Duplicating the WAF layer would burn money for no incremental protection.
Cloudflare for SaaS: per-tenant custom domains
Clinics can register custom domains (e.g., physio-bucharest.ro) that route to their organization's app. Cloudflare for SaaS handles the per-tenant TLS that would otherwise require ACM cert juggling on the ALB.
The flow:
Clinic adds custom domain in Console
│
▼
Console calls Cloudflare for SaaS API: create custom hostname
│
▼
Cloudflare returns the CNAME target (cname.restartix.pro)
and the TXT verification token
│
▼
Clinic adds the records at their DNS provider
│
▼
Cloudflare verifies ownership, provisions Let's Encrypt cert,
adds the hostname to the SaaS hostname pool
│
▼
Traffic to physio-bucharest.ro → Cloudflare edge
→ TLS terminated with the per-tenant cert
→ forwarded to our AWS ALB origin
→ proxy.ts reads the original Host header (`X-Forwarded-Host`)
→ resolves the org via /v1/public/organizations/resolve?domain=

What we build: a small Go integration with the Cloudflare for SaaS Custom Hostnames API (create, poll status, surface errors back to the admin UI — sketched below). What Cloudflare handles: cert issuance, renewal, edge termination, and fallback when a custom hostname's DNS misconfiguration is repaired.
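The create step against the Cloudflare Custom Hostnames API, sketched with plain net/http. The zone ID and token come from the restartix/{env}/cloudflare secret; error handling is abbreviated and the response parsing is elided:

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// createCustomHostname registers one clinic domain for TXT-validated TLS.
func createCustomHostname(ctx context.Context, zoneID, token, hostname string) error {
	body, err := json.Marshal(map[string]any{
		"hostname": hostname,
		"ssl":      map[string]string{"method": "txt", "type": "dv"},
	})
	if err != nil {
		return err
	}
	url := fmt.Sprintf("https://api.cloudflare.com/client/v4/zones/%s/custom_hostnames", zoneID)
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("cloudflare custom hostname create: %s", resp.Status)
	}
	// Poll GET .../custom_hostnames/{id} afterwards until ssl.status is active,
	// surfacing intermediate states back to the admin UI.
	return nil
}
```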
The ALB origin holds one wildcard ACM cert for *.restartix.pro covering all platform subdomains ({slug}.clinic.restartix.pro, {slug}.portal.restartix.pro, console.restartix.pro) plus the Cloudflare-origin hostname. No per-clinic cert lifecycle on the ALB.
See decisions.md → Why Cloudflare for SaaS over rolling our own ACM-on-ALB cert flow.
DNS
Cloudflare hosts DNS for restartix.pro and serves all subdomain records. Route 53 is not used — adding it would split DNS authority across two providers for no benefit.
ACM certificate validation uses DNS validation via Cloudflare TXT records — one-time setup, then ACM auto-renews indefinitely.
Load balancing: ALB
One Application Load Balancer per environment, multi-AZ, behind Cloudflare.
- Listeners: 443 (HTTPS, ACM wildcard cert for `*.restartix.pro`), 80 redirects to 443
- Target groups: one per Fargate service, each with its own health check path (`/health`)
- Routing rules: host-header-based — `*.clinic.restartix.pro` → clinic target group, `*.portal.restartix.pro` → portal target group, `console.restartix.pro` → console target group, custom-domain origin host → matching target group based on `X-Forwarded-Host` resolution downstream
- Sticky sessions: not enabled (services are stateless; future SSE / WebSocket features may need ALB stickiness, decided per-feature)
Health check paths use the existing Go /health endpoint (process is alive) rather than a deep /ready (deps reachable) — deep checks turn flaky DB into rolling unhealthy targets.
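The shallow contract in one handler, as a minimal sketch (port and body are illustrative): report process liveness only, never dependency reachability, so a flaky DB cannot cascade into ALB target deregistration.

```go
package main

import "net/http"

func main() {
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		// No DB ping, no Redis ping — by design. A deep /ready would turn
		// a transient dependency blip into rolling unhealthy targets.
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":9000", nil) // one of the app ports in the fargate-app SG
}
```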
Observability: CloudWatch
Logs
One log group per service, retention configured at log group creation (CloudWatch's default is "Never expire" — easy to forget, expensive to leave):
| Log group | Retention |
|---|---|
| /ecs/restartix-{env}/core-api | 90 days (production), 30 days (staging) |
| /ecs/restartix-{env}/telemetry-api | 90 days (production), 30 days (staging) — provisioned when Layer 2 telemetry ships |
| /ecs/restartix-{env}/clinic | 30 days |
| /ecs/restartix-{env}/portal | 30 days |
| /ecs/restartix-{env}/console | 30 days |
| /ecs/restartix-{env}/pgbouncer | 14 days |
Alarms and dashboard
Full alarm catalog and dashboard layout live in monitoring.md. The summary:
- Application alarms — error rate, p99 latency, unhealthy target count, scheduled-task failure
- Database alarms — high connection count, high CPU, low free storage, replica lag (Multi-AZ)
- Infrastructure alarms — NAT Gateway error rate, ALB 5xx, Fargate task restart loops
- Cost alarms — AWS Budgets alerts at 50%, 80%, 100% of monthly budget
SNS topic restartix-alerts-{env} fans alarms out to email + Slack via AWS Chatbot.
Email: SES
Production
- Identity verification: production identity for the platform sender domain (e.g., [email protected]), SPF + DKIM + DMARC configured
- Suppression list: account-level suppression auto-adds hard bounces and complaints
- Sandbox exit: requested via AWS support ticket before launch (typically 24–48h turnaround)
- Sending limits: production-grade quota negotiated based on expected volume
- Bounce / complaint webhooks: SES → SNS → Core API endpoint that flips affected recipients into the platform suppression table
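A sketch of the receiving endpoint's shape, assuming SNS HTTP delivery of the SES bounce notification. Field names follow the published SNS/SES JSON formats as best recalled; SubscriptionConfirmation handling and the suppression-table write are elided:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// snsEnvelope is the outer SNS HTTP-delivery payload; Message carries the
// SES notification as a JSON string.
type snsEnvelope struct {
	Type    string `json:"Type"`
	Message string `json:"Message"`
}

type sesBounce struct {
	NotificationType string `json:"notificationType"`
	Bounce           struct {
		BouncedRecipients []struct {
			EmailAddress string `json:"emailAddress"`
		} `json:"bouncedRecipients"`
	} `json:"bounce"`
}

func bounceHandler(w http.ResponseWriter, r *http.Request) {
	var env snsEnvelope
	if err := json.NewDecoder(r.Body).Decode(&env); err != nil {
		http.Error(w, "bad payload", http.StatusBadRequest)
		return
	}
	var b sesBounce
	if err := json.Unmarshal([]byte(env.Message), &b); err != nil || b.NotificationType != "Bounce" {
		w.WriteHeader(http.StatusOK) // ignore non-bounce notifications
		return
	}
	for _, rcpt := range b.Bounce.BouncedRecipients {
		// Flip the recipient into the platform suppression table here.
		log.Printf("suppressing %s after hard bounce", rcpt.EmailAddress)
	}
	w.WriteHeader(http.StatusOK)
}
```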
Staging
- Sandbox or low-volume production identity, used only by automated tests and ad-hoc devbox flows
- Same DKIM setup so tested flows are realistic
Wired through capability layer
SES is a Cat A curated provider per 1C.2. Credentials live in platform_service_providers, not in environment variables. The notification dispatcher resolves the SES provider on every send; rotation is a Console superadmin action, not a redeploy. See foundation 1A.18.
Telemetry sub-stack
Telemetry is a Layer 2 feature, not a Day-1 launch service. The architecture is locked: separate Go service (services/telemetry/) for ingest, the same RDS Postgres cluster as Core API for clinical aggregates, and S3 for replay blobs. No separate compliance Postgres, no ClickHouse, no TimescaleDB. See decisions.md → Why telemetry is PG + S3, not ClickHouse for the redesign rationale and /telemetry/index.md for the full design.
Components when Layer 2 ships
- Telemetry API Fargate task — a separate ECS service alongside Core API in the same VPC and ALB. Sizing analogous to Core API at launch (single small task, horizontal scale at Tier 1+); auto-scaling by CPU. Multi-AZ in production, single-task on Spot in staging.
- Postgres aggregates — same RDS cluster as Core API. Adds `pose_session_metrics`, `pose_rep_metrics` (monthly partitioned per P41), `media_session_metrics`, `media_buffering_events` (monthly partitioned). Plus updates to existing `patient_exercise_logs`. No new instance.
- S3 bucket `restartix-telemetry-{env}` — replay blobs at `{org_id}/{session_id}.bin.gz`. Lifecycle: Standard → IA at 90 days → Glacier at 1 year → expire at retention horizon. KMS-encrypted. Cross-region replication deferred to Phase 2 along with other S3 buckets.
- Auth wiring — Cat F service-account principal for Telemetry → Core API callbacks (events.Bus publishing). Signed-session-token issuance helper in Core API; verifier in Telemetry API (sketched after this list). HS256 secret in Secrets Manager, rotated per credential-rotation runbook.
- Network paths — Patient Portal browser → Telemetry API (public HTTPS, signed token in header). Telemetry API → S3 (VPC endpoint). Telemetry API → Core API for events.Bus publish (private subnet). No public Core-API-to-Telemetry path; no Telemetry-to-Postgres direct path (Telemetry only writes via the events.Bus / Core API subscriber).
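The HS256 issue/verify pair, sketched under stated assumptions (raw HMAC over a compact payload; the real helper may use a JWT library and a richer claim set):

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"strings"
)

// issue signs a payload (e.g. org_id|session_id|expiry) with the shared
// secret from Secrets Manager. Runs in Core API.
func issue(secret []byte, payload string) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(payload))
	sig := base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
	return payload + "." + sig
}

// verify runs in the Telemetry API on every ingest request; expiry
// checking on the payload is elided here.
func verify(secret []byte, token string) (string, bool) {
	i := strings.LastIndexByte(token, '.')
	if i < 0 {
		return "", false
	}
	payload, got := token[:i], token[i+1:]
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(payload))
	want := base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
	return payload, hmac.Equal([]byte(got), []byte(want))
}
```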
Sizing & cost (Layer 2)
Conservative launch profile (matches Tier 0 — up to ~1k peak concurrent):
Telemetry API Fargate (single task @ 0.5 vCPU / 1 GB) ~$7 (staging Spot) / ~$40 (prod on-demand × 2 Multi-AZ)
S3 telemetry bucket ~$1 (staging) / ~$5–15 (prod, scales with active sessions)

PG aggregate tables ride on the existing RDS instance — no incremental DB cost. No ClickHouse line item. No separate compliance Postgres line item.
Tier 3 (50k+ peak concurrent, if/when ClickHouse joins) would add a managed-CH cost line of ~$3–6k/year baseline plus per-volume charges. Not budgeted today; reachable via the swap-point interfaces (see /telemetry/index.md → Scaling roadmap).
Migrations and deploys
The contract: a git push origin master (after PR merge) builds Docker images, pushes to ECR, runs migrations as a one-shot Fargate task if they changed, then triggers ECS rolling deploys per service. Production requires a manual approval gate in GitHub Actions before the deploy fires.
Full pipeline mechanics — branch protection, OIDC federation, image tagging, migration handling, rollback — live in deployment.md.
Backups and DR
The full 3-2-1-1 architecture and runbooks live in backup-disaster-recovery.md. Summary by layer + when each closes:
- Layer 1 — RDS PITR + automated snapshots. Production: continuous WAL → PITR to any second within the 7-day retention window, daily automated snapshots, manual snapshots before risky deploys. Staging (Aurora Serverless v2): 1-day retention, staging-grade. Closes with 1E.3.
- Layer 2 — daily `pg_dump` to a separate S3 bucket (vendor-independent restore path; same pattern as legacy). Bucket has Object Lock COMPLIANCE mode + 7-year retention, lifecycle Standard → Glacier IA → Deep Archive, separate KMS context from RDS, separate IAM role. IaC plus one end-to-end manual test pass at 1E.3 (substrate validation). Daily cadence in production is on at launch. Daily cadence in staging is a knob, off by default — enable for alarm tuning or a production-launch dress rehearsal.
- Layer 4 — quarterly offline archive. Optional; only if state audit explicitly requires it.
S3 cross-region replication for the uploads bucket and the audit-archive bucket is also deferred to Phase 2 — separate from the Layer 3 backup question.
RPO / RTO targets, encryption-key separation, and the four restore runbooks (PITR, daily-backup, cross-region, offline) live in backup-disaster-recovery.md.
HIPAA & GDPR posture
- AWS BAA: accepted via AWS Artifact at the account level. Covers every AWS service in the stack.
- EU residency: all primary data in `eu-central-1`. No replication outside the EU.
- Sub-processors disclosed in the DPA:
- AWS (US-incorporated, EU data center; covered by SCCs + AWS DPA)
- Cloudflare (US-incorporated, EU data centers used; covered by SCCs + Cloudflare DPA)
- Clerk (US-based; auth only, no patient health data; SCCs)
- Daily.co (US-based; telerehab video; SCCs)
- Anthropic (US-based; AI inference, no PHI passes through; SCCs)
- Encryption obligations:
  - In transit: TLS 1.2+ everywhere, enforced (`rds.force_ssl = 1`, ALB-only listener on 443; internal Fargate-to-Fargate traffic is cleartext on the private subnet — acceptable per AWS shared responsibility)
  - At rest: AWS-managed KMS for RDS, ElastiCache, S3, EBS, Secrets Manager
  - Column-level: AES-256-GCM via `internal/core/crypto/` for `pii_regulated` and `auth_secret` columns per P12
- Audit retention: 6+ years per CLAUDE.md, achieved via the S3 audit-archive bucket lifecycle (Standard → Glacier IA → Glacier Deep Archive)
US-clinic / HIPAA-active path is deferred until the first US clinic signs. See the customer-managed KMS migration trigger above for the technical scope when that happens.
Cost: staging
ECS Fargate (Spot)
Core API (1× 0.5 vCPU / 1 GB) ~$7
Clinic (1× 0.25 vCPU / 0.5 GB) ~$4
Portal (1× 0.25 vCPU / 0.5 GB) ~$4
Console (1× 0.25 vCPU / 0.5 GB) ~$4
pgbouncer (1× 0.25 vCPU / 0.5 GB, on-demand) ~$10
Telemetry API (Layer 2; not on Day 1) ~$7 (single Spot task when it ships)
Database
Aurora Serverless v2 (0.5–2 ACU, scale-to-zero) ~$15–30
(Telemetry rides on the same DB — no separate instance)
Cache
ElastiCache Redis cache.t4g.micro ~$13
Networking
ALB ~$20
NAT instance (t4g.nano) ~$3
Data transfer ~$1
Storage
S3 (uploads + archives, low volume) ~$1
S3 backup bucket (Layer 2; cron off by default) ~$0
ECR (shared with prod) ~$1
Encryption / secrets
KMS (1 CMK, SM envelope) ~$1
Secrets Manager (~10 secrets) ~$4
IaC state backend
S3 native locking (no DynamoDB needed) ~$0
Observability
CloudWatch (30d retention, light alarms) ~$2
Email
SES (sandbox or minimal volume) ~$0
──────
AWS staging subtotal ~$90 (Day 1) / ~$97 (with Telemetry Layer 2)
Cloudflare
Free tier + Cloudflare for SaaS basic ~$7
──────
STAGING TOTAL ~$97 (Day 1) / ~$104 (with Telemetry Layer 2)

The 1E.3 spec calls for "<$100/mo idle staging." Hit on Day 1; the Telemetry Layer 2 add of ~$7 keeps it within margin.
Cost: production day 1
ECS Fargate (on-demand, Multi-AZ)
Core API (2× 1 vCPU / 2 GB) ~$83
Clinic (2× 0.5 vCPU / 1 GB) ~$41
Portal (2× 0.5 vCPU / 1 GB) ~$41
Console (1× 0.25 vCPU / 0.5 GB) ~$10
pgbouncer (2× 0.25 vCPU / 0.5 GB) ~$21
Telemetry API (Layer 2; 2× 0.5 vCPU/1 GB) ~$40
Database
RDS db.t4g.medium Multi-AZ ~$124
RDS storage (50 GB gp3 + backups) ~$10
(Telemetry aggregates on same RDS)
Cache
ElastiCache Redis cache.t4g.small + replica ~$51
Networking
ALB ~$25
NAT Gateway (single AZ + processed GB) ~$43
Data transfer (origin → Cloudflare) ~$10
Storage
S3 uploads (~100 GB Standard) ~$5
S3 audit archives (lifecycled) ~$1
S3 backup bucket (Layer 2 daily pg_dump,
~5 GB at launch; ~$160/mo at 500 GB) ~$2
ECR (4 repos with lifecycle) ~$2
Encryption / secrets
KMS (1 CMK, SM envelope) ~$1
Secrets Manager ~$4
IaC state backend
S3 native locking (no DynamoDB needed) ~$0
Observability
CloudWatch (90d retention, alarms, dashboards) ~$20
Email
SES (~200k transactional emails) ~$20
──────
AWS production subtotal ~$510 (Day 1) / ~$555 (with Telemetry Layer 2 + S3)
Cloudflare
Pro plan ($20) ~$20
Cloudflare for SaaS ($7 + ~80 hostnames @ $0.10) ~$15
──────
PRODUCTION TOTAL ~$545 (Day 1) / ~$590 (with Telemetry Layer 2)

External services (Clerk, Daily.co, Anthropic) are NOT AWS costs and are listed separately in scaling.md → Cost summary.
Phase 2 estimate (10–50 clinics)
Phase 2 changes (per scaling.md):
- RDS upgraded to db.r6g.large + 2 read replicas
- Fargate fleets scaled out (more tasks per service, possibly larger task sizes)
- ElastiCache to cache.t4g.small
- HA NAT (one Gateway per AZ)
- More Cloudflare for SaaS hostnames
Approximate AWS+Cloudflare total: ~$1,300–1,500/mo (telemetry included at Tier 1 sizing). Full breakdown lives in scaling.md.
What we don't use and why
| Service | Why not |
|---|---|
| AWS App Runner | Cannot host scheduled tasks, init containers, or TCP services (pgbouncer). Mixing it with Fargate doubles the operational surface. Fargate everywhere is simpler. |
| RDS Proxy | Pins client-to-backend connections on prepared statements; pgx uses prepared statements by default. pgbouncer transaction-mode is the working alternative. |
| Route 53 | Cloudflare is already the DNS authority for restartix.pro. Splitting DNS across two providers adds operational noise without benefit. |
| AWS WAF | Cloudflare's WAF runs at the edge, before traffic reaches AWS. Duplicating the WAF layer wastes money. |
| AWS Amplify Hosting | Designed for static sites; the Next.js apps are server-rendered and need the full container shape that Fargate provides. |
| AWS Certificate Manager for tenant custom domains | Per-clinic cert lifecycle on the ALB is real engineering work. Cloudflare for SaaS handles this as a configuration line. |
| Multi-region | Out of scope per CLAUDE.md → Project Overview and scaling.md → Beyond Phase 2. |
| Per-tenant dedicated infrastructure | Permanently out of scope. The dedicated tenancy mode (F13) is logical isolation (per-tenant Clerk org + RLS + provider_org_id + branded domains), not separate RDS / KMS / compute. |
| EC2 instances (other than the staging NAT instance) | Fargate covers every long-running compute need; EC2 management is operational debt we don't need. |
| EKS / Kubernetes | Same workload runs on Fargate with a fraction of the operational complexity. K8s would be appropriate at a scale we don't reach. |
Related documentation
- Deployment & CI/CD — pipeline mechanics, rollback, runbooks
- IaC layout — Terraform module structure
- Scaling plan — growth phases and triggers
- Backup & DR — full 3-2-1-1 backup architecture
- Monitoring — alarms, dashboards, incident response
- Decisions — architectural rationale (AWS choice, Fargate over App Runner, Cloudflare for SaaS, Terraform)
- External providers — sub-processor list with regions and contracts