AWS Infrastructure

The RestartiX platform runs on AWS in eu-central-1 (Frankfurt) with Cloudflare at the edge. Compute is ECS Fargate for every long-running process. Data lives in RDS Postgres (production) or Aurora Serverless v2 (staging) plus ElastiCache Redis and S3. Cloudflare handles DNS, CDN, WAF, and per-tenant custom-domain TLS via Cloudflare for SaaS. Infrastructure is managed entirely as code with Terraform.

This document describes the steady-state architecture — services, sizing, networking, costs. Operational concerns (deployment, monitoring, scaling, decision records) live in the linked docs.


Provider stack at a glance

                    ┌────────────────────────────────────────────────────┐
   Patients ──────► │  Cloudflare (edge)                                  │
   Specialists      │  • DNS for restartix.pro                            │
   Admins           │  • CDN for /_next/static/*                          │
                    │  • WAF + DDoS + bot protection                      │
                    │  • Cloudflare for SaaS (per-clinic custom domains)  │
                    └────────────────┬───────────────────────────────────┘
                                     │ HTTPS

                    ┌────────────────────────────────────────────────────┐
                    │  AWS eu-central-1 (Frankfurt)                       │
                    │                                                     │
                    │  ┌─────────────────────────────────────────────┐   │
                    │  │  Application Load Balancer                   │   │
                    │  │  (TLS via ACM, host-based routing)           │   │
                    │  └────────────────┬────────────────────────────┘   │
                    │                   │                                  │
                    │  ┌────────────────▼────────────────────────────┐   │
                    │  │  ECS Fargate cluster                          │   │
                    │  │  ┌────────────┐ ┌────────────┐ ┌──────────┐ │   │
                    │  │  │ Core API   │ │ Telemetry  │ │ pgbouncer│ │   │
                    │  │  │ (Go)       │ │ API (Go)   │ │          │ │   │
                    │  │  └────────────┘ └────────────┘ └──────────┘ │   │
                    │  │  ┌────────────┐ ┌────────────┐ ┌──────────┐ │   │
                    │  │  │ Clinic     │ │ Portal     │ │ Console  │ │   │
                    │  │  │ (Next.js)  │ │ (Next.js)  │ │ (Next.js)│ │   │
                    │  │  └────────────┘ └────────────┘ └──────────┘ │   │
                    │  └──────────────┬─────────────┬────────────────┘   │
                    │                 │             │                      │
                    │  ┌──────────────▼──┐  ┌──────▼───────────┐         │
                    │  │ RDS Postgres 17 │  │ ElastiCache       │         │
                    │  │ (or Aurora SLv2) │  │ Redis             │         │
                    │  └─────────────────┘  └──────────────────┘         │
                    │                                                     │
                    │  ┌──────────┐  ┌──────────┐  ┌──────────────────┐ │
                    │  │ S3       │  │ KMS      │  │ Secrets Manager   │ │
                    │  │ (uploads │  │ (column- │  │ (DB creds, API    │ │
                    │  │ +archive)│  │ keys)    │  │  keys, etc.)      │ │
                    │  └──────────┘  └──────────┘  └──────────────────┘ │
                    │                                                     │
                    │  ┌──────────┐  ┌──────────────────────────────────┐│
                    │  │ SES      │  │ ECR + CloudWatch Logs/Alarms     ││
                    │  └──────────┘  └──────────────────────────────────┘│
                    └────────────────────────────────────────────────────┘

  External services (sub-processors, disclosed in the DPA):
  ├── Clerk (auth, US)
  ├── Daily.co (telerehab video, US)
  └── Anthropic (AI agents, US)

Region: eu-central-1 (Frankfurt)

Frankfurt is the chosen region for production and staging. The decision is driven by:

  • EU data residency — GDPR Day-1 requirement. All patient data must remain in the EU. Frankfurt sits inside the EU and the EEA.
  • Latency to Romania — ~25-35ms RTT to Bucharest, the lowest of any AWS EU region.
  • Service availability — every service we use (App Runner-class workloads on Fargate, Aurora Serverless v2, ElastiCache Redis 7, KMS, SES, etc.) is GA in eu-central-1.
  • HIPAA BAA — accepted in AWS Artifact at the account level; covers every AWS service we use at no additional cost.

There is no multi-region architecture. Cross-region read replicas, multi-region writes, and per-tenant region selection are all out of scope per CLAUDE.md → Project Overview and scaling.md → Beyond Phase 2.


Services overview

| Service | Purpose | Production | Staging |
|---|---|---|---|
| ECS Fargate | Compute for all long-running processes | On-demand, Multi-AZ tasks | Spot, single-AZ tasks |
| RDS Postgres 17 | Primary database | db.t4g.medium Multi-AZ | — |
| Aurora Serverless v2 | Primary database | — | 0.5–2 ACU, scale-to-zero |
| ElastiCache Redis 7 | Rate limits, hold slots, cache-aside | cache.t4g.small + replica | cache.t4g.micro single node |
| Application Load Balancer | TLS termination, host-based routing | 1 ALB, multi-AZ | 1 ALB, single-AZ |
| S3 | File uploads + audit archives | 2 buckets, versioning, lifecycle | 2 buckets, shorter retention |
| KMS | Column-level encryption keys | 1 customer-managed key | 1 customer-managed key |
| Secrets Manager | DB creds, API keys, signing secrets | ~10 secrets | ~10 secrets |
| SES | Transactional email | Production identity, DKIM, suppression list | Sandbox or low-volume identity |
| ECR | Container registry | Shared between environments, lifecycle policy | Shared |
| CloudWatch | Logs + alarms + dashboards | Full alarm set, 90d log retention | Basic alarms, 30d log retention |
| VPC + NAT | Private networking | NAT Gateway (single AZ to start) | t4g.nano NAT instance |
| GitHub Actions OIDC | Deploy-time AWS auth | OIDC provider + deploy role | Same OIDC provider |

Networking

VPC layout

One VPC per environment (restartix-staging, restartix-production), CIDR 10.0.0.0/16. Each VPC has:

  • 2 public subnets (one per AZ) — host the NAT Gateway / NAT instance and the ALB
  • 2 private subnets (one per AZ) — host all Fargate tasks, RDS, ElastiCache
  • VPC endpoints for S3, ECR, Secrets Manager, KMS, CloudWatch Logs — eliminate NAT egress for AWS-internal traffic and reduce data-processing costs

Production uses both AZs for Multi-AZ. Staging deploys everything to a single AZ for cost; the second AZ exists in the VPC layout but has no resources running in it.

Security groups

| Group | Inbound | Source |
|---|---|---|
| alb | 443 from internet (TCP) | 0.0.0.0/0 |
| fargate-app | App ports (9000, 9100, 9200, 9300, 4000) from ALB | alb SG |
| fargate-pgbouncer | 6432 from app tasks | fargate-app SG |
| rds | 5432 from pgbouncer + migration runner | fargate-pgbouncer SG, migrations-runner SG |
| redis | 6379 from app tasks | fargate-app SG |

Database is never exposed to the internet. Direct psql access from a developer laptop goes through AWS SSM Session Manager port forwarding to a dedicated migrations-runner-style task or the pgbouncer task. No SSH bastion, no RDS public endpoint.

Egress to internet

App tasks reach external services (Clerk, Daily.co, Anthropic, Cloudflare) via:

  • Production: NAT Gateway, single AZ to start (~$38/mo + per-GB processing). HA NAT (one per AZ) is a Phase 2 upgrade once traffic justifies it.
  • Staging: t4g.nano NAT instance (~$3/mo). Single point of failure is acceptable for a staging environment; the trade-off is documented.

VPC endpoints handle S3, ECR, Secrets Manager, KMS, and CloudWatch Logs without going through NAT, which keeps NAT Gateway processing costs minimal.

Inbound from internet

All HTTPS traffic comes through Cloudflare. The ALB accepts traffic only from Cloudflare's IP ranges (security group rule), which protects the origin from direct DDoS and prevents bypass of WAF rules. Cloudflare passes through:

  • Original Host header for tenant resolution in proxy.ts
  • X-Forwarded-For and CF-Connecting-IP for client IP recording in audit logs
  • X-Forwarded-Host when traffic arrives via Cloudflare for SaaS custom-domain routing

Compute: ECS Fargate

A single ECS cluster per environment (restartix-staging, restartix-production) hosts every long-running process. Fargate is used everywhere — there are no EC2 instances managed by the platform.

Service breakdown

| Service | Image | Production | Staging |
|---|---|---|---|
| Core API | services/api/cmd/api | 2× (1 vCPU / 2 GB), scale 2–10 on CPU | 1× (0.5 vCPU / 1 GB), Spot |
| Telemetry API (Layer 2, ships post-foundation) | services/telemetry | 2× (0.5 vCPU / 1 GB) Multi-AZ | 1× (0.5 vCPU / 1 GB) Spot |
| Clinic app | apps/clinic | 2× (0.5 vCPU / 1 GB), scale 2–8 on CPU | 1× (0.25 vCPU / 0.5 GB), Spot |
| Portal app | apps/portal | 2× (0.5 vCPU / 1 GB), scale 2–8 on CPU | 1× (0.25 vCPU / 0.5 GB), Spot |
| Console app | apps/console | 1× (0.25 vCPU / 0.5 GB), fixed | 1× (0.25 vCPU / 0.5 GB), Spot |
| pgbouncer | services/api/deploy/pgbouncer | 2× (0.25 vCPU / 0.5 GB), one per AZ | 1× (0.25 vCPU / 0.5 GB) |

Why ECS Fargate everywhere, not App Runner. App Runner cannot run scheduled tasks (we need them for cmd/audit-partition-roll, cmd/usage-quota-reset, cmd/usage-summary-rollup, cmd/check-providers), cannot run init/migration containers as part of a service deploy, and cannot host TCP services (pgbouncer must be on Fargate either way). Mixing App Runner and Fargate means operating two compute platforms; consolidating on Fargate keeps the IaC single-shaped. See decisions.md → Why ECS Fargate over App Runner.

Auto-scaling

Each service has an Application Auto Scaling target with a target-tracking policy on average CPU utilization:

hcl
resource "aws_appautoscaling_policy" "core_api_cpu" {
  name               = "core-api-cpu"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.core_api.service_namespace
  resource_id        = aws_appautoscaling_target.core_api.resource_id
  scalable_dimension = aws_appautoscaling_target.core_api.scalable_dimension

  target_tracking_scaling_policy_configuration {
    target_value       = 70.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}

Scale-out is fast (60s cooldown), scale-in is slow (5min cooldown) to avoid flapping. Bounds (min_capacity, max_capacity) live in Terraform — adjusting them is a PR + apply, no service restart.

The Console app does not auto-scale — it serves a small fixed audience (superadmins) and pinning to one task simplifies session-related debugging.

Spot vs on-demand

  • Production: on-demand only. The cost premium over Spot is small at this scale and Spot evictions, while rare, are an operational distraction we don't need.
  • Staging: Fargate Spot for everything except pgbouncer. Spot is ~70% cheaper. Eviction handling is automatic — ECS replaces the task within ~30 seconds.

pgbouncer stays on-demand even in staging because eviction would briefly drop the entire DB connection layer; the savings don't justify the noise.

Scheduled tasks

cmd/audit-partition-roll, cmd/usage-quota-reset, cmd/usage-summary-rollup, cmd/check-providers, and cmd/expired-sessions-sweep run as EventBridge Scheduler → ECS RunTask jobs:

| Task | Schedule | Purpose |
|---|---|---|
| audit-partition-roll | Day 1 of each month, 02:00 UTC | Provisions next 3 monthly partitions for audit_log, audit_ai_provenance, webhook + inbound webhook tables, usage_records |
| usage-quota-reset | Day 1 of each month, 00:05 UTC | Resets usage_quotas.current_units = 0, advances period_start_at / period_end_at |
| usage-summary-rollup | Day 1 of each month, 03:00 UTC | Closes the prior month's usage_summaries row per (org × capability) |
| check-providers | Every 5 min in staging, every 1 min in prod | Healthchecks every row in platform_service_providers, flips status on failure |
| expired-sessions-sweep | Every 15 min, all envs | Finalizes orphan-expired break_glass_sessions + patient_impersonation_sessions (1B.11). Writes closed_at = expires_at + a system-attributed close audit row with action_context = break_glass / impersonation |

Each scheduled task uses the same task definition family as the corresponding service binary (Core API for the audit/quota/rollup/check binaries) — different command, same image, same IAM role.


Connection pooling: pgbouncer on Fargate

The Core API uses pgx with two pools per process (admin + app role per P2). At fleet scale (5–10 Fargate tasks × 2 pools × 25 conns) that's 250–500 connections fanning out from the API tier. RDS's max_connections=200 would either reject connections or burn ~6 GB on idle pool slots without a pooler in front.

pgbouncer in transaction pool mode sits between the application tier and RDS, accepting up to 1000 client connections and multiplexing them onto a small set of backend connections (~25 per pgbouncer task). The application doesn't know there's a pooler — it connects to a different DSN.

Why pgbouncer, not RDS Proxy

RDS Proxy is the AWS-native alternative and would slot into Fargate cleanly. We don't use it because it pins client-to-backend connections when it sees prepared statements, and pgx uses named prepared statements by default (QueryExecModeCacheStatement). Pinning eliminates the multiplexing benefit — the entire reason for adding a pooler. The two ways to make RDS Proxy work would be (1) switching pgx to QueryExecModeCacheDescribe or QueryExecModeExec (loses ~10–30% query throughput on repeated queries) or (2) waiting for RDS Proxy to add protocol-level prepared-statement support (no published timeline).

Self-hosted pgbouncer 1.25 supports protocol-level prepared statements transparently with max_prepared_statements=200. The pgx default works, plan caching benefits intact. Trade-off: one extra Fargate service. pgbouncer is a single static binary with a single config file — minimal operational footprint.

Local vs AWS config

The pgbouncer.ini shipped at services/api/deploy/pgbouncer/ is identical to what the Fargate task uses except for the auth method:

| Setting | Local (docker-compose) | AWS (ECS Fargate) |
|---|---|---|
| Image | edoburu/pgbouncer:v1.25.1-p0 | Same image, mirrored to ECR |
| auth_type | plain (passwords in userlist.txt) | scram-sha-256 |
| Auth source | userlist.txt (committed) | auth_query against a SECURITY DEFINER Postgres function; pgbouncer's own credential from Secrets Manager |
| Backend host | postgres | RDS writer endpoint or Aurora Serverless v2 cluster endpoint |
| TLS to backend | disable | require |
| Replicas | 1 | 2 (one per AZ) behind ALB |
| pool_mode | transaction | transaction |
| max_prepared_statements | 200 | 200 |
| default_pool_size | 25 | 25 |
| max_client_conn | 1000 | 1000 |

Migrations bypass pgbouncer

golang-migrate uses session-scoped pg_advisory_lock to serialize migration runs across deploying instances. Advisory locks are session features — pgbouncer in transaction mode would release them mid-migration. Migrations run as a one-shot ECS task using DATABASE_DIRECT_URL, which points directly at the RDS or Aurora cluster endpoint. The migration task's security group is allowed direct port 5432 to the database for that task only. See deployment.md for the deploy-pipeline mechanics.

No session-mode Postgres features in runtime paths

Per P44: no advisory locks, no LISTEN/NOTIFY, no SET (use set_config(..., true) for transaction-scoped state), no temp tables. Anything session-scoped breaks under transaction-mode pooling.


Database

Production: RDS Postgres 17, Multi-AZ

yaml
Engine: PostgreSQL 17
Instance: db.t4g.medium (2 vCPU, 4 GB RAM)
Storage: 50 GB gp3 (3000 IOPS baseline), auto-scaling enabled (max 200 GB)
Multi-AZ: Enabled (synchronous standby in second AZ, automatic failover)
Encryption at rest: Enabled (AWS-managed key in Phase 1; CMK migration trigger documented below)
Encryption in transit: Required (rds.force_ssl = 1)
Public access: Disabled
Backup retention: 7 days (continuous WAL → PITR to any second within window)
Performance Insights: Enabled (free tier, 7-day retention)
Enhanced Monitoring: Enabled (1-minute granularity)

Parameter group (custom):

shared_preload_libraries: pg_stat_statements
rds.force_ssl: 1
max_connections: 200
shared_buffers: 1GB
effective_cache_size: 3GB
work_mem: 32MB
maintenance_work_mem: 256MB

Extensions (created at migration time, available in the engine):

  • pgcrypto, uuid-ossp — general
  • unaccent, pg_trgm — diacritic-folded picker search per 1A.16
  • vector — pre-loaded for AI features
  • pg_stat_statements — slow-query observability

Staging: Aurora Serverless v2, single-AZ, scale-to-zero

yaml
Engine: aurora-postgresql 17
Instances: 1 writer (db.serverless), single-AZ
Capacity: 0.5–2 ACU, scale-to-zero enabled (idle compute = $0/hr)
Storage: Aurora-managed, auto-scales
Encryption at rest: Enabled (AWS-managed key)
Encryption in transit: Required
Backup retention: 1 day (staging — production-grade backups not needed)

Same Postgres wire protocol, same extensions, same parameter shape. The application connects via DATABASE_URL / DATABASE_APP_URL exactly as it does to RDS. The cluster scales down to 0 ACU when idle (~5 minutes of inactivity), wakes in 5–15 seconds when traffic resumes — acceptable cold-start for a staging environment used by the dev team.

Why Aurora Serverless v2 for staging, RDS for production

  • Staging is mostly idle. Multi-AZ RDS for staging would burn ~$110/mo on a database that nobody is hitting most of the day. ASv2 scale-to-zero is the meaningful win.
  • Production wants predictable costs and predictable performance. RDS at a fixed instance size is easier to capacity-plan. ASv2 charges per ACU-hour; under sustained load it can exceed the equivalent fixed instance.
  • Same operational plane. Both are managed RDS-family services — same console, same Terraform provider, same Secrets Manager pattern. Switching staging to RDS later (or production to ASv2) is a parameter change, not a re-architecture.

See decisions.md → Why Aurora Serverless v2 for staging only for the full rationale.

Direct-KMS keyring + BYOK (Phase 2+)

Phase 1 (current, includes 1E.3) — column-encryption keys are KMS-rooted via Secrets Manager envelope. A customer-managed KMS key is provisioned per environment ($1/mo per CMK, see cost shapes) and used as the envelope key for the restartix/{env}/encryption Secrets Manager secret containing the column-encryption keyring. At Core API boot, SecretsManager.GetSecretValue transparently decrypts via that CMK once; the resulting plaintext keyring is held in process memory; AES-256-GCM column operations are local from there. RDS at-rest encryption uses AWS-managed KMS in Phase 1 (HIPAA-eligible, free, auditable via CloudTrail) — the customer-managed CMK only fronts the column-encryption keys + the backup-envelope key in Phase 1. The internal/core/crypto/kmsKeyring stub is reserved for Phase 2 — direct per-data-key KMS calls — and is not in the Phase 1 hot path.

Phase 2 (deferred) — direct-KMS keyring + per-tenant key custody. Triggered by (a) the first US-based clinic signing and HIPAA BAA scope expanding beyond AWS-managed defaults, (b) the first paying dedicated-mode clinic contract requiring customer-managed key custody (BYOK) as a procurement gate, or (c) an external compliance audit (SOC 2 Type II, ISO 27001, HDS) flagging the SM-envelope shape as an exception to remediate. Until one of those fires, Phase 1 stays as-is.

What Phase 2 changes: RDS gets re-encrypted under a customer-managed CMK (snapshot → restore with new key, ~1h downtime window or live with read-replica promotion). The internal/core/crypto/kmsKeyring stub is implemented — at startup, fetch a KMS-encrypted DEK blob (from Secrets Manager or directly from the kms:GenerateDataKey API), decrypt via kms:Decrypt, hold plaintext in memory. Per-tenant key custody (BYOK) layers in: each tenant's encrypted columns can be sealed under a tenant-specific CMK, with the column code routing decrypts via the tenant ID. The USE_KMS_ENCRYPTION flag flips to true per environment as the rollout proceeds. The migration is not designed in detail here — that's the Phase 2 ADR's job. This paragraph exists to make the trajectory visible so Phase 1 code stays forward-compatible.


Cache: ElastiCache Redis

Production

yaml
Engine: Redis 7
Node type: cache.t4g.small (2 vCPU, 1.37 GB)
Replicas: 1 (Multi-AZ failover)
Encryption in transit: Enabled
Encryption at rest: Enabled
VPC: restartix-production, private subnet

Staging

yaml
Engine: Redis 7
Node type: cache.t4g.micro (2 vCPU, 0.5 GB)
Replicas: 0 (single node, single-AZ)
Encryption in transit: Enabled
Encryption at rest: Enabled

What it stores

  • Rate-limit counters — sliding-window counters keyed by principal + endpoint
  • Activity throttle — per-principal last_activity write debounce
  • Hold slots (P30) — appointment slot holds with TTL during form fill (F4.4)
  • Cache-aside reads (P45) — hot read paths via internal/core/cache.Aside

Failure mode

Redis is gracefully degradable. If the node fails:

  • Rate limiting stops working (all requests allowed) — operational alert, not user-visible
  • Activity throttle stops debouncing (humans.last_activity updates fire on every request) — slightly more DB writes, no user impact
  • Hold slots fail closed (slot reservation falls back to synchronous DB-backed reservation)
  • Cache-aside reads miss → DB hit, slower but correct

Production Multi-AZ ensures automatic failover within ~30 seconds. Staging accepts node restarts as routine.


Object storage: S3

Two buckets per environment, both encrypted (SSE-S3), versioning enabled, public access blocked at the account and bucket level.

Uploads bucket

restartix-uploads-{env} — patient files, signatures, exercise videos, generated PDFs, document attachments.

  • Org-scoped key prefixes ({org_id}/{surface}/...); cross-org keys rejected at the application layer
  • Signed URLs for upload (5-min TTL) and download (15-min TTL)
  • MIME validation and size caps applied before signing — see internal/integration/s3/
  • No lifecycle transitions (uploads are accessed throughout their lifetime)

Audit-archive bucket

restartix-audit-archive-{env} — long-term audit-log retention beyond the hot 12-month window in Postgres.

Lifecycle policy:

| Age | Storage class | Access pattern |
|---|---|---|
| 0–90 days | S3 Standard | Spot-check replays, occasional reads |
| 90–365 days | S3 Glacier Instant Retrieval | Audit replay on demand, 100ms retrieval |
| 365+ days | S3 Glacier Deep Archive | Compliance retention, ~12h restore |

At Glacier Deep Archive ($0.00099/GB-month), 6 years of audit logs cost effectively nothing. Without lifecycle transitions, Standard storage of multi-year audit data is ~25× more expensive.


Encryption keys: KMS

One customer-managed KMS key per environment, used as the Secrets Manager envelope key for the restartix/{env}/encryption secret that holds the column-encryption keyring. The same CMK also envelopes restartix/{env}/encryption.BACKUP_ENCRYPTION_KEY (the pg_dump envelope key for Layer 2 backups). Column-encryption keys (pii_regulated, auth_secret columns per data-classification.md) are loaded from that SM secret into Core API memory at startup; AES-256-GCM operations are local per row from there.

Envelope encryption keeps KMS API costs negligible. Each Core API task triggers exactly one kms:Decrypt call at startup (transparent inside SecretsManager.GetSecretValue); column operations after that don't touch KMS. CloudTrail logs the per-task SM/KMS access, which is the audit signal the customer-managed CMK exists to provide.

The internal/core/crypto/kmsKeyring stub is Phase 2 work (see Direct-KMS keyring + BYOK above) — Phase 1 uses the in-memory keyring loaded from the KMS-protected SM secret, not direct per-data-key KMS calls.

KMS key policy is restricted to the application IAM role (Fargate task role) for the kms:Decrypt action against the restartix/{env}/encryption SM secret context, and to the operations IAM role for Encrypt / Decrypt / GenerateDataKey (key rotation, manual decryption for incident response, future Phase 2 direct-KMS use).


Secrets management: Secrets Manager

Two classes of secrets sit in Secrets Manager. They look the same at provisioning time but behave differently at runtime, and the IaC + rotation runbooks need to treat them distinctly.

Canonical runtime secrets

Read on every process boot; Secrets Manager is the source of truth. Updating the value requires a task restart for the change to take effect.

restartix/{env}/database
  ├── DATABASE_URL              (pgbouncer DSN, app role)
  ├── DATABASE_APP_URL          (pgbouncer DSN, restricted app role for RLS)
  ├── DATABASE_DIRECT_URL       (RDS/Aurora cluster endpoint, owner role, used by migrations)
  └── DATABASE_PGBOUNCER_AUTH   (pgbouncer's own credential for auth_query)

restartix/{env}/redis
  └── REDIS_URL                 (rediss://... ElastiCache primary endpoint)

restartix/{env}/encryption
  ├── ENCRYPTION_KEYS           (versioned column-encryption keys — see below)
  └── BACKUP_ENCRYPTION_KEY     (pg_dump envelope key)

restartix/{env}/clerk
  └── CLERK_WEBHOOK_SECRET      (Svix signing secret for inbound Clerk webhook events)

restartix/{env}/cloudflare
  └── CLOUDFLARE_SAAS_API_TOKEN (Cloudflare for SaaS Custom Hostnames API token; scoped to the zone serving custom-domain registrations; consumed by `internal/integration/cloudflare-saas/`)

ENCRYPTION_KEYS is the crown jewel. It encrypts and decrypts every pii_regulated / auth_secret column and the platform_service_providers.credentials_encrypted column that holds every Cat A provider credential. Loss = every Cat A provider locked out and every encrypted PII column unreadable. Rotation procedure lives in the credential-rotation runbook; the IaC ensures the secret is KMS-encrypted under the customer-managed CMK and that only the Fargate task role + the operations role can read it.

Cat A provider bootstrap-seed secrets

Read once, on the first boot of a fresh environment, by bootstrapProviderDefaults in services/api/cmd/api/main.go. The function inserts one row per Cat A capability into platform_service_providers (ON CONFLICT DO NOTHING). Once the row exists the env values are no longer load-bearing — the resolver reads from the DB row on every call (cached per task with a 5-minute TTL), and operators rotate via Console superadmin endpoints (PATCH /v1/admin/platform-service-providers/...), not by updating Secrets Manager + redeploying.

Re-seeding from Secrets Manager is an emergency recovery path: delete the row (DELETE FROM platform_service_providers WHERE capability = $1 AND organization_id IS NULL) and restart the binary. Normal rotation never touches Secrets Manager.

Currently bootstrapped (1C.2 shipped 2026-05-06 — three Cat A capabilities live):

restartix/{env}/email-bootstrap          (capability: email — SES)
  ├── SES_FROM_ADDRESS                   (gates the bootstrap; empty = skip)
  ├── SES_CONFIGURATION_SET
  └── SES_ENDPOINT_URL                   (LocalStack only; empty in real envs)

restartix/{env}/storage-bootstrap        (capability: objectstore — S3)
  ├── AWS_BUCKET_NAME                    (gates the bootstrap; empty = skip)
  ├── AWS_S3_ENDPOINT_URL                (LocalStack only)
  └── AWS_S3_USE_PATH_STYLE              (LocalStack only)

restartix/{env}/clerk-bootstrap          (capability: auth — Clerk)
  └── CLERK_SECRET_KEY                   (gates the bootstrap; empty = skip)

Future Cat A capabilities add their own bootstrap secrets when their providers.Bootstrap calls land — listed here so the secret namespace is reserved up front:

(F9 telerehab — Cat A video)             restartix/{env}/daily-bootstrap
                                           └── DAILY_API_KEY

(AI agents — Cat A inference)            restartix/{env}/anthropic-bootstrap
                                           └── ANTHROPIC_API_KEY

AWS credentials for SES + S3. The bootstrap marshals AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY into the seed payload. In Fargate the task IAM role is the credential — those env values are empty and the SES / S3 factories fall back to the AWS SDK default credential chain, which picks up the task role. The bootstrap row's empty access-key fields are intentional, not a bug to fix.

Injection

Fargate task definitions reference secrets via secrets: blocks; the values are injected into the task environment at startup, not stored in the task definition itself. The IaC has two distinct rotation contracts to implement:

  • Runtime secret rotation — Secrets Manager update → ECS service restart (rolling task replacement) so tasks pick up the new value. No automatic in-task reload.
  • Cat A provider rotation — Console superadmin PATCH against platform_service_providers → Resolver.Invalidate clears the local cache → cross-fleet propagation completes within the 5-minute TTL. Secrets Manager is not updated; no redeploy.

Container registry: ECR

One repository per service:

  • restartix-core-api
  • restartix-telemetry-api
  • restartix-clinic
  • restartix-portal
  • restartix-console
  • restartix-pgbouncer (ECR-mirrored copy of edoburu/pgbouncer)

Lifecycle policy on every repo:

Rule 1: Delete untagged images older than 7 days
Rule 2: Keep only the last 20 tagged images

Without this, ECR storage at $0.10/GB-month grows unbounded. With four services × frequent deploys × 200–500 MB Next.js images, this matters.

ECR is shared between staging and production — the same image SHA promoted from staging to production is the same byte-identical artifact, which is the point.


Edge: Cloudflare

Cloudflare sits in front of the AWS ALB and handles four jobs.

CDN: static assets via Cloudflare

Cloudflare caches static assets at the edge. The cache rules:

| Path | Cloudflare cache | Why |
|---|---|---|
| /_next/static/* | Forever (1 year, immutable) | Next.js stamps content hashes into filenames; new build → new URLs, no cache bust needed |
| /_next/image/* | Short TTL (1 hour) by URL | Server-rendered image transforms; URL parameters define the variant |
| HTML routes (everything else) | Never cached | Per-tenant, per-user; cookies + auth state make every response unique |

Patient health data never reaches the CDN. HTML responses bypass Cloudflare's cache by default because Next.js sets cache-control: private, no-store. Static assets (JS bundles, CSS, fonts, icons) are public by design.

WAF + DDoS

Cloudflare's WAF runs on every request — managed rule sets (OWASP Top 10), bot fight mode, rate-limiting at the edge. WAF rules live in Cloudflare configuration, not Terraform's AWS provider.

We do not use AWS WAF — Cloudflare handles this at the edge before traffic reaches AWS. Duplicating the WAF layer would burn money for no incremental protection.

Cloudflare for SaaS: per-tenant custom domains

Clinics can register custom domains (e.g., physio-bucharest.ro) that route to their organization's app. Cloudflare for SaaS handles the per-tenant TLS that would otherwise require ACM cert juggling on the ALB.

The flow:

1. Clinic adds the custom domain in Console.
2. Console calls the Cloudflare for SaaS API to create the custom hostname.
3. Cloudflare returns the CNAME target (cname.restartix.pro) and the TXT verification token.
4. The clinic adds both records at their DNS provider.
5. Cloudflare verifies ownership, provisions a Let's Encrypt cert, and adds the hostname to the SaaS hostname pool.
6. Traffic to physio-bucharest.ro hits the Cloudflare edge → TLS terminated with the per-tenant cert → forwarded to our AWS ALB origin → proxy.ts reads the original Host header (`X-Forwarded-Host`) → resolves the org via /v1/public/organizations/resolve?domain=

What we build: a small Go integration with the Cloudflare for SaaS Custom Hostnames API (create, poll status, surface errors back to the admin UI). What Cloudflare handles: cert issuance, renewal, edge termination, fallback when a custom hostname's DNS misconfiguration is repaired.

The ALB origin holds one wildcard ACM cert for *.restartix.pro covering all platform subdomains ({slug}.clinic.restartix.pro, {slug}.portal.restartix.pro, console.restartix.pro) plus the Cloudflare-origin hostname. No per-clinic cert lifecycle on the ALB.

See decisions.md → Why Cloudflare for SaaS over rolling our own ACM-on-ALB cert flow.

DNS

Cloudflare hosts DNS for restartix.pro and serves all subdomain records. Route 53 is not used — adding it would split DNS authority across two providers for no benefit.

ACM certificate validation uses DNS validation via Cloudflare TXT records — one-time setup, then ACM auto-renews indefinitely.


Load balancing: ALB

One Application Load Balancer per environment, multi-AZ, behind Cloudflare.

  • Listeners: 443 (HTTPS, ACM wildcard cert for *.restartix.pro), 80 redirects to 443
  • Target groups: one per Fargate service, each with its own health check path (/health)
  • Routing rules: host-header-based — *.clinic.restartix.pro → clinic target group, *.portal.restartix.pro → portal target group, console.restartix.pro → console target group, custom-domain origin host → matching target group based on X-Forwarded-Host resolution downstream
  • Sticky sessions: not enabled (services are stateless; future SSE / WebSocket features may need ALB stickiness, decided per-feature)

Health check paths use the existing Go /health endpoint (process is alive) rather than a deep /ready (deps reachable) — deep checks turn a flaky DB into a rolling wave of unhealthy targets.


Observability: CloudWatch

Logs

One log group per service, retention configured at log group creation (CloudWatch's default is "Never expire" — easy to forget, expensive to leave):

Log group                            Retention
/ecs/restartix-{env}/core-api        90 days (production), 30 days (staging)
/ecs/restartix-{env}/telemetry-api   90 days (production), 30 days (staging) — provisioned when Layer 2 telemetry ships
/ecs/restartix-{env}/clinic          30 days
/ecs/restartix-{env}/portal          30 days
/ecs/restartix-{env}/console         30 days
/ecs/restartix-{env}/pgbouncer       14 days

Alarms and dashboard

Full alarm catalog and dashboard layout live in monitoring.md. The summary:

  • Application alarms — error rate, p99 latency, unhealthy target count, scheduled-task failure
  • Database alarms — high connection count, high CPU, low free storage, replica lag (Multi-AZ)
  • Infrastructure alarms — NAT Gateway error rate, ALB 5xx, Fargate task restart loops
  • Cost alarms — AWS Budgets alerts at 50%, 80%, 100% of monthly budget

SNS topic restartix-alerts-{env} fans alarms out to email + Slack via AWS Chatbot.


Email: SES

Production

  • Identity verification: production identity for the platform sender domain (e.g., [email protected]), SPF + DKIM + DMARC configured
  • Suppression list: account-level suppression auto-adds hard bounces and complaints
  • Sandbox exit: requested via AWS support ticket before launch (typically 24–48h turnaround)
  • Sending limits: production-grade quota negotiated based on expected volume
  • Bounce / complaint webhooks: SES → SNS → Core API endpoint that flips affected recipients into the platform suppression table

Staging

  • Sandbox or low-volume production identity, used only by automated tests and ad-hoc devbox flows
  • Same DKIM setup so tested flows are realistic

Wired through capability layer

SES is a Cat A curated provider per 1C.2. Credentials live in platform_service_providers, not in environment variables. The notification dispatcher resolves the SES provider on every send; rotation is a Console superadmin action, not a redeploy. See foundation 1A.18.


Telemetry sub-stack

Telemetry is a Layer 2 feature, not a Day-1 launch service. The architecture is locked: separate Go service (services/telemetry/) for ingest, the same RDS Postgres cluster as Core API for clinical aggregates, and S3 for replay blobs. No separate compliance Postgres, no ClickHouse, no TimescaleDB. See decisions.md → Why telemetry is PG + S3, not ClickHouse for the redesign rationale and /telemetry/index.md for the full design.

Components when Layer 2 ships

  • Telemetry API Fargate task — a separate ECS service alongside Core API in the same VPC and ALB. Sizing analogous to Core API at launch (single small task, horizontal scale at Tier 1+); auto-scaling by CPU. Multi-AZ in production, single-task on Spot in staging.
  • Postgres aggregates — same RDS cluster as Core API. Adds pose_session_metrics, pose_rep_metrics (monthly partitioned per P41), media_session_metrics, media_buffering_events (monthly partitioned). Plus updates to existing patient_exercise_logs. No new instance.
  • S3 bucket restartix-telemetry-{env} — replay blobs at {org_id}/{session_id}.bin.gz. Lifecycle: standard → IA at 90 days → Glacier at 1 year → expire at retention horizon. KMS-encrypted. Cross-region replication deferred to Phase 2 along with other S3 buckets.
  • Auth wiring — Cat F service-account principal for Telemetry → Core API callbacks (events.Bus publishing). Signed-session-token issuance helper in Core API; verifier in Telemetry API. HS256 secret in Secrets Manager, rotated per credential-rotation runbook.
  • Network paths — Patient Portal browser → Telemetry API (public HTTPS, signed token in header). Telemetry API → S3 (VPC endpoint). Telemetry API → Core API for events.Bus publish (private subnet). No public Core-API-to-Telemetry path; no Telemetry-to-Postgres direct path (Telemetry only writes via the events.Bus / Core API subscriber).
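The issuance/verification pair can be as small as HMAC-SHA256 over a payload. A hedged sketch of the shape (function names are illustrative; the real helpers would carry an expiry claim and key-rotation handling):

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"strings"
)

// signSessionToken mints the short-lived token Core API hands to the
// Patient Portal; Telemetry API verifies it with the same HS256 secret
// from Secrets Manager. payload would carry org/session IDs and an expiry.
func signSessionToken(secret, payload []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(payload)
	b64 := base64.RawURLEncoding
	return b64.EncodeToString(payload) + "." + b64.EncodeToString(mac.Sum(nil))
}

// verifySessionToken checks the signature and returns the payload.
func verifySessionToken(secret []byte, token string) ([]byte, bool) {
	parts := strings.SplitN(token, ".", 2)
	if len(parts) != 2 {
		return nil, false
	}
	b64 := base64.RawURLEncoding
	payload, err := b64.DecodeString(parts[0])
	if err != nil {
		return nil, false
	}
	sig, err := b64.DecodeString(parts[1])
	if err != nil {
		return nil, false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(payload)
	// hmac.Equal is a constant-time comparison.
	if !hmac.Equal(sig, mac.Sum(nil)) {
		return nil, false
	}
	return payload, true
}
```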

Sizing & cost (Layer 2)

Conservative launch profile (matches Tier 0 — up to ~1k peak concurrent):

Telemetry API Fargate (single task @ 0.5 vCPU / 1 GB)   ~$7  (staging Spot) / ~$40 (prod on-demand × 2 Multi-AZ)
S3 telemetry bucket                                      ~$1 (staging) / ~$5–15 (prod, scales with active sessions)

PG aggregate tables ride on the existing RDS instance — no incremental DB cost. No ClickHouse line item. No separate compliance Postgres line item.

Tier 3 (50k+ peak concurrent, if/when ClickHouse joins) would add a managed-CH cost line of ~$3–6k/year baseline plus per-volume charges. Not budgeted today; reachable via the swap-point interfaces (see /telemetry/index.md → Scaling roadmap).


Migrations and deploys

The contract: a git push origin master (after PR merge) builds Docker images, pushes to ECR, runs migrations as a one-shot Fargate task if they changed, then triggers ECS rolling deploys per service. Production requires a manual approval gate in GitHub Actions before the deploy fires.

Full pipeline mechanics — branch protection, OIDC federation, image tagging, migration handling, rollback — live in deployment.md.


Backups and DR

The full 3-2-1-1 architecture and runbooks live in backup-disaster-recovery.md. Summary by layer + when each closes:

  • Layer 1 — RDS PITR + automated snapshots. Production: continuous WAL → PITR to any second within the 7-day retention window, daily automated snapshots, manual snapshots before risky deploys. Staging (Aurora Serverless v2): 1-day retention, staging-grade. Closes with 1E.3.
  • Layer 2 — daily pg_dump to a separate S3 bucket (vendor-independent restore path; same pattern as legacy). Bucket has Object Lock COMPLIANCE mode + 7-year retention, lifecycle Standard → Glacier IA → Deep Archive, separate KMS context from RDS, separate IAM role. IaC + one end-to-end manual test passes at 1E.3 (substrate validation). Daily cadence in production is on at launch. Daily cadence in staging is a knob, off by default — enable for alarm tuning or production-launch dress rehearsal.
  • Layer 3 — cross-region replication of Layer 2 to a second EU region (eu-west-1 or eu-west-3). Closes after production launch (within first quarter), per the post-launch checklist in backup-disaster-recovery.md.
  • Layer 4 — quarterly offline archive. Optional; only if state audit explicitly requires it.

S3 cross-region replication for the uploads bucket and the audit-archive bucket is also deferred to Phase 2 — separate from the Layer 3 backup question.

RPO / RTO targets, encryption-key separation, and the four restore runbooks (PITR, daily-backup, cross-region, offline) live in backup-disaster-recovery.md.


HIPAA & GDPR posture

  • AWS BAA: accepted via AWS Artifact at the account level. Covers every AWS service in the stack.
  • EU residency: all primary data in eu-central-1. No replication outside the EU.
  • Sub-processors disclosed in the DPA:
    • AWS (US-incorporated, EU data center; covered by SCCs + AWS DPA)
    • Cloudflare (US-incorporated, EU data centers used; covered by SCCs + Cloudflare DPA)
    • Clerk (US-based; auth only, no patient health data; SCCs)
    • Daily.co (US-based; telerehab video; SCCs)
    • Anthropic (US-based; AI inference, no PHI passes through; SCCs)
  • Encryption obligations:
    • In transit: TLS 1.2+ everywhere, enforced (rds.force_ssl=1, ALB-only listener on 443, internal Fargate-to-Fargate uses cleartext on the private subnet — acceptable per AWS shared responsibility)
    • At rest: AWS-managed KMS for RDS, ElastiCache, S3, EBS, Secrets Manager
    • Column-level: AES-256-GCM via internal/core/crypto/ for pii_regulated and auth_secret columns per P12
  • Audit retention: 6+ years per CLAUDE.md, achieved via the S3 audit-archive bucket lifecycle (Standard → Glacier IA → Glacier Deep Archive)
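For illustration, the shape of an AES-256-GCM column helper — a sketch, not the actual internal/core/crypto/ API; here the random nonce is prepended to the ciphertext so decryption is self-contained:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"io"
)

// encryptColumn seals a plaintext column value with AES-256-GCM.
// key must be 32 bytes (AES-256); the nonce is prepended to the output.
func encryptColumn(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
		return nil, err
	}
	// Seal appends ciphertext+tag to the nonce slice.
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// decryptColumn splits off the nonce and opens the sealed value.
func decryptColumn(key, sealed []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("ciphertext too short")
	}
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}
```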

US-clinic / HIPAA-active path is deferred until the first US clinic signs. See the customer-managed KMS migration trigger above for the technical scope when that happens.


Cost: staging

ECS Fargate (Spot)
  Core API (1× 0.5 vCPU / 1 GB)             ~$7
  Clinic (1× 0.25 vCPU / 0.5 GB)            ~$4
  Portal (1× 0.25 vCPU / 0.5 GB)            ~$4
  Console (1× 0.25 vCPU / 0.5 GB)           ~$4
  pgbouncer (1× 0.25 vCPU / 0.5 GB, on-demand) ~$10
  Telemetry API (Layer 2; not on Day 1)      ~$7  (single Spot task when it ships)

Database
  Aurora Serverless v2 (0.5–2 ACU, scale-to-zero)  ~$15–30
  (Telemetry rides on the same DB — no separate instance)

Cache
  ElastiCache Redis cache.t4g.micro          ~$13

Networking
  ALB                                        ~$20
  NAT instance (t4g.nano)                    ~$3
  Data transfer                              ~$1

Storage
  S3 (uploads + archives, low volume)        ~$1
  S3 backup bucket (Layer 2; ~$0 staging,
    cron off by default)                     ~$0
  ECR (shared with prod)                     ~$1

Encryption / secrets
  KMS (1 CMK, SM envelope)                   ~$1
  Secrets Manager (~10 secrets)              ~$4

IaC state backend
  S3 native locking (no DynamoDB needed)     ~$0

Observability
  CloudWatch (30d retention, light alarms)   ~$2

Email
  SES (sandbox or minimal volume)            ~$0
                                             ──────
AWS staging subtotal                         ~$90 (Day 1) / ~$97 (with Telemetry Layer 2)

Cloudflare
  Free tier + Cloudflare for SaaS basic      ~$7
                                             ──────
STAGING TOTAL                                ~$97 (Day 1) / ~$104 (with Telemetry Layer 2)

The 1E.3 spec calls for "<$100/mo idle staging." Met on Day 1; the ~$7 Telemetry Layer 2 add keeps it within margin.

Cost: production day 1

ECS Fargate (on-demand, Multi-AZ)
  Core API (2× 1 vCPU / 2 GB)                ~$83
  Clinic (2× 0.5 vCPU / 1 GB)                ~$41
  Portal (2× 0.5 vCPU / 1 GB)                ~$41
  Console (1× 0.25 vCPU / 0.5 GB)            ~$10
  pgbouncer (2× 0.25 vCPU / 0.5 GB)          ~$21
  Telemetry API (Layer 2; 2× 0.5 vCPU/1 GB) ~$40

Database
  RDS db.t4g.medium Multi-AZ                 ~$124
  RDS storage (50 GB gp3 + backups)          ~$10
  (Telemetry aggregates on same RDS)

Cache
  ElastiCache Redis cache.t4g.small + replica  ~$51

Networking
  ALB                                        ~$25
  NAT Gateway (single AZ + processed GB)     ~$43
  Data transfer (origin → Cloudflare)        ~$10

Storage
  S3 uploads (~100 GB Standard)              ~$5
  S3 audit archives (lifecycled)             ~$1
  S3 backup bucket (Layer 2 daily pg_dump,
    ~5 GB at launch; ~$160/mo at 500 GB)     ~$2
  ECR (4 repos with lifecycle)               ~$2

Encryption / secrets
  KMS (1 CMK, SM envelope)                   ~$1
  Secrets Manager                            ~$4

IaC state backend
  S3 native locking (no DynamoDB needed)     ~$0

Observability
  CloudWatch (90d retention, alarms, dashboards)  ~$20

Email
  SES (~200k transactional emails)           ~$20
                                             ──────
AWS production subtotal                      ~$510 (Day 1) / ~$555 (with Telemetry Layer 2 + S3)

Cloudflare
  Pro plan ($20)                             ~$20
  Cloudflare for SaaS ($7 + ~80 hostnames @ $0.10)  ~$15
                                             ──────
PRODUCTION TOTAL                             ~$545 (Day 1) / ~$590 (with Telemetry Layer 2)

External services (Clerk, Daily.co, Anthropic) are NOT AWS costs and are listed separately in scaling.md → Cost summary.

Phase 2 estimate (10–50 clinics)

Phase 2 changes (per scaling.md):

  • RDS upgraded to db.r6g.large + 2 read replicas
  • Fargate fleets scaled out (more tasks per service, possibly larger task sizes)
  • ElastiCache moved to a larger node class (Day 1 production already runs cache.t4g.small + replica)
  • HA NAT (one Gateway per AZ)
  • More Cloudflare for SaaS hostnames

Approximate AWS+Cloudflare total: ~$1,300–1,500/mo (telemetry included at Tier 1 sizing). Full breakdown lives in scaling.md.


What we don't use and why

  • AWS App Runner: Cannot host scheduled tasks, init containers, or TCP services (pgbouncer). Mixing it with Fargate doubles the operational surface. Fargate everywhere is simpler.
  • RDS Proxy: Pins client-to-backend connections on prepared statements; pgx uses prepared statements by default. pgbouncer transaction-mode is the working alternative.
  • Route 53: Cloudflare is already the DNS authority for restartix.pro. Splitting DNS across two providers adds operational noise without benefit.
  • AWS WAF: Cloudflare's WAF runs at the edge, before traffic reaches AWS. Duplicating the WAF layer wastes money.
  • AWS Amplify Hosting: Designed for static sites; the Next.js apps are server-rendered and need the full container shape that Fargate provides.
  • AWS Certificate Manager for tenant custom domains: Per-clinic cert lifecycle on the ALB is real engineering work. Cloudflare for SaaS handles this as a configuration line.
  • Multi-region: Out of scope per CLAUDE.md → Project Overview and scaling.md → Beyond Phase 2.
  • Per-tenant dedicated infrastructure: Permanently out of scope. The dedicated tenancy mode (F13) is logical isolation (per-tenant Clerk org + RLS + provider_org_id + branded domains), not separate RDS / KMS / compute.
  • EC2 instances (other than the staging NAT instance): Fargate covers every long-running compute need; EC2 management is operational debt we don't need.
  • EKS / Kubernetes: Same workload runs on Fargate with a fraction of the operational complexity. K8s would be appropriate at a scale we don't reach.