
AWS Infrastructure Strategy

RestartiX is moving from Railway to AWS. This document covers everything — from why we're making the move, to exactly what services we need, how to set them up, how they scale across all four growth phases, and what it costs.


Why we're moving to AWS

Railway served us well for early development, but production healthcare SaaS requires guarantees that Railway cannot provide:

| Concern | Railway + Neon | AWS |
| --- | --- | --- |
| Uptime SLA | None published (Railway), varies (Neon) | 99.99% (App Runner, RDS, S3) |
| HIPAA BAA | Not available (Railway), $500+/mo (Neon Enterprise) | Available for free (just sign the agreement) |
| Incident transparency | Limited | Full public status page + personal health dashboard |
| Production reliability | Inconsistent, with frequent Railway issues | Battle-tested by hospitals, banks, governments |
| Data residency | US only (limited regions) | 30+ global regions, including EU for GDPR |
| Database isolation | Neon is shared multi-tenant compute | RDS is a dedicated instance: your machine, your resources |
| Scales to Phase 4 | Requires migration at Phase 2-3 | Same provider from Phase 1 through Phase 4 |

Bottom line: We were going to end up on AWS anyway (database, backups, enterprise tier). Moving everything now means zero provider migrations in the future, and we get production-grade reliability, HIPAA BAA, and private networking from day one.


What changes and what doesn't

| What changes | What doesn't change |
| --- | --- |
| Core API hosting: Railway → AWS App Runner | Application code (zero changes) |
| Telemetry API hosting: Railway → AWS App Runner | API contracts and endpoints |
| Database: Neon → AWS RDS PostgreSQL | Database schema, RLS policies, queries |
| Redis: Railway plugin → ElastiCache | Cloudflare (CDN, WAF, DDoS) |
| Deploys: Railway CLI → GitHub Actions + ECR | Clerk (authentication) |
| Secrets: Railway env vars → AWS Secrets Manager | Daily.co (video calls) |
| Monitoring: Railway dashboard → CloudWatch | S3 (already on AWS) |
| | How clinics experience the product |

AWS services map

Every AWS service we use and why. Nothing more — we don't use services we don't need.

Phase 1 (now)

| Purpose | AWS Service | Why this one |
| --- | --- | --- |
| Run Core API | App Runner | Railway-like simplicity. Push container, it runs. Auto-scales. |
| Run Telemetry API | App Runner | Same. Separate service, independent scaling. |
| Container registry | ECR (Elastic Container Registry) | Store Docker images. App Runner pulls from here. |
| Database | RDS PostgreSQL | Dedicated instance, private networking, HIPAA BAA included, automated backups. |
| Redis | ElastiCache Redis | VPC-private, encrypted, managed. |
| Private networking | VPC + Subnets + Security Groups | Database and Redis never exposed to the internet. |
| Outbound traffic | NAT Gateway | Lets App Runner reach external services (Clerk, Daily.co) through the VPC. |
| Secrets | Secrets Manager | Store DATABASE_URL, API keys, encryption keys. Rotatable, auditable. |
| File storage | S3 (already using) | No change. |
| Backups | RDS automated + S3 | RDS handles continuous backup. pg_dump to S3 for vendor independence. |
| Monitoring | CloudWatch | Logs, metrics, alarms. Comes free with App Runner. |
| CI/CD | GitHub Actions | Build → push to ECR → App Runner auto-deploys. |
| DNS | Cloudflare (not AWS) | Already using. Stays. No need for Route 53. |
| SSL/TLS | App Runner (auto) + Cloudflare | Both handle SSL. Zero config. |

Added in later phases

| Purpose | AWS Service | When |
| --- | --- | --- |
| Read replicas | RDS Read Replicas | Phase 2 (when read/write split needed) |
| Enterprise isolation | Multiple App Runner + RDS per tenant | Phase 3 |
| Multi-region | App Runner + RDS in eu-central-1, us-east-1 | Phase 4 |
| Global routing | DynamoDB (routing table) | Phase 4 |

VPC explained (it's simpler than it sounds)

A VPC sounds scary but it's really just one idea: things that should talk to each other are in the same private room, and things that shouldn't can't get in.

Think of it like an office building:

┌─────────────────────────────────────────────────────────────┐
│  YOUR VPC (the building)                                     │
│                                                              │
│  ┌──────────────────────┐    ┌──────────────────────┐       │
│  │  Room A               │    │  Room B               │       │
│  │  (Private Subnet)     │    │  (Private Subnet)     │       │
│  │                       │    │                       │       │
│  │  PostgreSQL database  │    │  PostgreSQL standby   │       │
│  │  Redis cache          │    │  (automatic failover) │       │
│  │                       │    │                       │       │
│  └──────────────────────┘    └──────────────────────┘       │
│           ▲                           ▲                      │
│           │                           │                      │
│  ┌────────┴───────────────────────────┴──────────────┐      │
│  │  VPC Connector (the hallway)                       │      │
│  │  Only App Runner services have the key             │      │
│  └────────────────────────────────────────────────────┘      │
│           ▲                                                  │
│           │                                                  │
│  ┌────────┴──────────────────────────────────────────┐      │
│  │  NAT Gateway (the front door, outbound only)       │      │
│  │  Lets your apps call Clerk, Daily.co, etc.         │      │
│  │  Nobody outside can come in through it             │      │
│  └────────────────────────────────────────────────────┘      │
└─────────────────────────────────────────────────────────────┘


┌──────────┴──────────────────────────────────────────┐
│  App Runner (your Go APIs)                           │
│  Lives outside the building but has a VPC Connector  │
│  — a private tunnel into the rooms                   │
└─────────────────────────────────────────────────────┘

What you actually create (one time, with my help):

| Thing | What it is | Analogy |
| --- | --- | --- |
| VPC | A private network | The building |
| 2 Private Subnets | Sections of the network in different data centers | Two rooms on different floors (redundancy) |
| Security Group (RDS) | Firewall rule: "only App Runner can connect on port 5432" | A door lock that only your key opens |
| Security Group (Redis) | Same but for port 6379 | Same concept, different door |
| VPC Connector | A managed tunnel from App Runner into your VPC | The private hallway |
| NAT Gateway | Outbound internet access for the VPC | The front door (exit only) |

That's it. Six things, created once, never touched again. AWS manages them after creation. No servers to patch, no firewalls to configure manually, no networking knowledge needed.

Setup: creating the VPC (step by step)

The easiest path is the AWS Console wizard, which creates most of this in one click:

AWS Console → VPC → Create VPC

Choose: "VPC and more" (the wizard)

Settings:
  Name: restartix-prod
  IPv4 CIDR: 10.0.0.0/16    (just use this default)
  Number of AZs: 2           (minimum for RDS Multi-AZ)
  Public subnets: 2          (for NAT Gateway)
  Private subnets: 2         (for RDS + Redis)
  NAT Gateways: 1            (in 1 AZ — saves cost, sufficient for Phase 1)
  VPC Endpoints: S3           (free, faster S3 access)

Click "Create VPC"
→ Done. The wizard creates everything: subnets, route tables, NAT gateway, internet gateway.
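
The wizard picks the subnet address ranges for you, so you never have to compute them. To make the layout concrete, here is a minimal Python sketch that carves 10.0.0.0/16 into equal blocks with the standard `ipaddress` module. The /20 block size and the mapping of blocks to public and private subnets are assumptions for illustration only; the wizard's actual allocation may differ.

```python
import ipaddress

VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")

def carve_subnets(vpc, new_prefix, count):
    """Return the first `count` equally sized subnets of `vpc`."""
    subnets = list(vpc.subnets(new_prefix=new_prefix))
    if count > len(subnets):
        raise ValueError("VPC is too small for that many subnets")
    return subnets[:count]

# Assumed layout: 2 public subnets (NAT Gateway), then 2 private subnets (RDS + Redis).
blocks = carve_subnets(VPC_CIDR, new_prefix=20, count=4)
layout = {
    "public-a": blocks[0],
    "public-b": blocks[1],
    "private-a": blocks[2],
    "private-b": blocks[3],
}

for name, net in layout.items():
    print(f"{name}: {net} ({net.num_addresses} addresses)")
```

Each /20 block holds 4,096 addresses, far more than the handful of resources Phase 1 places in each subnet.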

Then create the security groups:

AWS Console → VPC → Security Groups → Create

Security Group 1: restartix-rds
  VPC: restartix-prod
  Inbound rule:
    Type: PostgreSQL (port 5432)
    Source: (the App Runner VPC Connector security group)
  That's the only rule. Nothing else can reach the database.

Security Group 2: restartix-redis
  VPC: restartix-prod
  Inbound rule:
    Type: Custom TCP (port 6379)
    Source: (the App Runner VPC Connector security group)
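
Both rules follow the same pattern: allow one port, and only when the traffic comes from the connector's security group. The Python sketch below is a toy model of that deny-by-default logic, not an AWS API call. The group names are the ones used in this guide; the `InboundRule` class and `is_allowed` function are hypothetical helpers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InboundRule:
    port: int
    source_group: str  # name of the security group allowed to connect

# Mirrors the two security groups described above.
SECURITY_GROUPS = {
    "restartix-rds": [InboundRule(port=5432, source_group="restartix-apprunner-connector")],
    "restartix-redis": [InboundRule(port=6379, source_group="restartix-apprunner-connector")],
}

def is_allowed(target_group, port, caller_group):
    """Deny by default: traffic passes only if some inbound rule matches port and source."""
    return any(
        rule.port == port and rule.source_group == caller_group
        for rule in SECURITY_GROUPS.get(target_group, [])
    )

print(is_allowed("restartix-rds", 5432, "restartix-apprunner-connector"))    # True
print(is_allowed("restartix-rds", 5432, "some-other-group"))                 # False
print(is_allowed("restartix-redis", 5432, "restartix-apprunner-connector"))  # False
```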

Then create the VPC Connector for App Runner:

AWS Console → App Runner → VPC Connectors → Create

Name: restartix-vpc-connector
VPC: restartix-prod
Subnets: Select both private subnets
Security Group: restartix-apprunner-connector
  (Create this group before restartix-rds and restartix-redis, because their inbound rules use it as the source.)
  (It needs outbound access to ports 5432, 6379, and 443.)

→ Done. Now attach this connector to your App Runner services.

After this one-time setup, you never touch the VPC again. It just sits there, keeping your database private.


Phase 1 architecture (current stage)

                                   ┌────────────────────────────────────┐
  Patients ─────► Cloudflare ─────►│  AWS App Runner                    │
  Specialists     (CDN, WAF,      │                                    │
  Admins          DDoS, SSL)      │  ┌──────────────────┐             │
                       │          │  │ Core API          │             │
                       │          │  │ (Go, auto-scale   │──┐         │
                       │          │  │  1-5 instances)    │  │         │
                       │          │  └──────────────────┘  │         │
                       │          │                         │ VPC     │
                       │          │  ┌──────────────────┐  │Connector│
                       └─────────►│  │ Telemetry API     │  │  │      │
                                  │  │ (Go, auto-scale   │──┘  │      │
                                  │  │  1-3 instances)    │     │      │
                                  │  └──────────────────┘     │      │
                                  └────────────────────────────┼──────┘

                                  ┌────────────────────────────▼──────┐
                                  │  AWS VPC (private network)         │
                                  │                                    │
                                  │  ┌────────────────┐               │
                                  │  │ RDS PostgreSQL  │               │
                                  │  │ db.t4g.medium   │               │
                                  │  │ (Multi-AZ)      │               │
                                  │  └────────────────┘               │
                                  │                                    │
                                  │  ┌────────────────┐               │
                                  │  │ ElastiCache     │               │
                                  │  │ Redis (0.5 GB)  │               │
                                  │  └────────────────┘               │
                                  └────────────────────────────────────┘

  Also on AWS:
  ├── S3: restartix-uploads-prod (patient files)
  ├── S3: restartix-backups-primary (database backups)
  ├── S3: restartix-backups-replica (cross-region backups)
  └── ECR: container images (Core API + Telemetry)

  CI/CD:
  └── GitHub Actions → build Docker → push to ECR → App Runner auto-deploys

RDS PostgreSQL setup (Phase 1)

A small but dedicated database instance. More than enough for 1-10 clinics and 100k patients.

Instance: db.t4g.medium (2 vCPU, 4 GB RAM)
Engine: PostgreSQL 17
Storage: 50 GB gp3 (3,000 IOPS baseline, auto-expand enabled)
Multi-AZ: Enabled (automatic failover to standby in another data center)
Encryption at rest: Enabled (AES-256, AWS-managed key)
Encryption in transit: Enabled (TLS required)
Backup retention: 7 days (automated, continuous)
PITR: Enabled (restore to any second in the last 7 days)
Public access: Disabled (VPC-private only)

Parameter Group (custom):
  max_connections: 200
  shared_buffers: 1GB
  effective_cache_size: 3GB
  work_mem: 32MB
  maintenance_work_mem: 256MB

Monitoring:
  Enhanced Monitoring: Enabled (1-minute granularity)
  Performance Insights: Enabled (free tier, 7-day retention)

Cost:
  Instance (db.t4g.medium):    ~$55/month
  Storage (50 GB gp3):         ~$8/month
  Backups:                     ~$5/month
  Total RDS:                   ~$68/month

Why db.t4g.medium for Phase 1:

  • 2 vCPU, 4 GB RAM is more than enough for 50-100 concurrent connections
  • Burstable — uses CPU credits during quiet periods, bursts for peak load
  • Multi-AZ gives automatic failover even on the smallest instance
  • Can resize to db.t4g.large (8 GB) or db.r6g.large (16 GB) later with minimal downtime

Connection math (Phase 1):

Core API: 3 instances × 20 pool = 60 connections
Telemetry API: 2 instances × 15 pool = 30 connections
Background jobs + monitoring: 10 connections
Total: ~100 connections
max_connections: 200 (50% headroom)
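
The same arithmetic can be checked in a few lines of Python. This sketch uses the pool sizes from the math above and adds a worst case that assumes both services scale to the auto-scaling maximums used elsewhere in this guide (5 and 3 instances). The function names are hypothetical.

```python
MAX_CONNECTIONS = 200
BACKGROUND_CONNECTIONS = 10  # background jobs + monitoring

# service -> (instances, pool size per instance)
TYPICAL = {"core-api": (3, 20), "telemetry-api": (2, 15)}
# Worst case: every service at its auto-scaling maximum.
WORST_CASE = {"core-api": (5, 20), "telemetry-api": (3, 15)}

def total_connections(services, background=BACKGROUND_CONNECTIONS):
    """Sum of all application pools plus the fixed background allowance."""
    return sum(instances * pool for instances, pool in services.values()) + background

def headroom(used, limit=MAX_CONNECTIONS):
    """Fraction of max_connections still free."""
    return (limit - used) / limit

typical = total_connections(TYPICAL)
worst = total_connections(WORST_CASE)
print(typical, f"{headroom(typical):.0%}")  # 100 50%
print(worst, f"{headroom(worst):.1%}")      # 155 22.5%
```

Even at full scale-out the pools stay under max_connections, as long as the per-instance pool sizes match the ones assumed here.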

ElastiCache Redis setup (Phase 1)

Instance: cache.t4g.micro (2 vCPU, 0.5 GB)
Engine: Redis 7
Multi-AZ: No (Redis data is ephemeral — booking holds, rate limits, idempotency keys)
Encryption in transit: Enabled
Encryption at rest: Enabled
VPC: restartix-prod (same as RDS)
Security Group: restartix-redis

Cost: ~$12/month

App Runner service configuration

Core API:

Service: restartix-core-api
Source: ECR image (auto-deploy on new image push)

Instance:
  CPU: 1 vCPU
  Memory: 2 GB

Auto-scaling:
  Min instances: 1
  Max instances: 5
  Max concurrency: 100    # requests per instance before scaling up
  Max request timeout: 30s

Health check:
  Path: /health
  Protocol: HTTP
  Interval: 10s
  Timeout: 5s
  Healthy threshold: 1
  Unhealthy threshold: 5

Networking:
  VPC Connector: restartix-vpc-connector

Environment variables:
  DATABASE_URL: (from Secrets Manager — RDS private endpoint)
  REDIS_URL: (from Secrets Manager — ElastiCache private endpoint)
  CLERK_SECRET_KEY: (from Secrets Manager)
  CLERK_WEBHOOK_SECRET: (from Secrets Manager)
  S3_BUCKET: restartix-uploads-prod
  S3_REGION: eu-central-1
  DAILY_API_KEY: (from Secrets Manager)
  ENCRYPTION_KEY: (from Secrets Manager)
  APP_ENV: production
  LOG_LEVEL: info
  DB_POOL_MAX: 20    # matches the connection math (20 connections per instance)
  PORT: 9000

Telemetry API:

Service: restartix-telemetry-api
Source: ECR image (auto-deploy on new image push)

Instance:
  CPU: 0.5 vCPU
  Memory: 1 GB

Auto-scaling:
  Min instances: 1
  Max instances: 3
  Max concurrency: 200    # telemetry events are lightweight
  Max request timeout: 10s

Health check:
  Path: /health
  Protocol: HTTP
  Interval: 10s
  Timeout: 5s

Networking:
  VPC Connector: restartix-vpc-connector

Environment variables:
  DATABASE_URL: (from Secrets Manager — RDS private endpoint)
  CLICKHOUSE_URL: (from Secrets Manager)
  APP_ENV: production
  LOG_LEVEL: info
  PORT: 4000
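
To get a feel for how the max concurrency setting translates into instance counts, here is a rough Python estimate. It assumes one instance per `max_concurrency` in-flight requests, clamped to the configured minimum and maximum. App Runner's real scaling behavior is more nuanced, so treat this as a back-of-the-envelope model, not a description of the service's algorithm.

```python
import math

def estimated_instances(concurrent_requests, max_concurrency, min_instances, max_instances):
    """Rough estimate: one instance per `max_concurrency` in-flight requests, clamped."""
    needed = math.ceil(concurrent_requests / max_concurrency)
    return max(min_instances, min(needed, max_instances))

# Values taken from the two service configurations above.
CORE_API = dict(max_concurrency=100, min_instances=1, max_instances=5)
TELEMETRY_API = dict(max_concurrency=200, min_instances=1, max_instances=3)

print(estimated_instances(0, **CORE_API))         # 1 (never below the minimum)
print(estimated_instances(250, **CORE_API))       # 3
print(estimated_instances(900, **CORE_API))       # 5 (capped at the maximum)
print(estimated_instances(450, **TELEMETRY_API))  # 3
```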

CI/CD pipeline: GitHub Actions

After this is set up, deploying is git push to main — identical to the Railway workflow.

# .github/workflows/deploy.yml

name: Build and Deploy to AWS

on:
  push:
    branches: [main]

env:
  AWS_REGION: eu-central-1
  ECR_REGISTRY: <account-id>.dkr.ecr.eu-central-1.amazonaws.com

jobs:
  deploy-api:
    name: Deploy Core API
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Login to ECR
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build and push Core API image
        run: |
          docker build -t restartix-core-api -f Dockerfile.api .
          docker tag restartix-core-api:latest $ECR_REGISTRY/restartix-core-api:latest
          docker tag restartix-core-api:latest $ECR_REGISTRY/restartix-core-api:${{ github.sha }}
          docker push $ECR_REGISTRY/restartix-core-api:latest
          docker push $ECR_REGISTRY/restartix-core-api:${{ github.sha }}

      # App Runner auto-deploys when a new image is pushed to ECR.
      # No additional step needed.

  deploy-telemetry:
    name: Deploy Telemetry API
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Login to ECR
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build and push Telemetry API image
        run: |
          docker build -t restartix-telemetry-api -f Dockerfile.telemetry .
          docker tag restartix-telemetry-api:latest $ECR_REGISTRY/restartix-telemetry-api:latest
          docker tag restartix-telemetry-api:latest $ECR_REGISTRY/restartix-telemetry-api:${{ github.sha }}
          docker push $ECR_REGISTRY/restartix-telemetry-api:latest
          docker push $ECR_REGISTRY/restartix-telemetry-api:${{ github.sha }}

What this does:

  1. You push code to main
  2. GitHub Actions builds Docker images for both services
  3. Images are pushed to ECR (AWS container registry)
  4. App Runner detects the new image and auto-deploys with zero downtime
  5. Old instances drain, new instances start — no manual intervention
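
Each build is pushed under two tags: `latest`, which App Runner watches, and the commit SHA. The small Python sketch below shows how the two image references are composed. The account ID and SHA are placeholders, and the idea that the SHA tag is useful for pinning or rolling back to a known build is my reading of the workflow, not something it states.

```python
def image_refs(registry, repository, commit_sha):
    """Return the two fully qualified image references pushed for one build."""
    base = f"{registry}/{repository}"
    return [f"{base}:latest", f"{base}:{commit_sha}"]

# Placeholder account ID and commit SHA, for illustration only.
registry = "123456789012.dkr.ecr.eu-central-1.amazonaws.com"
for ref in image_refs(registry, "restartix-core-api", "0a1b2c3"):
    print(ref)
```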

Secrets management

All sensitive values move from Railway environment variables to AWS Secrets Manager.

Secret: restartix/production/core-api
Values:
  DATABASE_URL: postgres://restartix:<password>@restartix-prod.xxx.rds.amazonaws.com:5432/restartix?sslmode=require
  REDIS_URL: rediss://restartix-redis.xxx.euc1.cache.amazonaws.com:6379
  CLERK_SECRET_KEY: sk_live_xxx
  CLERK_WEBHOOK_SECRET: whsec_xxx
  DAILY_API_KEY: xxx
  ENCRYPTION_KEY: xxx
  BACKUP_ENCRYPTION_KEY: xxx

Secret: restartix/production/telemetry-api
Values:
  DATABASE_URL: postgres://restartix:<password>@restartix-prod.xxx.rds.amazonaws.com:5432/restartix?sslmode=require
  CLICKHOUSE_URL: https://xxx.clickhouse.cloud:8443

Cost: $0.40/secret/month + $0.05 per 10,000 API calls
      Total: ~$1-2/month

Why Secrets Manager instead of plain environment variables:

  • Secrets are encrypted at rest and in transit
  • Auditable — every access is logged in CloudTrail
  • Rotatable (values change in one place; App Runner picks up the new value on its next deployment)
  • HIPAA-eligible service, covered by the AWS BAA
  • One place to manage all secrets across services
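
Before you save a connection string as a secret, it is worth checking that it actually enforces TLS. Here is a minimal Python sketch using only the standard library. The function names and the example URLs are hypothetical; the two checks (sslmode=require for PostgreSQL, the rediss:// scheme for Redis) come from the secret values shown above.

```python
from urllib.parse import urlsplit, parse_qs

def check_database_url(url):
    """Return a list of problems; an empty list means the URL looks safe to store."""
    parts = urlsplit(url)
    problems = []
    if parts.scheme not in ("postgres", "postgresql"):
        problems.append(f"unexpected scheme: {parts.scheme!r}")
    if parse_qs(parts.query).get("sslmode") != ["require"]:
        problems.append("sslmode=require is missing")
    if not parts.password:
        problems.append("no password in URL")
    return problems

def check_redis_url(url):
    """In-transit encryption is enabled, so the TLS scheme rediss:// is expected."""
    return [] if urlsplit(url).scheme == "rediss" else ["expected rediss:// (TLS)"]

# Hypothetical values in the same shape as the secrets above.
print(check_database_url("postgres://app:s3cret@db.example.internal:5432/restartix?sslmode=require"))
print(check_database_url("postgres://app:s3cret@db.example.internal:5432/restartix"))
print(check_redis_url("redis://cache.example.internal:6379"))
```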

Custom domains and SSL

Setup:

  1. App Runner provides a default URL: https://xxx.eu-central-1.awsapprunner.com
  2. Add your custom domain in App Runner console (e.g., api.restartix.com)
  3. App Runner gives you a CNAME record
  4. Add the CNAME in Cloudflare DNS
  5. SSL is handled automatically by both Cloudflare and App Runner

Cloudflare configuration stays the same:

  • DDoS protection
  • WAF + OWASP rulesets
  • Edge rate limiting
  • TLS termination
  • No changes needed

Monitoring and observability

App Runner sends logs and metrics to CloudWatch automatically. No setup needed.

Logs:

CloudWatch Log Groups:
  /aws/apprunner/restartix-core-api/<service-id>/application      → Application logs (slog JSON)
  /aws/apprunner/restartix-core-api/<service-id>/service          → App Runner system logs
  /aws/apprunner/restartix-telemetry-api/<service-id>/application → Telemetry logs
  /aws/apprunner/restartix-telemetry-api/<service-id>/service     → Telemetry system logs

Key alarms to set up:

Alarm: Core API High Error Rate
  Metric: 5xx count / total requests
  Threshold: > 1% for 5 minutes
  Action: SNS notification → email/Slack

Alarm: Core API High Latency
  Metric: p99 response time
  Threshold: > 1 second for 5 minutes
  Action: SNS notification → email/Slack

Alarm: Core API Unhealthy
  Metric: Health check failures
  Threshold: > 3 consecutive failures
  Action: SNS notification → email/Slack (critical)

Alarm: RDS High Connection Count
  Metric: DatabaseConnections
  Threshold: > 160 (80% of max_connections)
  Action: SNS notification → email/Slack

Alarm: RDS High CPU
  Metric: CPUUtilization
  Threshold: > 80% for 10 minutes
  Action: SNS notification → email

Alarm: RDS Low Free Storage
  Metric: FreeStorageSpace
  Threshold: < 10 GB
  Action: SNS notification → email

Alarm: Monthly Spend Approaching Budget
  Metric: AWS estimated charges
  Threshold: > $200/month (adjust as needed)
  Action: SNS notification → email
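
CloudWatch evaluates these alarms for you. The Python sketch below only illustrates what "above the threshold for 5 minutes" means in practice. It assumes one datapoint per minute, and both function names are hypothetical.

```python
def error_rate_alarm(datapoints, threshold=0.01, periods=5):
    """datapoints: list of (count_5xx, total_requests), one per minute, oldest first.

    Returns True only when each of the last `periods` minutes exceeds `threshold`.
    """
    if len(datapoints) < periods:
        return False
    recent = datapoints[-periods:]
    return all(total > 0 and errors / total > threshold for errors, total in recent)

def connection_alarm(current, max_connections=200, fraction=0.8):
    """Fires above 80% of max_connections (160 with the Phase 1 settings)."""
    return current > max_connections * fraction

quiet = [(0, 1000)] * 5
spike = [(0, 1000)] * 4 + [(50, 1000)]  # one bad minute is not enough
outage = [(30, 1000)] * 5               # 3% errors for 5 minutes
print(error_rate_alarm(quiet), error_rate_alarm(spike), error_rate_alarm(outage))  # False False True
print(connection_alarm(160), connection_alarm(161))                                # False True
```

Requiring several consecutive bad periods keeps a single slow minute from paging you, at the cost of a few minutes' delay before a real outage triggers the alarm.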

CloudWatch dashboard (create one):

  • Request count per minute (both services)
  • Error rate (5xx / total)
  • Response time (p50, p95, p99)
  • Active instances count
  • CPU and memory utilization per instance
  • RDS: connections, CPU, free storage, read/write IOPS
  • ElastiCache: memory usage, connections, cache hits/misses

Phase 1 cost estimate

App Runner:
  Core API (1 vCPU, 2 GB, 1-5 instances):    ~$25-40/month
  Telemetry API (0.5 vCPU, 1 GB, 1-3 inst):  ~$10-20/month

RDS PostgreSQL:
  db.t4g.medium (Multi-AZ):                   ~$55/month
  Storage (50 GB gp3):                         ~$8/month
  Automated backups:                           ~$5/month

ElastiCache Redis:
  cache.t4g.micro:                             ~$12/month

Networking:
  NAT Gateway:                                 ~$35/month
  Data transfer:                               ~$5/month

Other:
  Secrets Manager:                             ~$2/month
  ECR:                                         ~$1/month
  CloudWatch:                                  ~$5/month
                                               ──────────
AWS Total:                                     ~$163-188/month

External services (unchanged):
  S3 uploads:                                  ~$10-50/month
  Cloudflare:                                  Free
  Clerk:                                       $200/month
  Daily.co:                                    ~$100/month
                                               ──────────
GRAND TOTAL:                                   ~$473-538/month

Compared to Railway + Neon (~$475-565/month), this is the same cost or cheaper — and you get:

  • 99.99% SLA on everything
  • HIPAA BAA on everything (for free)
  • Database not exposed to the internet
  • Automated failover (Multi-AZ)
  • 7-day continuous PITR with 5-minute RPO
  • No plan-imposed connection cap (max_connections is ours to tune)
  • Dedicated database resources
  • No vendor migration later
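
The Phase 1 totals can be reproduced by summing the line items. This Python sketch copies the (low, high) monthly figures from the estimate above; the dictionary keys are just labels.

```python
# (low, high) monthly estimates in USD, copied from the Phase 1 estimate.
AWS_ITEMS = {
    "App Runner Core API": (25, 40),
    "App Runner Telemetry API": (10, 20),
    "RDS instance": (55, 55),
    "RDS storage": (8, 8),
    "RDS backups": (5, 5),
    "ElastiCache": (12, 12),
    "NAT Gateway": (35, 35),
    "Data transfer": (5, 5),
    "Secrets Manager": (2, 2),
    "ECR": (1, 1),
    "CloudWatch": (5, 5),
}
EXTERNAL_ITEMS = {
    "S3 uploads": (10, 50),
    "Cloudflare": (0, 0),
    "Clerk": (200, 200),
    "Daily.co": (100, 100),
}

def total(items):
    """Return (sum of low estimates, sum of high estimates)."""
    return (
        sum(low for low, _ in items.values()),
        sum(high for _, high in items.values()),
    )

aws = total(AWS_ITEMS)
external = total(EXTERNAL_ITEMS)
grand = (aws[0] + external[0], aws[1] + external[1])
print(aws, external, grand)  # (163, 188) (310, 350) (473, 538)
```

Keeping the estimate in this form makes it easy to re-run the totals when a single line item changes.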

Phase 2 architecture (months 12-24)

This phase starts when a read/write split is needed to handle more concurrent connections.

What changes from Phase 1

Phase 1:  App Runner → VPC → RDS (single primary, Multi-AZ)
Phase 2:  App Runner → VPC → RDS Primary (writes) + 2 Read Replicas (reads)

New infrastructure

The VPC stays the same. You just add read replicas.

RDS Primary (existing, upgraded):
  Instance: db.r6g.large (2 vCPU, 16 GB RAM)   # upgrade from t4g.medium
  Storage: 250 GB gp3

RDS Read Replica 1:
  Instance: db.r6g.large
  Same AZ as subnet B

RDS Read Replica 2:
  Instance: db.r6g.large
  Same AZ as subnet A (or add subnet C)

ElastiCache Redis (upgraded):
  Instance: cache.t4g.small (2 vCPU, 1.37 GB)

Connection routing

Mutations (POST/PUT/PATCH/DELETE) go to the primary. Reads (GET) go round-robin across replicas. This is handled in application middleware — no infrastructure change needed.
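
To show the shape of that middleware, here is a minimal sketch. It is written in Python for brevity, although the APIs themselves are Go. The `ConnectionRouter` class and the target names are hypothetical, and a real implementation would hand back connection pools rather than strings.

```python
import itertools

WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}

class ConnectionRouter:
    """Picks a database target per request: primary for writes, replicas round-robin for reads."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # With no replicas configured (as in Phase 1), every request uses the primary.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def target_for(self, http_method):
        if http_method.upper() in WRITE_METHODS or self._replicas is None:
            return self.primary
        return next(self._replicas)

router = ConnectionRouter("primary", ["replica-1", "replica-2"])
print([router.target_for(m) for m in ["GET", "POST", "GET", "GET", "DELETE"]])
# ['replica-1', 'primary', 'replica-2', 'replica-1', 'primary']
```

One thing this sketch leaves out: replicas can lag slightly behind the primary, so a read that immediately follows a write may need to go to the primary as well.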

Phase 2 cost estimate

App Runner (Core API + Telemetry):     ~$50-70/month
RDS primary (db.r6g.large):        ~$200/month
RDS 2 read replicas:               ~$400/month
RDS storage (250 GB):              ~$25/month
RDS backups:                       ~$10/month
ElastiCache Redis:                 ~$25/month
NAT Gateway:                       ~$35/month
Other (Secrets, ECR, CW):          ~$10/month
                                   ──────────────
AWS Total:                         ~$755-775/month

External services:
  S3 uploads:                      ~$50-100/month
  Cloudflare:                      Free or $20/month
  Clerk:                           $200/month
  Daily.co:                        ~$200/month
                                   ──────────────
GRAND TOTAL:                       ~$1,205-1,295/month

Phase 3 architecture (months 24-36)

Two-tier system on AWS

Shared tier (90 small/medium clinics):

  • 1 App Runner service (Core API, scales to 10+ instances)
  • 1 RDS cluster (primary + 2 replicas, db.r6g.xlarge)
  • 1 ElastiCache Redis

Enterprise tier (10 large clinics, each gets dedicated infrastructure):

  • 1 App Runner service per enterprise clinic
  • 1 RDS instance per enterprise clinic (db.r6g.large)
  • Automated provisioning via script
Shared Tier                          Enterprise Tier
┌─────────────────────┐              ┌─────────────────────┐
│ App Runner: Core API│              │ App Runner: org-101 │
│ (10 instances)      │              │ (2 instances)       │
│         │           │              │         │           │
│    ┌────▼────┐      │              │    ┌────▼────┐      │
│    │ RDS     │      │              │    │ RDS     │      │
│    │ xlarge  │      │              │    │ large   │      │
│    │ + 2 rep │      │              │    │         │      │
│    └─────────┘      │              │    └─────────┘      │
└─────────────────────┘              └─────────────────────┘
                                     ┌─────────────────────┐
                                     │ App Runner: org-102 │
                                     │ ...                 │
                                     └─────────────────────┘
                                     (repeat for each enterprise org)

All enterprise infrastructure lives in the same VPC — different security groups isolate each tenant's database.

Automated enterprise provisioning

When a new enterprise clinic signs up, a script provisions their entire stack:

Provisioning creates:
  1. RDS instance (db.r6g.large, encrypted, Multi-AZ) in existing VPC
  2. Security group (only this tenant's App Runner service can connect)
  3. App Runner service (pointing to shared ECR image, with VPC Connector)
  4. Secrets Manager entry (connection strings)
  5. CloudWatch alarms
  6. Routing table entry (tenant_shards)
  7. Run database migrations

Provisioning time: ~15-20 minutes (RDS creation is the bottleneck)
Can be triggered by: Admin API endpoint or CLI command

Phase 3 cost estimate

Shared Tier:
  App Runner (10 instances):          ~$100/month
  RDS xlarge + 2 replicas:           ~$1,200/month
  ElastiCache:                        ~$25/month
  Shared Total:                       ~$1,325/month

Enterprise Tier (10 orgs):
  App Runner per org:                 ~$25/month
  RDS large per org:                  ~$200/month
  Per-org cost:                       ~$225/month
  Enterprise Total:                   10 × $225 = $2,250/month

NAT Gateway + networking:             ~$50/month
Other (Secrets, ECR, CW):            ~$20/month

AWS Infrastructure Total:             ~$3,645/month

Revenue:
  90 shared × $150/month:            $13,500/month
  10 enterprise × $1,500/month:      $15,000/month
  Total Revenue:                      $28,500/month
  Gross Margin:                       87% ($24,855/month)

Phase 4 architecture (months 36+)

Multi-region on AWS

AWS makes this straightforward because App Runner and RDS are available in every major region.

┌──────────────────────────────────────────────┐
│  EU Region (eu-central-1, Frankfurt)          │
│                                               │
│  VPC: restartix-eu                            │
│  Shared Shards: EU-1, EU-2, EU-3             │
│  Enterprise: 10 dedicated projects            │
│  Why: GDPR data residency for EU clinics      │
└──────────────────────────────────────────────┘

┌──────────────────────────────────────────────┐
│  US Region (us-east-1, Virginia)              │
│                                               │
│  VPC: restartix-us                            │
│  Shared Shards: US-1, US-2, US-3, US-4, US-5│
│  Enterprise: 30 dedicated projects            │
│  Why: US clinics, lowest latency              │
└──────────────────────────────────────────────┘

Global Routing:
  DynamoDB Global Table (replicated across regions)
  ├── organization_id → region + shard assignment
  ├── Cached in Redis (1 min TTL)
  └── Cloudflare Workers routes to correct region
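
The lookup path (routing table, short-lived cache, then region) can be sketched as follows. This is a minimal Python model in which a dictionary stands in for the DynamoDB table and an in-memory map stands in for Redis. The `RoutingCache` class and the organization IDs are hypothetical; only the 60-second TTL comes from the design above.

```python
import time

class RoutingCache:
    """In-memory stand-in for the Redis cache in front of the routing table."""

    def __init__(self, lookup, ttl_seconds=60, clock=time.monotonic):
        self._lookup = lookup   # function: organization_id -> (region, shard)
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}      # organization_id -> (expires_at, value)

    def resolve(self, organization_id):
        now = self._clock()
        cached = self._entries.get(organization_id)
        if cached and cached[0] > now:
            return cached[1]    # still fresh: skip the table lookup
        value = self._lookup(organization_id)
        self._entries[organization_id] = (now + self._ttl, value)
        return value

# Hypothetical routing table; in the design above this lives in DynamoDB.
ROUTING_TABLE = {"org-101": ("eu-central-1", "EU-1"), "org-202": ("us-east-1", "US-3")}
lookups = []

def lookup(org_id):
    lookups.append(org_id)
    return ROUTING_TABLE[org_id]

cache = RoutingCache(lookup)
print(cache.resolve("org-101"), cache.resolve("org-101"), len(lookups))
# ('eu-central-1', 'EU-1') ('eu-central-1', 'EU-1') 1
```

The short TTL is a trade-off: a tenant that moves to another shard may be routed to the old one for up to a minute, in exchange for far fewer table reads.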

Phase 4 cost estimate

Shared Tier (8 shards across 2 regions):
  8 × App Runner + RDS:              ~$3,600/month

Enterprise Tier (50 dedicated):
  50 × $225:                          ~$11,250/month

Global routing (DynamoDB):            ~$50/month
Cross-region data transfer:           ~$100/month
NAT Gateways (2 regions):            ~$70/month

AWS Infrastructure Total:             ~$15,070/month

Revenue:
  150 shared × $150:                  $22,500/month
  50 enterprise × $1,500:            $75,000/month
  Total Revenue:                      $97,500/month
  Gross Margin:                       85% ($82,430/month)
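
The margin figures in the Phase 3 and Phase 4 estimates follow directly from revenue minus AWS infrastructure cost. Note that they do not subtract external services such as Clerk or Daily.co. Here is a short Python check; the function name is hypothetical.

```python
def gross_margin(revenue, infra_cost):
    """Return (margin in dollars, margin as a fraction of revenue)."""
    margin = revenue - infra_cost
    return margin, margin / revenue

phase3_revenue = 90 * 150 + 10 * 1500   # shared + enterprise clinics
phase4_revenue = 150 * 150 + 50 * 1500

for name, revenue, cost in [
    ("Phase 3", phase3_revenue, 3645),
    ("Phase 4", phase4_revenue, 15070),
]:
    dollars, fraction = gross_margin(revenue, cost)
    print(f"{name}: revenue ${revenue:,}, margin ${dollars:,} ({fraction:.0%})")
# Phase 3: revenue $28,500, margin $24,855 (87%)
# Phase 4: revenue $97,500, margin $82,430 (85%)
```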

Cost summary across all phases

| Phase | Clinics | AWS infra | External services | Total monthly | vs Railway+Neon path |
| --- | --- | --- | --- | --- | --- |
| 1 | 1-10 | ~$163-188 | ~$310-350 | ~$473-538 | Same cost, way better guarantees |
| 2 | 10-50 | ~$755-775 | ~$450-520 | ~$1,205-1,295 | ~$200 more, but no Neon migration needed |
| 3 | 50-100 | ~$3,645 | ~$500 | ~$4,145 | Similar |
| 4 | 100-1000+ | ~$15,070 | ~$500 | ~$15,570 | Similar |

Phase 1 is the same cost as the Railway+Neon path. The difference: you get 99.99% SLA, HIPAA BAA, private networking, and zero provider migrations in the future. At Phase 2+, costs are comparable because both paths use similar RDS infrastructure.


AWS account setup guide

This section is for a solo developer who is not a DevOps engineer. Every step is explicit.

Step 1: AWS account

If you already have an AWS account (you do — for S3), skip to Step 2.

Otherwise:

  1. Go to aws.amazon.com → Create an AWS Account
  2. Use a business email, not personal
  3. Add a payment method
  4. Select a paid support plan: "Developer" starts at $29/month; "Business" starts at $100/month and can be added later if you need it for production

Step 2: Secure the account

1. Enable MFA on root account:
   AWS Console → IAM → Security credentials → Assign MFA device

2. Create an IAM user for daily work (never use root):
   IAM → Users → Create user
   Name: your-name-admin
   Attach policy: AdministratorAccess
   Enable console access + programmatic access

3. Create a deploy user (for GitHub Actions):
   IAM → Users → Create user
   Name: github-actions-deploy
   Attach policies:
     - AmazonEC2ContainerRegistryPowerUser (push images to ECR)
     - AWSAppRunnerFullAccess (manage App Runner)
   Programmatic access only (no console)
   Save the Access Key ID and Secret Access Key → add to GitHub Secrets

Step 3: Create the VPC

Follow the "Setup: creating the VPC" instructions in the VPC section above. One wizard, one click, done.

Step 4: Create RDS PostgreSQL

AWS Console → RDS → Create database

Settings:
  Engine: PostgreSQL 17
  Template: Production
  Instance: db.t4g.medium
  Storage: 50 GB gp3, enable auto-scaling (max 200 GB)
  Multi-AZ: Yes
  VPC: restartix-prod
  Subnet group: Create new (select both private subnets)
  Public access: No
  Security group: restartix-rds
  Database name: restartix
  Master username: restartix_admin
  Master password: (generate a strong one, save in Secrets Manager)
  Backup retention: 7 days
  Encryption: Enabled
  Enhanced monitoring: Enabled
  Performance Insights: Enabled

Click "Create database" → wait ~10 minutes

Step 5: Create ElastiCache Redis

AWS Console → ElastiCache → Create Redis cluster

Settings:
  Cluster mode: Disabled
  Node type: cache.t4g.micro
  Number of replicas: 0 (Phase 1, ephemeral data)
  Subnet group: Create new (select private subnets from restartix-prod VPC)
  Security group: restartix-redis
  Encryption in transit: Yes
  Encryption at rest: Yes

Click "Create" → wait ~5 minutes

Step 6: Create ECR repositories

AWS Console → ECR → Create repository

Repository 1: restartix-core-api
  - Visibility: Private
  - Image tag mutability: Mutable
  - Encryption: AES-256

Repository 2: restartix-telemetry-api
  - Same settings

Enable lifecycle policy (clean up old images):
  Rule: Delete untagged images older than 30 days
  Rule: Keep only last 10 tagged images

Step 7: Create App Runner services

AWS Console → App Runner → Create service

Service 1: restartix-core-api
  Source: Container registry → Amazon ECR
  Image: <account-id>.dkr.ecr.eu-central-1.amazonaws.com/restartix-core-api:latest
  Deployment: Automatic (deploy on new image push)
  Port: 9000 (must match the PORT environment variable)

  Instance configuration:
    CPU: 1 vCPU
    Memory: 2 GB

  Auto scaling:
    Min instances: 1
    Max instances: 5
    Max concurrency: 100

  Health check:
    Path: /health
    Protocol: HTTP

  Networking:
    VPC Connector: restartix-vpc-connector (created in Step 3)

  Environment variables:
    (Add all from the "App Runner service configuration" section above)
    For secrets: Use "Reference a secret" → select from Secrets Manager

Service 2: restartix-telemetry-api
  Same process, different image, port (4000), and smaller instance (0.5 vCPU, 1 GB)
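
To get a feel for what the auto scaling settings of the core API mean in practice, here is a rough sketch in Python. It assumes a deliberately simplified model in which the instance count is the number of concurrent requests divided by max concurrency, clamped to the min/max bounds. Real App Runner scaling reacts over time and may behave differently; the function names are hypothetical.

```python
import math


def estimated_instances(
    concurrent_requests: int,
    max_concurrency: int = 100,
    min_instances: int = 1,
    max_instances: int = 5,
) -> int:
    """Estimate how many instances serve a given number of concurrent requests."""
    needed = math.ceil(concurrent_requests / max_concurrency)
    # Never go below the minimum or above the maximum configured for the service.
    return max(min_instances, min(max_instances, needed))


def max_concurrent_capacity(max_concurrency: int = 100, max_instances: int = 5) -> int:
    """Upper bound on concurrent requests before the service is saturated."""
    return max_concurrency * max_instances


if __name__ == "__main__":
    for load in (0, 100, 101, 250, 10_000):
        print(load, "->", estimated_instances(load))
    print("capacity:", max_concurrent_capacity())
```

With the settings from this step, the service tops out at roughly 5 × 100 = 500 concurrent requests under this model, which is a useful number to compare against the Phase 2 triggers.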

Step 8: Set up custom domain

In App Runner service → Custom domains → Link domain
  Domain: api.restartix.com

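App Runner provides:
  Certificate validation records: CNAMEs pointing to xxx.acm-validations.aws
  DNS target: xxx.eu-central-1.awsapprunner.com

In Cloudflare DNS:
  Add each certificate validation record:
    Type: CNAME
    Proxy: No (DNS only, so AWS can validate the certificate)
  Add CNAME record:
    Name: api
    Target: (the DNS target from App Runner)
    Proxy: Yes (orange cloud)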

Step 9: Set up monitoring and billing alerts

AWS Console → CloudWatch → Alarms → Create alarm

Create the alarms listed in the "Monitoring and observability" section.

For notifications:
  SNS → Create topic: restartix-alerts
  Add subscription: your email
  (Optional: Add Slack webhook via AWS Chatbot)

AWS Console → Billing → Budgets → Create budget

Budget 1: Monthly spend
  Amount: $200 (adjust for your phase)
  Alert at: 80% and 100%
  Notify: your email

Database migration: Neon to RDS

Before migration

  1. Set up the full AWS infrastructure (Steps 1-9 above)
  2. Verify App Runner services are healthy on the default AWS URLs
  3. Run database migrations on the new RDS instance
  4. Schedule a maintenance window (communicate to clinics — expect ~30 minutes)

Migration steps

1. Put application in maintenance mode (return 503 for all requests)

2. Dump the Neon database:
   pg_dump "$NEON_DATABASE_URL" --format=custom --no-owner > restartix-dump.pgdump

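3. Restore to RDS (the schema already exists from the pre-migration step,
   so drop and recreate objects from the dump to avoid "already exists" errors):
   pg_restore --host=restartix-prod.xxx.rds.amazonaws.com \
              --username=restartix_admin \
              --dbname=restartix \
              --clean --if-exists \
              --no-owner \
              --verbose \
              restartix-dump.pgdump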

4. Verify row counts match:
   Run SELECT count(*) FROM <table> on both databases for key tables

5. Update Secrets Manager:
   Change DATABASE_URL to the RDS endpoint

6. Restart App Runner services with a new deployment (secrets are read at deploy time, so this picks up the new value)

7. Switch Cloudflare DNS to point to App Runner (if not already done)

8. Remove maintenance mode

9. Test critical flows: login, create appointment, view patients

10. Monitor for 24 hours
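
The row count check in step 4 is easy to get wrong by eye when there are many tables. Below is a minimal sketch in Python of the comparison logic only. It assumes the counts have already been collected from each database (for example by running SELECT count(*) per table) into two dictionaries; the table names in the example are hypothetical.

```python
def compare_row_counts(source: dict[str, int], target: dict[str, int]) -> list[str]:
    """Return a human-readable list of differences; empty means counts match."""
    problems = []
    for table in sorted(set(source) | set(target)):
        src = source.get(table)
        dst = target.get(table)
        if src is None:
            problems.append(f"{table}: missing on source (target has {dst})")
        elif dst is None:
            problems.append(f"{table}: missing on target (source has {src})")
        elif src != dst:
            problems.append(f"{table}: source={src} target={dst}")
    return problems


if __name__ == "__main__":
    # Hypothetical counts, as they might come back from Neon and RDS.
    neon_counts = {"patients": 1200, "appointments": 5400}
    rds_counts = {"patients": 1200, "appointments": 5399}
    for line in compare_row_counts(neon_counts, rds_counts):
        print("MISMATCH", line)
```

An empty result is the condition for moving on to step 5. Any mismatch should stop the migration while the application is still in maintenance mode.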

Rollback plan

If anything goes wrong:

  1. Change DATABASE_URL back to Neon in Secrets Manager
  2. Restart App Runner services
  3. Traffic returns to Neon within minutes (rows written to RDS after the cutover are not in Neon and must be reconciled separately)
  4. Investigate at your own pace

Keep Neon running for 7 days after migration. Then cancel.


Backup strategy on AWS

The existing Backup & Disaster Recovery strategy stays the same, with RDS replacing Neon as the primary database.

What changes

| Layer | Before (Neon) | After (RDS) |
|---|---|---|
| Layer 0: Live DB | Neon serverless | RDS db.t4g.medium (dedicated) |
| Layer 1: Vendor backups | Neon PITR (7-30 days depending on plan) | RDS automated backups (7-day PITR, 5-minute RPO) |
| Layer 2: Daily backups | pg_dump → S3 | Same: pg_dump → S3 (vendor independence) |
| Layer 3: Cross-region | S3 replication | Same: S3 cross-region replication |
| Layer 4: Offline | Physical media | Same |

RDS backup advantages

  • Automated continuous backup — AWS handles it, runs in the background
  • Point-in-time recovery — restore to any second in the last 7 days
  • Recovery point objective — 5 minutes (vs 24 hours with daily pg_dump alone)
  • Automated snapshots — daily, retained for 7 days
  • Manual snapshots — before migrations or risky changes, retained indefinitely
  • Cross-region replication — native, for disaster recovery in Phase 4
  • Multi-AZ — standby replica auto-promoted if primary fails

Daily pg_dump to S3 (Layer 2) continues as an independent safety layer — vendor independence matters even when the vendor is AWS.
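
To make the 7-day retention and 5-minute RPO concrete, here is a small sketch in Python that works out the window of timestamps a point-in-time restore could target. It assumes automated backups have been enabled for the full retention period and treats the 5-minute RPO as a fixed lag, which is a simplification; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)   # Backup retention configured on the RDS instance
RPO_LAG = timedelta(minutes=5)  # Assumed lag behind "now" for the latest restorable time


def restorable_window(now: datetime) -> tuple[datetime, datetime]:
    """Return the (earliest, latest) timestamps a restore could target."""
    return now - RETENTION, now - RPO_LAG


def can_restore_to(target: datetime, now: datetime) -> bool:
    """Check whether a target timestamp falls inside the restorable window."""
    earliest, latest = restorable_window(now)
    return earliest <= target <= latest


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    earliest, latest = restorable_window(now)
    print("earliest:", earliest.isoformat())
    print("latest:  ", latest.isoformat())
```

Anything older than the earliest timestamp has to come from the daily pg_dump in S3 (Layer 2), which is one reason that layer stays in place.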


Security and compliance on AWS

HIPAA compliance

1. Enable HIPAA eligibility on AWS account:
   AWS Console → AWS Artifact → Accept the AWS BAA
   (This is free — just a legal agreement)

2. HIPAA-eligible services we use:
   ✅ App Runner
   ✅ ECR
   ✅ RDS
   ✅ ElastiCache
   ✅ S3
   ✅ Secrets Manager
   ✅ CloudWatch
   ✅ IAM
   ✅ VPC

3. Encryption requirements (all met):
   ✅ Data at rest: RDS (AES-256), ElastiCache (AES-256), S3 (SSE-S3)
   ✅ Data in transit: TLS 1.2+ everywhere (enforced by service configuration; security groups only restrict which ports and sources can connect)
   ✅ Secrets: AWS Secrets Manager (AES-256)
   ✅ Database: VPC-private, no public access

IAM roles and least privilege

Role: AppRunnerInstanceRole
  Used by: App Runner services
  Permissions:
    - secretsmanager:GetSecretValue (restartix/production/*)
    - s3:PutObject, s3:GetObject (restartix-uploads-prod/*)
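
Role: App Runner access role (separate from the instance role)
  Used by: App Runner, to pull images from ECR at deploy time
  Permissions:
    - ecr:GetAuthorizationToken, ecr:BatchCheckLayerAvailability
    - ecr:GetDownloadUrlForLayer, ecr:BatchGetImage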

Role: GitHubActionsDeployRole
  Used by: GitHub Actions CI/CD
  Permissions:
    - ecr:GetAuthorizationToken, ecr:BatchCheckLayerAvailability
    - ecr:InitiateLayerUpload, ecr:UploadLayerPart, ecr:CompleteLayerUpload, ecr:PutImage
    - apprunner:UpdateService (if manual trigger needed)

Role: BackupJobRole
  Used by: Backup automation (Lambda or cron)
  Permissions:
    - s3:PutObject (restartix-backups-primary/*)
    - secretsmanager:GetSecretValue (backup encryption key only)
    - rds:CreateDBSnapshot (for manual pre-migration snapshots)

Principle: Each role can only do exactly what it needs. Nothing more.
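
As an illustration of what least privilege looks like as an actual policy document, here is a sketch in Python that builds the Secrets Manager and S3 statements for AppRunnerInstanceRole. The account ID is a placeholder, the function name and statement IDs are hypothetical, and the region is taken from the ECR image URL used earlier.

```python
import json


def instance_role_policy(region: str, account_id: str) -> dict:
    """Build an IAM policy document scoped to the resources listed above."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadProductionSecrets",
                "Effect": "Allow",
                "Action": ["secretsmanager:GetSecretValue"],
                # Only secrets under the restartix/production/ prefix.
                "Resource": (
                    f"arn:aws:secretsmanager:{region}:{account_id}"
                    ":secret:restartix/production/*"
                ),
            },
            {
                "Sid": "ReadWriteUploads",
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject"],
                # Objects in the uploads bucket only; no bucket-level actions.
                "Resource": "arn:aws:s3:::restartix-uploads-prod/*",
            },
        ],
    }


if __name__ == "__main__":
    # 123456789012 is a placeholder account ID, not a real one.
    print(json.dumps(instance_role_policy("eu-central-1", "123456789012"), indent=2))
```

Note that no statement uses a wildcard action or a "*" resource. Every permission names both the exact API call and the exact resource prefix it applies to.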

Network security

Security Group: restartix-rds
  Inbound:
    - Port 5432 from restartix-apprunner-connector security group only
    - No public access. Not from your laptop. Not from anywhere else.
  Outbound:
    - None needed

Security Group: restartix-redis
  Inbound:
    - Port 6379 from restartix-apprunner-connector security group only
  Outbound:
    - None needed

Security Group: restartix-apprunner-connector
  Inbound:
    - None (App Runner initiates connections, doesn't receive them here)
  Outbound:
    - Port 5432 to restartix-rds (database)
    - Port 6379 to restartix-redis (cache)
    - Port 443 to 0.0.0.0/0 (HTTPS to Clerk, Daily.co, S3, etc. via NAT Gateway)

Result:
  Database and Redis are completely invisible to the internet.
  Only your App Runner services can reach them.
  Your App Runner services can reach external APIs through the NAT Gateway.
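
The rules above can be sanity-checked with a small model. The sketch below, in Python, encodes the three security groups and tests whether a new connection would be allowed. It is a simplification that matches only on port and peer name, treats 0.0.0.0/0 as "any destination", and assumes sources outside these groups have unrestricted outbound access; the class and function names are hypothetical.

```python
from dataclasses import dataclass

ANYWHERE = "0.0.0.0/0"
CONNECTOR = "restartix-apprunner-connector"


@dataclass(frozen=True)
class Rule:
    port: int
    peer: str  # Security group name or CIDR


INBOUND = {
    "restartix-rds": [Rule(5432, CONNECTOR)],
    "restartix-redis": [Rule(6379, CONNECTOR)],
    CONNECTOR: [],
}

OUTBOUND = {
    "restartix-rds": [],
    "restartix-redis": [],
    CONNECTOR: [
        Rule(5432, "restartix-rds"),
        Rule(6379, "restartix-redis"),
        Rule(443, ANYWHERE),
    ],
}


def can_connect(source: str, dest: str, port: int) -> bool:
    """Return True if a new connection from source to dest on port is allowed."""
    if source in OUTBOUND:
        out_ok = any(
            r.port == port and r.peer in (dest, ANYWHERE) for r in OUTBOUND[source]
        )
    else:
        # Sources outside our groups (e.g. the internet) are not limited by them.
        out_ok = True
    in_ok = any(
        r.port == port and r.peer in (source, ANYWHERE) for r in INBOUND.get(dest, [])
    )
    return out_ok and in_ok


if __name__ == "__main__":
    print(can_connect(CONNECTOR, "restartix-rds", 5432))   # App Runner to database
    print(can_connect("internet", "restartix-rds", 5432))  # Public internet to database
```

Security groups are stateful, so replies to an allowed connection flow back automatically. That is why the RDS and Redis groups need no outbound rules.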

What you manage day-to-day

After the one-time setup, here is everything you need to do on an ongoing basis:

| Task | How | Frequency |
|---|---|---|
| Deploy code | git push to main | Whenever you ship |
| Check if deploy succeeded | GitHub Actions tab or CloudWatch | After each push |
| View application logs | CloudWatch → Log groups | When debugging |
| Check service health | App Runner console → service status | Glance weekly |
| Check database health | RDS → Performance Insights | Glance weekly |
| Review AWS bill | Billing dashboard | Monthly |
| Respond to alarms | Email/Slack notification → investigate | When they fire |
| Rotate secrets | Secrets Manager → update value | Yearly (or when compromised) |
| Update Docker base image | Change 1 line in Dockerfile, git push | Every few months |
| Resize RDS instance | Console → Modify → pick larger instance | When Phase 2 triggers hit |

What you never do:

  • Patch servers (App Runner is serverless, RDS is managed)
  • Renew SSL certificates (automatic)
  • Scale App Runner instances up or down (auto-scaling)
  • Manage load balancers (App Runner handles it)
  • Run database backups (RDS automated backups)
  • Manage VPC/networking (set once, never touch again)