AWS Infrastructure Strategy
RestartiX is moving from Railway to AWS. This document covers everything — from why we're making the move, to exactly what services we need, how to set them up, how they scale across all four growth phases, and what it costs.
Why we're moving to AWS
Railway served us well for early development, but production healthcare SaaS requires guarantees that Railway cannot provide:
| Concern | Railway + Neon | AWS |
|---|---|---|
| Uptime SLA | None published (Railway), varies (Neon) | Published SLAs, 99.9%+ (App Runner, RDS Multi-AZ, S3) |
| HIPAA BAA | Not available (Railway), $500+/mo (Neon Enterprise) | Available for free (just sign the agreement) |
| Incident transparency | Limited | Full public status page + personal health dashboard |
| Production reliability | Inconsistent — frequent Railway issues | Battle-tested by hospitals, banks, governments |
| Data residency | US only (limited regions) | 30+ global regions, including EU for GDPR |
| Database isolation | Neon is shared multi-tenant compute | RDS is a dedicated instance — your machine, your resources |
| Scales to Phase 4 | Requires migration at Phase 2-3 | Same provider from Phase 1 through Phase 4 |
Bottom line: We were going to end up on AWS anyway (database, backups, enterprise tier). Moving everything now means zero provider migrations in the future, and we get production-grade reliability, HIPAA BAA, and private networking from day one.
What changes and what doesn't
| What changes | What doesn't change |
|---|---|
| Core API hosting: Railway → AWS App Runner | Application code (zero changes) |
| Telemetry API hosting: Railway → AWS App Runner | API contracts and endpoints |
| Database: Neon → AWS RDS PostgreSQL | Database schema, RLS policies, queries |
| Redis: Railway plugin → ElastiCache | Cloudflare (CDN, WAF, DDoS) |
| Deploys: Railway CLI → GitHub Actions + ECR | Clerk (authentication) |
| Secrets: Railway env vars → AWS Secrets Manager | Daily.co (video calls) |
| Monitoring: Railway dashboard → CloudWatch | S3 (already on AWS) |
| | How clinics experience the product |
AWS services map
Every AWS service we use and why. Nothing more — we don't use services we don't need.
Phase 1 (now)
| Purpose | AWS Service | Why this one |
|---|---|---|
| Run Core API | App Runner | Railway-like simplicity. Push container, it runs. Auto-scales. |
| Run Telemetry API | App Runner | Same. Separate service, independent scaling. |
| Container registry | ECR (Elastic Container Registry) | Store Docker images. App Runner pulls from here. |
| Database | RDS PostgreSQL | Dedicated instance, private networking, HIPAA BAA included, automated backups. |
| Redis | ElastiCache Redis | VPC-private, encrypted, managed. |
| Private networking | VPC + Subnets + Security Groups | Database and Redis never exposed to the internet. |
| Outbound traffic | NAT Gateway | Lets App Runner reach external services (Clerk, Daily.co) through the VPC. |
| Secrets | Secrets Manager | Store DATABASE_URL, API keys, encryption keys. Rotatable, auditable. |
| File storage | S3 (already using) | No change. |
| Backups | RDS automated + S3 | RDS handles continuous backup. pg_dump to S3 for vendor independence. |
| Monitoring | CloudWatch | Logs, metrics, alarms. Comes free with App Runner. |
| CI/CD | GitHub Actions | Build → push to ECR → App Runner auto-deploys. |
| DNS | Cloudflare (not AWS) | Already using. Stays. No need for Route 53. |
| SSL/TLS | App Runner (auto) + Cloudflare | Both handle SSL. Zero config. |
Added in later phases
| Purpose | AWS Service | When |
|---|---|---|
| Read replicas | RDS Read Replicas | Phase 2 (when read/write split needed) |
| Enterprise isolation | Multiple App Runner + RDS per tenant | Phase 3 |
| Multi-region | App Runner + RDS in eu-west-1, us-east-1 | Phase 4 |
| Global routing | DynamoDB (routing table) | Phase 4 |
VPC explained (it's simpler than it sounds)
A VPC sounds scary but it's really just one idea: things that should talk to each other are in the same private room, and things that shouldn't can't get in.
Think of it like an office building:
┌─────────────────────────────────────────────────────────────┐
│ YOUR VPC (the building) │
│ │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ Room A │ │ Room B │ │
│ │ (Private Subnet) │ │ (Private Subnet) │ │
│ │ │ │ │ │
│ │ PostgreSQL database │ │ PostgreSQL standby │ │
│ │ Redis cache │ │ (automatic failover) │ │
│ │ │ │ │ │
│ └──────────────────────┘ └──────────────────────┘ │
│ ▲ ▲ │
│ │ │ │
│ ┌────────┴───────────────────────────┴──────────────┐ │
│ │ VPC Connector (the hallway) │ │
│ │ Only App Runner services have the key │ │
│ └────────────────────────────────────────────────────┘ │
│ ▲ │
│ │ │
│ ┌────────┴──────────────────────────────────────────┐ │
│ │ NAT Gateway (the front door, outbound only) │ │
│ │ Lets your apps call Clerk, Daily.co, etc. │ │
│ │ Nobody outside can come in through it │ │
│ └────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
▲
│
┌──────────┴──────────────────────────────────────────┐
│ App Runner (your Go APIs) │
│ Lives outside the building but has a VPC Connector │
│ — a private tunnel into the rooms │
└─────────────────────────────────────────────────────┘
What you actually create (one time, with my help):
| Thing | What it is | Analogy |
|---|---|---|
| VPC | A private network | The building |
| 2 Private Subnets | Sections of the network in different data centers | Two rooms on different floors (redundancy) |
| Security Group (RDS) | Firewall rule: "only App Runner can connect on port 5432" | A door lock that only your key opens |
| Security Group (Redis) | Same but for port 6379 | Same concept, different door |
| VPC Connector | A managed tunnel from App Runner into your VPC | The private hallway |
| NAT Gateway | Outbound internet access for the VPC | The front door (exit only) |
That's it. Six things, created once, never touched again. AWS manages them after creation. No servers to patch, no firewalls to configure manually, no networking knowledge needed.
Setup: creating the VPC (step by step)
The easiest path is the AWS Console wizard, which creates most of this in one click:
AWS Console → VPC → Create VPC
Choose: "VPC and more" (the wizard)
Settings:
Name: restartix-prod
IPv4 CIDR: 10.0.0.0/16 (just use this default)
Number of AZs: 2 (minimum for RDS Multi-AZ)
Public subnets: 2 (for NAT Gateway)
Private subnets: 2 (for RDS + Redis)
NAT Gateways: 1 (in 1 AZ — saves cost, sufficient for Phase 1)
VPC Endpoints: S3 (free, faster S3 access)
Click "Create VPC"
→ Done. The wizard creates everything: subnets, route tables, NAT gateway, internet gateway.
Then create the security groups:
AWS Console → VPC → Security Groups → Create
Security Group 1: restartix-rds
VPC: restartix-prod
Inbound rule:
Type: PostgreSQL (port 5432)
Source: (the App Runner VPC Connector security group)
That's the only rule. Nothing else can reach the database.
Security Group 2: restartix-redis
VPC: restartix-prod
Inbound rule:
Type: Custom TCP (port 6379)
Source: (the App Runner VPC Connector security group)
Then create the VPC Connector for App Runner:
AWS Console → App Runner → VPC Connectors → Create
Name: restartix-vpc-connector
VPC: restartix-prod
Subnets: Select both private subnets
Security Group: Create a new one (restartix-apprunner-connector)
— This group needs outbound access to ports 5432, 6379, and 443
→ Done. Now attach this connector to your App Runner services.
After this one-time setup, you never touch the VPC again. It just sits there, keeping your database private.
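For reference, the same security groups and VPC Connector can be created from the AWS CLI instead of the console. A sketch — every `vpc-`/`subnet-`/`sg-` ID below is a placeholder for the IDs the wizard actually created:

```shell
# Placeholders throughout — substitute the IDs from your own VPC.

# Security group for the App Runner VPC Connector
aws ec2 create-security-group \
  --group-name restartix-apprunner-connector \
  --description "App Runner VPC Connector" \
  --vpc-id vpc-XXXX

# Security group for RDS: allow 5432 only from the connector's group
aws ec2 create-security-group \
  --group-name restartix-rds \
  --description "RDS PostgreSQL" \
  --vpc-id vpc-XXXX
aws ec2 authorize-security-group-ingress \
  --group-id sg-RDS-XXXX \
  --protocol tcp --port 5432 \
  --source-group sg-CONNECTOR-XXXX

# Same pattern for Redis on 6379 (after creating restartix-redis the same way)
aws ec2 authorize-security-group-ingress \
  --group-id sg-REDIS-XXXX \
  --protocol tcp --port 6379 \
  --source-group sg-CONNECTOR-XXXX

# The VPC Connector itself
aws apprunner create-vpc-connector \
  --vpc-connector-name restartix-vpc-connector \
  --subnets subnet-PRIVATE-A subnet-PRIVATE-B \
  --security-groups sg-CONNECTOR-XXXX
```

Either path ends in the same place; the CLI version is just easier to paste into a runbook.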
Phase 1 architecture (current stage)
┌────────────────────────────────────┐
Patients ─────► Cloudflare ─────►│ AWS App Runner │
Specialists (CDN, WAF, │ │
Admins DDoS, SSL) │ ┌──────────────────┐ │
│ │ │ Core API │ │
│ │ │ (Go, auto-scale │──┐ │
│ │ │ 1-5 instances) │ │ │
│ │ └──────────────────┘ │ │
│ │ │ VPC │
│ │ ┌──────────────────┐ │Connector│
└─────────►│ │ Telemetry API │ │ │ │
│ │ (Go, auto-scale │──┘ │ │
│ │ 1-3 instances) │ │ │
│ └──────────────────┘ │ │
└────────────────────────────┼──────┘
│
┌────────────────────────────▼──────┐
│ AWS VPC (private network) │
│ │
│ ┌────────────────┐ │
│ │ RDS PostgreSQL │ │
│ │ db.t4g.medium │ │
│ │ (Multi-AZ) │ │
│ └────────────────┘ │
│ │
│ ┌────────────────┐ │
│ │ ElastiCache │ │
│ │ Redis (1 GB) │ │
│ └────────────────┘ │
└────────────────────────────────────┘
Also on AWS:
├── S3: restartix-uploads-prod (patient files)
├── S3: restartix-backups-primary (database backups)
├── S3: restartix-backups-replica (cross-region backups)
└── ECR: container images (Core API + Telemetry)
CI/CD:
└── GitHub Actions → build Docker → push to ECR → App Runner auto-deploys
RDS PostgreSQL setup (Phase 1)
A small but dedicated database instance. More than enough for 1-10 clinics and 100k patients.
Instance: db.t4g.medium (2 vCPU, 4 GB RAM)
Engine: PostgreSQL 17
Storage: 50 GB gp3 (3,000 IOPS baseline, auto-expand enabled)
Multi-AZ: Enabled (automatic failover to standby in another data center)
Encryption at rest: Enabled (AES-256, AWS-managed key)
Encryption in transit: Enabled (TLS required)
Backup retention: 7 days (automated, continuous)
PITR: Enabled (restore to any second in the last 7 days)
Public access: Disabled (VPC-private only)
Parameter Group (custom):
max_connections: 200
shared_buffers: 1GB
effective_cache_size: 3GB
work_mem: 32MB
maintenance_work_mem: 256MB
Monitoring:
Enhanced Monitoring: Enabled (1-minute granularity)
Performance Insights: Enabled (free tier, 7-day retention)
Cost:
Instance (db.t4g.medium): ~$55/month
Storage (50 GB gp3): ~$8/month
Backups: ~$5/month
Total RDS: ~$68/month
Why db.t4g.medium for Phase 1:
- 2 vCPU, 4 GB RAM is more than enough for 50-100 concurrent connections
- Burstable — uses CPU credits during quiet periods, bursts for peak load
- Multi-AZ gives automatic failover even on the smallest instance
- Can resize to db.t4g.large (8 GB) or db.r6g.large (16 GB) later with minimal downtime
Connection math (Phase 1):
Core API: 3 instances × 20 pool = 60 connections
Telemetry API: 2 instances × 15 pool = 30 connections
Background jobs + monitoring: 10 connections
Total: ~100 connections
max_connections: 200 (50% headroom)
ElastiCache Redis setup (Phase 1)
Instance: cache.t4g.micro (2 vCPU, 0.5 GB)
Engine: Redis 7
Multi-AZ: No (Redis data is ephemeral — booking holds, rate limits, idempotency keys)
Encryption in transit: Enabled
Encryption at rest: Enabled
VPC: restartix-prod (same as RDS)
Security Group: restartix-redis
Cost: ~$12/month
App Runner service configuration
Core API:
Service: restartix-core-api
Source: ECR image (auto-deploy on new image push)
Instance:
CPU: 1 vCPU
Memory: 2 GB
Auto-scaling:
Min instances: 1
Max instances: 5
Max concurrency: 100 # requests per instance before scaling up
Max request timeout: 30s
Health check:
Path: /health
Protocol: HTTP
Interval: 10s
Timeout: 5s
Healthy threshold: 1
Unhealthy threshold: 5
Networking:
VPC Connector: restartix-vpc-connector
Environment variables:
DATABASE_URL: (from Secrets Manager — RDS private endpoint)
REDIS_URL: (from Secrets Manager — ElastiCache private endpoint)
CLERK_SECRET_KEY: (from Secrets Manager)
CLERK_WEBHOOK_SECRET: (from Secrets Manager)
S3_BUCKET: restartix-uploads-prod
S3_REGION: eu-central-1
DAILY_API_KEY: (from Secrets Manager)
ENCRYPTION_KEY: (from Secrets Manager)
APP_ENV: production
LOG_LEVEL: info
DB_POOL_MAX: 30
PORT: 9000
Telemetry API:
Service: restartix-telemetry-api
Source: ECR image (auto-deploy on new image push)
Instance:
CPU: 0.5 vCPU
Memory: 1 GB
Auto-scaling:
Min instances: 1
Max instances: 3
Max concurrency: 200 # telemetry events are lightweight
Max request timeout: 10s
Health check:
Path: /health
Protocol: HTTP
Interval: 10s
Timeout: 5s
Networking:
VPC Connector: restartix-vpc-connector
Environment variables:
DATABASE_URL: (from Secrets Manager — RDS private endpoint)
CLICKHOUSE_URL: (from Secrets Manager)
APP_ENV: production
LOG_LEVEL: info
PORT: 4000
CI/CD pipeline: GitHub Actions
After this is set up, deploying is git push to main — identical to the Railway workflow.
# .github/workflows/deploy.yml
name: Build and Deploy to AWS
on:
push:
branches: [main]
env:
AWS_REGION: eu-central-1
ECR_REGISTRY: <account-id>.dkr.ecr.eu-central-1.amazonaws.com
jobs:
deploy-api:
name: Deploy Core API
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: ${{ env.AWS_REGION }}
- name: Login to ECR
uses: aws-actions/amazon-ecr-login@v2
- name: Build and push Core API image
run: |
docker build -t restartix-core-api -f Dockerfile.api .
docker tag restartix-core-api:latest $ECR_REGISTRY/restartix-core-api:latest
docker tag restartix-core-api:latest $ECR_REGISTRY/restartix-core-api:${{ github.sha }}
docker push $ECR_REGISTRY/restartix-core-api:latest
docker push $ECR_REGISTRY/restartix-core-api:${{ github.sha }}
# App Runner auto-deploys when a new image is pushed to ECR.
# No additional step needed.
deploy-telemetry:
name: Deploy Telemetry API
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: ${{ env.AWS_REGION }}
- name: Login to ECR
uses: aws-actions/amazon-ecr-login@v2
- name: Build and push Telemetry API image
run: |
docker build -t restartix-telemetry-api -f Dockerfile.telemetry .
docker tag restartix-telemetry-api:latest $ECR_REGISTRY/restartix-telemetry-api:latest
docker tag restartix-telemetry-api:latest $ECR_REGISTRY/restartix-telemetry-api:${{ github.sha }}
docker push $ECR_REGISTRY/restartix-telemetry-api:latest
docker push $ECR_REGISTRY/restartix-telemetry-api:${{ github.sha }}
What this does:
- You push code to main
- GitHub Actions builds Docker images for both services
- Images are pushed to ECR (AWS container registry)
- App Runner detects the new image and auto-deploys with zero downtime
- Old instances drain, new instances start — no manual intervention
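The workflow above builds `Dockerfile.api` (and `Dockerfile.telemetry`). As a sketch of what such a file might contain — the Go version, build path (`./cmd/api`), and distroless base image are assumptions, not the repo's actual Dockerfile:

```dockerfile
# Dockerfile.api — hypothetical multi-stage sketch for the Go Core API
FROM golang:1.23 AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
# Static binary so the runtime image needs no libc
RUN CGO_ENABLED=0 go build -o /core-api ./cmd/api

FROM gcr.io/distroless/static-debian12
COPY --from=build /core-api /core-api
# Matches the PORT env var in the App Runner service configuration
EXPOSE 9000
ENTRYPOINT ["/core-api"]
```

A multi-stage build like this keeps the image App Runner pulls down to a few megabytes — faster deploys, smaller attack surface.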
Secrets management
All sensitive values move from Railway environment variables to AWS Secrets Manager.
Secret: restartix/production/core-api
Values:
DATABASE_URL: postgres://restartix:[email protected]:5432/restartix?sslmode=require
REDIS_URL: rediss://restartix-redis.xxx.euc1.cache.amazonaws.com:6379
CLERK_SECRET_KEY: sk_live_xxx
CLERK_WEBHOOK_SECRET: whsec_xxx
DAILY_API_KEY: xxx
ENCRYPTION_KEY: xxx
BACKUP_ENCRYPTION_KEY: xxx
Secret: restartix/production/telemetry-api
Values:
DATABASE_URL: postgres://restartix:[email protected]:5432/restartix?sslmode=require
CLICKHOUSE_URL: https://xxx.clickhouse.cloud:8443
Cost: $0.40/secret/month + $0.05 per 10,000 API calls
Total: ~$1-2/month
Why Secrets Manager instead of plain environment variables:
- Secrets are encrypted at rest and in transit
- Auditable — every access is logged in CloudTrail
- Rotatable — can rotate keys without redeploying
- HIPAA compliant — required for healthcare
- One place to manage all secrets across services
Custom domains and SSL
Setup:
- App Runner provides a default URL: https://xxx.eu-central-1.awsapprunner.com
- Add your custom domain in App Runner console (e.g., api.restartix.com)
- App Runner gives you a CNAME record
- Add the CNAME in Cloudflare DNS
- SSL is handled automatically by both Cloudflare and App Runner
Cloudflare configuration stays the same:
- DDoS protection
- WAF + OWASP rulesets
- Edge rate limiting
- TLS termination
- No changes needed
Monitoring and observability
App Runner sends logs and metrics to CloudWatch automatically. No setup needed.
Logs:
CloudWatch Log Groups:
/aws/apprunner/restartix-core-api/application → Application logs (slog JSON)
/aws/apprunner/restartix-core-api/service → App Runner system logs
/aws/apprunner/restartix-telemetry-api/application → Telemetry logs
/aws/apprunner/restartix-telemetry-api/service → Telemetry system logs
Key alarms to set up:
Alarm: Core API High Error Rate
Metric: 5xx count / total requests
Threshold: > 1% for 5 minutes
Action: SNS notification → email/Slack
Alarm: Core API High Latency
Metric: p99 response time
Threshold: > 1 second for 5 minutes
Action: SNS notification → email/Slack
Alarm: Core API Unhealthy
Metric: Health check failures
Threshold: > 3 consecutive failures
Action: SNS notification → email/Slack (critical)
Alarm: RDS High Connection Count
Metric: DatabaseConnections
Threshold: > 160 (80% of max_connections)
Action: SNS notification → email/Slack
Alarm: RDS High CPU
Metric: CPUUtilization
Threshold: > 80% for 10 minutes
Action: SNS notification → email
Alarm: RDS Low Free Storage
Metric: FreeStorageSpace
Threshold: < 10 GB
Action: SNS notification → email
Alarm: Monthly Spend Approaching Budget
Metric: AWS estimated charges
Threshold: > $200/month (adjust as needed)
Action: SNS notification → email
CloudWatch dashboard (create one):
- Request count per minute (both services)
- Error rate (5xx / total)
- Response time (p50, p95, p99)
- Active instances count
- CPU and memory utilization per instance
- RDS: connections, CPU, free storage, read/write IOPS
- ElastiCache: memory usage, connections, cache hits/misses
Phase 1 cost estimate
App Runner:
Core API (1 vCPU, 2 GB, 1-5 instances): ~$25-40/month
Telemetry API (0.5 vCPU, 1 GB, 1-3 inst): ~$10-20/month
RDS PostgreSQL:
db.t4g.medium (Multi-AZ): ~$55/month
Storage (50 GB gp3): ~$8/month
Automated backups: ~$5/month
ElastiCache Redis:
cache.t4g.micro: ~$12/month
Networking:
NAT Gateway: ~$35/month
Data transfer: ~$5/month
Other:
Secrets Manager: ~$2/month
ECR: ~$1/month
CloudWatch: ~$5/month
──────────
AWS Total: ~$163-188/month
External services (unchanged):
S3 uploads: ~$10-50/month
Cloudflare: Free
Clerk: $200/month
Daily.co: ~$100/month
──────────
GRAND TOTAL: ~$473-538/month
Compared to Railway + Neon (~$475-565/month), this is the same cost or cheaper — and you get:
- Published uptime SLAs on every service (none published on Railway)
- HIPAA BAA on everything (for free)
- Database not exposed to the internet
- Automated failover (Multi-AZ)
- 7-day continuous PITR with 5-minute RPO
- No connection limit ceiling
- Dedicated database resources
- No vendor migration later
Phase 2 architecture (months 12-24)
When read/write split is needed to handle more concurrent connections.
What changes from Phase 1
Phase 1: App Runner → VPC → RDS (single primary, Multi-AZ)
Phase 2: App Runner → VPC → RDS Primary (writes) + 2 Read Replicas (reads)
New infrastructure
The VPC stays the same. You just add read replicas.
RDS Primary (existing, upgraded):
Instance: db.r6g.large (2 vCPU, 16 GB RAM) # upgrade from t4g.medium
Storage: 250 GB gp3
RDS Read Replica 1:
Instance: db.r6g.large
Same AZ as subnet B
RDS Read Replica 2:
Instance: db.r6g.large
Same AZ as subnet A (or add subnet C)
ElastiCache Redis (upgraded):
Instance: cache.t4g.small (2 vCPU, 1.37 GB)
Connection routing
Mutations (POST/PUT/PATCH/DELETE) go to the primary. Reads (GET) go round-robin across replicas. This is handled in application middleware — no infrastructure change needed.
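That middleware can be sketched as a small router that inspects the HTTP method: writes to the primary, reads round-robin across replicas. Illustrative only — the real code would hold `*sql.DB` pools rather than endpoint names:

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// Router picks a database target per request: mutations go to the
// primary, reads rotate across replicas. Sketch of the middleware
// idea behind the Phase 2 read/write split.
type Router struct {
	primary  string
	replicas []string
	next     uint64
}

// Target returns which database endpoint a request should use.
func (r *Router) Target(method string) string {
	switch method {
	case http.MethodGet, http.MethodHead:
		if len(r.replicas) == 0 {
			return r.primary // no replicas yet — everything hits primary
		}
		n := atomic.AddUint64(&r.next, 1) // lock-free round-robin counter
		return r.replicas[int(n)%len(r.replicas)]
	default: // POST, PUT, PATCH, DELETE
		return r.primary
	}
}

func main() {
	r := &Router{primary: "primary", replicas: []string{"replica-1", "replica-2"}}
	fmt.Println(r.Target("POST")) // primary
	fmt.Println(r.Target("GET"))  // one replica
	fmt.Println(r.Target("GET"))  // the other replica
}
```

One caveat worth handling in the real middleware: a read immediately after a write (read-your-own-writes) may need to be pinned to the primary, since replicas lag by a small amount.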
Phase 2 cost estimate
App Runner (Core API + Telemetry): ~$50-70/month
RDS primary (db.r6g.large): ~$200/month
RDS 2 read replicas: ~$400/month
RDS storage (250 GB): ~$25/month
RDS backups: ~$10/month
ElastiCache Redis: ~$25/month
NAT Gateway: ~$35/month
Other (Secrets, ECR, CW): ~$10/month
──────────────
AWS Total: ~$755-775/month
External services:
S3 uploads: ~$50-100/month
Cloudflare: Free or $20/month
Clerk: $200/month
Daily.co: ~$200/month
──────────────
GRAND TOTAL: ~$1,205-1,295/month
Phase 3 architecture (months 24-36)
Two-tier system on AWS
Shared tier (90 small/medium clinics):
- 1 App Runner service (Core API, scales to 10+ instances)
- 1 RDS cluster (primary + 2 replicas, db.r6g.xlarge)
- 1 ElastiCache Redis
Enterprise tier (10 large clinics, each gets dedicated infrastructure):
- 1 App Runner service per enterprise clinic
- 1 RDS instance per enterprise clinic (db.r6g.large)
- Automated provisioning via script
Shared Tier Enterprise Tier
┌─────────────────────┐ ┌─────────────────────┐
│ App Runner: Core API│          │ App Runner: org-101 │
│ (10 instances) │ │ (2 instances) │
│ │ │ │ │ │
│ ┌────▼────┐ │ │ ┌────▼────┐ │
│ │ RDS │ │ │ │ RDS │ │
│ │ xlarge │ │ │ │ large │ │
│ │ + 2 rep │ │ │ │ │ │
│ └─────────┘ │ │ └─────────┘ │
└─────────────────────┘ └─────────────────────┘
┌─────────────────────┐
│ App Runner: org-102 │
│ ... │
└─────────────────────┘
(repeat for each enterprise org)
All enterprise infrastructure lives in the same VPC — different security groups isolate each tenant's database.
Automated enterprise provisioning
When a new enterprise clinic signs up, a script provisions their entire stack:
Provisioning creates:
1. RDS instance (db.r6g.large, encrypted, Multi-AZ) in existing VPC
2. Security group (only this tenant's App Runner service can connect)
3. App Runner service (pointing to shared ECR image, with VPC Connector)
4. Secrets Manager entry (connection strings)
5. CloudWatch alarms
6. Routing table entry (tenant_shards)
7. Run database migrations
Provisioning time: ~15-20 minutes (RDS creation is the bottleneck)
Can be triggered by: Admin API endpoint or CLI command
Phase 3 cost estimate
Shared Tier:
App Runner (10 instances): ~$100/month
RDS xlarge + 2 replicas: ~$1,200/month
ElastiCache: ~$25/month
Shared Total: ~$1,325/month
Enterprise Tier (10 orgs):
App Runner per org: ~$25/month
RDS large per org: ~$200/month
Per-org cost: ~$225/month
Enterprise Total: 10 × $225 = $2,250/month
NAT Gateway + networking: ~$50/month
Other (Secrets, ECR, CW): ~$20/month
AWS Infrastructure Total: ~$3,645/month
Revenue:
90 shared × $150/month: $13,500/month
10 enterprise × $1,500/month: $15,000/month
Total Revenue: $28,500/month
Gross Margin: 87% ($24,855/month)
Phase 4 architecture (months 36+)
Multi-region on AWS
AWS makes this straightforward because App Runner and RDS are available in every major region.
┌──────────────────────────────────────────────┐
│ EU Region (eu-central-1, Frankfurt) │
│ │
│ VPC: restartix-eu │
│ Shared Shards: EU-1, EU-2, EU-3 │
│ Enterprise: 10 dedicated projects │
│ Why: GDPR data residency for EU clinics │
└──────────────────────────────────────────────┘
┌──────────────────────────────────────────────┐
│ US Region (us-east-1, Virginia) │
│ │
│ VPC: restartix-us │
│ Shared Shards: US-1, US-2, US-3, US-4, US-5│
│ Enterprise: 30 dedicated projects │
│ Why: US clinics, lowest latency │
└──────────────────────────────────────────────┘
Global Routing:
DynamoDB Global Table (replicated across regions)
├── organization_id → region + shard assignment
├── Cached in Redis (1 min TTL)
└── Cloudflare Workers routes to correct region
Phase 4 cost estimate
Shared Tier (8 shards across 2 regions):
8 × App Runner + RDS: ~$3,600/month
Enterprise Tier (50 dedicated):
50 × $225: ~$11,250/month
Global routing (DynamoDB): ~$50/month
Cross-region data transfer: ~$100/month
NAT Gateways (2 regions): ~$70/month
AWS Infrastructure Total: ~$15,070/month
Revenue:
150 shared × $150: $22,500/month
50 enterprise × $1,500: $75,000/month
Total Revenue: $97,500/month
Gross Margin: 85% ($82,430/month)
Cost summary across all phases
| Phase | Clinics | AWS infra | External services | Total monthly | vs Railway+Neon path |
|---|---|---|---|---|---|
| 1 | 1-10 | ~$163-188 | ~$310-350 | ~$473-538 | Same cost, way better guarantees |
| 2 | 10-50 | ~$755-775 | ~$450-520 | ~$1,205-1,295 | ~$200 more, but no Neon migration needed |
| 3 | 50-100 | ~$3,645 | ~$500 | ~$4,145 | Similar |
| 4 | 100-1000+ | ~$15,070 | ~$500 | ~$15,570 | Similar |
Phase 1 is the same cost as the Railway+Neon path. The difference: you get published uptime SLAs, HIPAA BAA, private networking, and zero provider migrations in the future. At Phase 2+, costs are comparable because both paths use similar RDS infrastructure.
AWS account setup guide
This section is for a solo developer who is not a DevOps engineer. Every step is explicit.
Step 1: AWS account
If you already have an AWS account (you do — for S3), skip to Step 2.
Otherwise:
- Go to aws.amazon.com → Create an AWS Account
- Use a business email, not personal
- Add a payment method
- Select the "Business" support plan (starts at $100/month — worth it for production)
Step 2: Secure the account
1. Enable MFA on root account:
AWS Console → IAM → Security credentials → Assign MFA device
2. Create an IAM user for daily work (never use root):
IAM → Users → Create user
Name: your-name-admin
Attach policy: AdministratorAccess
Enable console access + programmatic access
3. Create a deploy user (for GitHub Actions):
IAM → Users → Create user
Name: github-actions-deploy
Attach policies:
- AmazonEC2ContainerRegistryPowerUser (push images to ECR)
- AWSAppRunnerFullAccess (manage App Runner)
Programmatic access only (no console)
Save the Access Key ID and Secret Access Key → add to GitHub Secrets
Step 3: Create the VPC
Follow the "Setup: creating the VPC" instructions in the VPC section above. One wizard, one click, done.
Step 4: Create RDS PostgreSQL
AWS Console → RDS → Create database
Settings:
Engine: PostgreSQL 17
Template: Production
Instance: db.t4g.medium
Storage: 50 GB gp3, enable auto-scaling (max 200 GB)
Multi-AZ: Yes
VPC: restartix-prod
Subnet group: Create new (select both private subnets)
Public access: No
Security group: restartix-rds
Database name: restartix
Master username: restartix_admin
Master password: (generate a strong one, save in Secrets Manager)
Backup retention: 7 days
Encryption: Enabled
Enhanced monitoring: Enabled
Performance Insights: Enabled
Click "Create database" → wait ~10 minutes
Step 5: Create ElastiCache Redis
AWS Console → ElastiCache → Create Redis cluster
Settings:
Cluster mode: Disabled
Node type: cache.t4g.micro
Number of replicas: 0 (Phase 1, ephemeral data)
Subnet group: Create new (select private subnets from restartix-prod VPC)
Security group: restartix-redis
Encryption in transit: Yes
Encryption at rest: Yes
Click "Create" → wait ~5 minutes
Step 6: Create ECR repositories
AWS Console → ECR → Create repository
Repository 1: restartix-core-api
- Visibility: Private
- Image tag mutability: Mutable
- Encryption: AES-256
Repository 2: restartix-telemetry-api
- Same settings
Enable lifecycle policy (clean up old images):
Rule: Delete untagged images older than 30 days
Rule: Keep only last 10 tagged images
Step 7: Create App Runner services
AWS Console → App Runner → Create service
Service 1: restartix-core-api
Source: Container registry → Amazon ECR
Image: <account-id>.dkr.ecr.eu-central-1.amazonaws.com/restartix-core-api:latest
Deployment: Automatic (deploy on new image push)
Port: 9000 (must match the PORT env var in the service configuration)
Instance configuration:
CPU: 1 vCPU
Memory: 2 GB
Auto scaling:
Min instances: 1
Max instances: 5
Max concurrency: 100
Health check:
Path: /health
Protocol: HTTP
Networking:
VPC Connector: restartix-vpc-connector (created in Step 3)
Environment variables:
(Add all from the "App Runner service configuration" section above)
For secrets: Use "Reference a secret" → select from Secrets Manager
Service 2: restartix-telemetry-api
Same process, different image, port (4000), and smaller instance (0.5 vCPU, 1 GB)
Step 8: Set up custom domain
In App Runner service → Custom domains → Link domain
Domain: api.restartix.com
App Runner provides:
CNAME record: xxx.acm-validations.aws
CNAME target: xxx.eu-central-1.awsapprunner.com
In Cloudflare DNS:
Add CNAME record:
Name: api
Target: (the value from App Runner)
Proxy: Yes (orange cloud)
Step 9: Set up monitoring and billing alerts
AWS Console → CloudWatch → Alarms → Create alarm
Create the alarms listed in the "Monitoring and observability" section.
For notifications:
SNS → Create topic: restartix-alerts
Add subscription: your email
(Optional: Add Slack webhook via AWS Chatbot)
AWS Console → Billing → Budgets → Create budget
Budget 1: Monthly spend
Amount: $200 (adjust for your phase)
Alert at: 80% and 100%
Notify: your email
Database migration: Neon to RDS
Before migration
- Set up the full AWS infrastructure (Steps 1-9 above)
- Verify App Runner services are healthy on the default AWS URLs
- Run database migrations on the new RDS instance
- Schedule a maintenance window (communicate to clinics — expect ~30 minutes)
Migration steps
1. Put application in maintenance mode (return 503 for all requests)
2. Dump the Neon database:
pg_dump $NEON_DATABASE_URL --format=custom --no-owner > restartix-dump.pgdump
3. Restore to RDS:
pg_restore --host=restartix-prod.xxx.rds.amazonaws.com \
--username=restartix_admin \
--dbname=restartix \
--no-owner \
--verbose \
restartix-dump.pgdump
4. Verify row counts match:
Run SELECT count(*) FROM <table> on both databases for key tables
5. Update Secrets Manager:
Change DATABASE_URL to the RDS endpoint
6. Restart App Runner services (they pick up the new secret)
7. Switch Cloudflare DNS to point to App Runner (if not already done)
8. Remove maintenance mode
9. Test critical flows: login, create appointment, view patients
10. Monitor for 24 hours
Rollback plan
If anything goes wrong:
- Change DATABASE_URL back to Neon in Secrets Manager
- Restart App Runner services
- Traffic returns to Neon within minutes
- Investigate at your own pace
Keep Neon running for 7 days after migration. Then cancel.
Backup strategy on AWS
The existing Backup & Disaster Recovery strategy remains the same with RDS replacing Neon as the primary database.
What changes
| Layer | Before (Neon) | After (RDS) |
|---|---|---|
| Layer 0: Live DB | Neon serverless | RDS db.t4g.medium (dedicated) |
| Layer 1: Vendor backups | Neon PITR (7-30 days depending on plan) | RDS automated backups (7-day PITR, 5-minute RPO) |
| Layer 2: Daily backups | pg_dump → S3 | Same — pg_dump → S3 (vendor independence) |
| Layer 3: Cross-region | S3 replication | Same — S3 cross-region replication |
| Layer 4: Offline | Physical media | Same |
RDS backup advantages
- Automated continuous backup — AWS handles it, runs in the background
- Point-in-time recovery — restore to any second in the last 7 days
- Recovery point objective — 5 minutes (vs 24 hours with daily pg_dump alone)
- Automated snapshots — daily, retained for 7 days
- Manual snapshots — before migrations or risky changes, retained indefinitely
- Cross-region replication — native, for disaster recovery in Phase 4
- Multi-AZ — standby replica auto-promoted if primary fails
Daily pg_dump to S3 (Layer 2) continues as an independent safety layer — vendor independence matters even when the vendor is AWS.
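That Layer 2 job can be a short cron script. A sketch, assuming the bucket names above and an environment that already holds DATABASE_URL — the daily/ key prefix is invented:

```shell
#!/bin/sh
# Daily logical backup to S3 (Layer 2) — hypothetical key prefix.
set -eu
STAMP=$(date -u +%Y-%m-%dT%H%M)
# Stream the dump straight to S3; --sse requests server-side AES-256 encryption.
pg_dump "$DATABASE_URL" --format=custom --no-owner \
  | aws s3 cp - "s3://restartix-backups-primary/daily/restartix-$STAMP.pgdump" --sse AES256
```

Streaming through a pipe avoids needing local disk for the dump, which keeps the job runnable from a small Lambda container or cron host.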
Security and compliance on AWS
HIPAA compliance
1. Enable HIPAA eligibility on AWS account:
AWS Console → AWS Artifact → Accept the AWS BAA
(This is free — just a legal agreement)
2. HIPAA-eligible services we use:
✅ App Runner
✅ ECR
✅ RDS
✅ ElastiCache
✅ S3
✅ Secrets Manager
✅ CloudWatch
✅ IAM
✅ VPC
3. Encryption requirements (all met):
✅ Data at rest: RDS (AES-256), ElastiCache (AES-256), S3 (SSE-S3)
✅ Data in transit: TLS 1.2+ everywhere (enforced by security groups + config)
✅ Secrets: AWS Secrets Manager (AES-256)
✅ Database: VPC-private, no public access
IAM roles and least privilege
Role: AppRunnerInstanceRole
Used by: App Runner services
Permissions:
- secretsmanager:GetSecretValue (restartix/production/*)
- s3:PutObject, s3:GetObject (restartix-uploads-prod/*)
- ecr:GetDownloadUrlForLayer, ecr:BatchGetImage
Role: GitHubActionsDeployRole
Used by: GitHub Actions CI/CD
Permissions:
- ecr:PutImage, ecr:InitiateLayerUpload, ecr:CompleteLayerUpload
- apprunner:UpdateService (if manual trigger needed)
Role: BackupJobRole
Used by: Backup automation (Lambda or cron)
Permissions:
- s3:PutObject (restartix-backups-primary/*)
- secretsmanager:GetSecretValue (backup encryption key only)
- rds:CreateDBSnapshot (for manual pre-migration snapshots)
Principle: Each role can only do exactly what it needs. Nothing more.
Network security
Security Group: restartix-rds
Inbound:
- Port 5432 from restartix-apprunner-connector security group only
- No public access. Not from your laptop. Not from anywhere else.
Outbound:
- None needed
Security Group: restartix-redis
Inbound:
- Port 6379 from restartix-apprunner-connector security group only
Outbound:
- None needed
Security Group: restartix-apprunner-connector
Inbound:
- None (App Runner initiates connections, doesn't receive them here)
Outbound:
- Port 5432 to restartix-rds (database)
- Port 6379 to restartix-redis (cache)
- Port 443 to 0.0.0.0/0 (HTTPS to Clerk, Daily.co, S3, etc. via NAT Gateway)
Result:
Database and Redis are completely invisible to the internet.
Only your App Runner services can reach them.
Your App Runner services can reach external APIs through the NAT Gateway.
What you manage day-to-day
After the one-time setup, here is everything you need to do on an ongoing basis:
| Task | How | Frequency |
|---|---|---|
| Deploy code | git push to main | Whenever you ship |
| Check if deploy succeeded | GitHub Actions tab or CloudWatch | After each push |
| View application logs | CloudWatch → Log groups | When debugging |
| Check service health | App Runner console → service status | Glance weekly |
| Check database health | RDS → Performance Insights | Glance weekly |
| Review AWS bill | Billing dashboard | Monthly |
| Respond to alarms | Email/Slack notification → investigate | When they fire |
| Rotate secrets | Secrets Manager → update value | Yearly (or when compromised) |
| Update Docker base image | Change 1 line in Dockerfile, git push | Every few months |
| Resize RDS instance | Console → Modify → pick larger instance | When Phase 2 triggers hit |
What you never do:
- Patch servers (App Runner is serverless, RDS is managed)
- Renew SSL certificates (automatic)
- Scale App Runner instances up or down (auto-scaling)
- Manage load balancers (App Runner handles it)
- Run database backups (RDS automated backups)
- Manage VPC/networking (set once, never touch again)
Related documentation
- Scaling Plan — Growth phases and trigger thresholds
- Backup & Recovery — Full backup strategy with 3-2-1-1 rule
- Key Decisions — Why Go, why PostgreSQL, why AWS
- Monitoring — Alerting and incident response
- Scaling Architecture — Connection math and database optimization