Scaling Architecture & Infrastructure Strategy
Overview
RestartiX Platform is a multi-tenant healthcare SaaS. This document outlines the infrastructure evolution from Phase 1 (a few organizations, ~10k patients each) through Phase 4 (1,000+ organizations, millions of patients).
Key Constraint: Our RLS (Row-Level Security) pattern requires holding a dedicated PostgreSQL connection for the entire request duration. This is the correct approach for security but creates connection pool pressure that shapes our scaling strategy.
Core Architecture Pattern
Request Flow with RLS
1. Request arrives → ClerkAuth middleware validates session
2. OrganizationContext middleware:
- Acquires dedicated connection from pool
- Sets RLS session variables: app.current_user_id, app.current_org_id, app.current_role
- Stores connection in request context
3. All queries use this connection (via database.ConnFromContext)
4. Connection released when request completes

Why this matters for scaling:
- Connection held for entire request duration (not just query time)
- Slow request = connection blocked for seconds
- DB_POOL_MAX must be >= peak concurrent requests
- PostgreSQL max_connections limit becomes the critical bottleneck
Phase 1: Single Shared Instance (1-10 Organizations)
Target Profile
- Organizations: 1-10
- Total patients: 10k-100k
- Database size: 10-50GB
- Peak concurrent requests: 50-100
- Timeline: Launch → First 12 months
Infrastructure
AWS App Runner Services
├── Core API
│ ├── Instances: 1-3 (auto-scale on concurrent requests)
│ ├── RAM per instance: 512MB-1GB
│ └── Connection pool: DB_POOL_MAX=30
│
├── Telemetry API
│ ├── Instances: 1-2
│ ├── RAM per instance: 512MB
│ └── Connection pool: DB_POOL_MAX=15
│
├── AWS RDS PostgreSQL
│ ├── Instance: db.t4g.medium (2 vCPU, 4GB RAM)
│ ├── Storage: 50GB gp3
│ ├── Max connections: 200
│ ├── Backups: Automated daily, 7-day retention + PITR
│ └── HIPAA: BAA accepted via AWS Artifact
│
├── AWS ElastiCache Redis
│ ├── Instance: cache.t4g.micro (1GB RAM)
│ └── Usage: Rate limiting, sessions, webhook idempotency
│
└── VPC + VPC Connector
└── App Runner connects to RDS/Redis via private VPC
Cloudflare (Edge)
├── DDoS protection
├── WAF + OWASP rulesets
├── TLS termination
└── Edge rate limiting: 1000 req/min per IP
AWS S3
├── Bucket: restartix-uploads-prod
├── Encryption: SSE-S3
└── Lifecycle: Archive to Glacier after 1 year

For the full AWS infrastructure setup, see AWS Infrastructure.
Connection Math
Total PostgreSQL connections needed:
- Core API: 3 instances × 20 pool = 60 connections
- Telemetry API: 2 instances × 15 pool = 30 connections
- Background jobs: 5 connections
- PgAdmin/monitoring: 5 connections
────────────────────────────────────────────────
Total: 100 connections
PostgreSQL max_connections: 200
Headroom: 100 connections (50%)

Expected Performance
| Metric | Target | Alert Threshold |
|---|---|---|
| API response time (p99) | < 500ms | > 1s |
| Database active connections | 50-100 | > 160 (80% of max) |
| Connection pool wait count | 0 | > 0 for 1+ min |
| Database size | < 50GB | > 80GB |
| Error rate (5xx) | < 0.1% | > 1% |
RDS Configuration (Production)
Critical settings:
# 1. Enable automated backups with PITR
aws rds modify-db-instance \
--db-instance-identifier restartix-prod \
--backup-retention-period 7
# 2. Enable encryption at rest (set at creation)
# --storage-encrypted flag on create-db-instance
# 3. Enable Multi-AZ for high availability
aws rds modify-db-instance \
--db-instance-identifier restartix-prod \
--multi-az
# 4. Verify connection via VPC Connector
# App Runner services connect to RDS through the private VPC

HIPAA Compliance:
# Accept the BAA via AWS Artifact (free; covers RDS, S3, ElastiCache, and other eligible services)
# RDS encrypts data at rest when --storage-encrypted is set at instance creation
# No additional enterprise plan is needed for the BAA

Cost Estimate
AWS App Runner:
- Core API (3 instances): ~$45/mo
- Telemetry API (2 instances): ~$30/mo
App Runner Total: ~$75/mo
AWS RDS PostgreSQL:
- db.t4g.medium instance: ~$60/mo
- Storage (50GB gp3): ~$10/mo
- Automated backups: included
RDS Total: ~$70/mo
AWS ElastiCache Redis:
- cache.t4g.micro: ~$15/mo
AWS Secrets Manager: ~$5/mo
AWS S3: ~$10-50/mo (depends on upload volume)
Cloudflare: Free tier (sufficient for Phase 1)
Clerk: $200/mo (Enterprise with HIPAA BAA)
Daily.co: ~$100/mo
TOTAL: ~$475-515/month
Cost per patient: ~$0.06-0.62/patient/year (depends on patient count)

When to Graduate to Phase 2
Hard limits (must upgrade):
- ✅ PostgreSQL active connections consistently > 150
- ✅ Database size > 100GB
- ✅ API p99 latency > 1s
- ✅ Any organization > 50k patients
Soft signals (plan upgrade):
- ✅ More than 10 organizations
- ✅ Enterprise customer requiring SLA guarantees
- ✅ Connection pool wait events occurring
Phase 2: Shared Tier + Read Replicas (10-50 Organizations)
Target Profile
- Organizations: 10-50
- Total patients: 100k-500k
- Database size: 50-250GB
- Peak concurrent requests: 200-500
- Timeline: Months 12-24
Key Change: Add Read Replicas
Problem: Write queries (mutations) + read queries (GET requests) compete for the same connection pool.
Solution: Route read-only queries to read replicas.
Infrastructure Evolution: RDS Read Replicas
AWS App Runner Services (Core API + Telemetry)
↓ (via VPC Connector)
AWS RDS PostgreSQL
├── Primary (read-write)
│ ├── Instance: db.r6g.large (2 vCPU, 16GB RAM)
│ ├── Storage: 250GB gp3 (10k IOPS)
│ ├── max_connections: 500
│ └── Endpoint: restartix-prod.xxxxx.rds.amazonaws.com
│
├── Read Replica 1 (read-only, same spec as primary)
│ └── Endpoint: restartix-prod-replica1.xxxxx.rds.amazonaws.com
│
└── Read Replica 2 (read-only, same spec as primary)
└── Endpoint: restartix-prod-replica2.xxxxx.rds.amazonaws.com
Connection Routing:
- Mutations (POST/PUT/PATCH/DELETE) → Primary
- Queries (GET) → Round-robin across read replicas
- RLS still works (session vars set on all connections)
Cost: ~$600-800/mo

Read Replica Routing Middleware
// internal/middleware/read_replica.go
type DatabaseRouter struct {
primary *pgxpool.Pool
replicas []*pgxpool.Pool
rr uint32 // round-robin counter (atomic)
}
func (dr *DatabaseRouter) AcquireConnection(ctx context.Context, r *http.Request) (*pgxpool.Conn, error) {
// Mutations always use the primary
if r.Method != http.MethodGet {
return dr.primary.Acquire(ctx)
}
// Fall back to the primary when no replicas are configured (Phase 1)
if len(dr.replicas) == 0 {
return dr.primary.Acquire(ctx)
}
// Reads use round-robin across replicas
idx := atomic.AddUint32(&dr.rr, 1) % uint32(len(dr.replicas))
return dr.replicas[idx].Acquire(ctx)
}

Connection Math (with replicas)
Primary (write path):
- Mutations: ~30% of requests
- Core API: 3 instances × 10 pool = 30 connections
- Telemetry API: 2 instances × 5 pool = 10 connections
Total primary: 40 connections

Read Replicas 1 + 2 (read path):
- Queries: ~70% of requests
- Core API: 3 instances × 10 pool × 2 DB replicas = 60 connections
- Telemetry API: 2 instances × 10 pool × 2 DB replicas = 40 connections
Total per replica: (60 + 40) / 2 = 50 connections

max_connections: 500 per instance (plenty of headroom)

Setup Procedure: Add RDS Read Replicas
# Step 1: Create read replicas via AWS CLI
aws rds create-db-instance-read-replica \
--db-instance-identifier restartix-prod-replica1 \
--source-db-instance-identifier restartix-prod
aws rds create-db-instance-read-replica \
--db-instance-identifier restartix-prod-replica2 \
--source-db-instance-identifier restartix-prod
# Step 2: Wait for replicas to become available
aws rds wait db-instance-available \
--db-instance-identifier restartix-prod-replica1
aws rds wait db-instance-available \
--db-instance-identifier restartix-prod-replica2
# Step 3: Get connection endpoints
aws rds describe-db-instances \
--db-instance-identifier restartix-prod \
--query 'DBInstances[0].Endpoint'
aws rds describe-db-instances \
--db-instance-identifier restartix-prod-replica1 \
--query 'DBInstances[0].Endpoint'
aws rds describe-db-instances \
--db-instance-identifier restartix-prod-replica2 \
--query 'DBInstances[0].Endpoint'
# Step 4: Update environment variables in AWS Secrets Manager
aws secretsmanager update-secret \
--secret-id restartix-prod/env \
--secret-string '{"DATABASE_URL":"postgres://restartix_admin:[email protected]/restartix","DATABASE_REPLICA_1_URL":"postgres://restartix_admin:[email protected]/restartix","DATABASE_REPLICA_2_URL":"postgres://restartix_admin:[email protected]/restartix",...}'
# Step 5: Deploy with read replica support
git push origin main # GitHub Actions handles ECR build + App Runner deploy
# Step 6: Verify replication lag < 1s
# On read replica:
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;
# Expected: < 1 second

Cost Estimate
AWS App Runner:
- Core API (5 instances): $75/mo
- Telemetry API (3 instances): $45/mo
App Runner Total: $120/mo
AWS RDS PostgreSQL:
- Primary (db.r6g.large): $200/mo
- Read Replica 1: $200/mo
- Read Replica 2: $200/mo
- Backup storage (50GB): $10/mo
RDS Total: $610/mo
AWS ElastiCache Redis: ~$30/mo
AWS Secrets Manager: ~$5/mo
AWS S3: ~$50-100/mo
Cloudflare: Free or $20/mo (Pro)
Clerk: $200/mo
Daily.co: ~$200/mo (more orgs)
TOTAL: ~$1,215-1,285/month
Cost per patient: ~$0.03/patient/year (at 500k patients)

When to Graduate to Phase 3
Hard limits:
- ✅ More than 50 organizations
- ✅ Database size > 500GB
- ✅ Any single organization > 100k patients
- ✅ Enterprise customer requesting dedicated infrastructure
Soft signals:
- ✅ Primary database CPU consistently > 70%
- ✅ Replication lag > 5 seconds
- ✅ Customer complaints about "noisy neighbor" performance
Phase 3: Hybrid Model (50-100 Organizations)
Target Profile
- Shared tier: 90 small/medium organizations (< 10k patients each)
- Enterprise tier: 10 large organizations (50k+ patients each)
- Total patients: 500k-1M
- Timeline: Months 24-36
Architecture: Two-Tier System
┌─────────────────────────────────────────────────────────────┐
│ SHARED TIER (90 organizations) │
├─────────────────────────────────────────────────────────────┤
│ AWS App Runner: restartix-shared │
│ ├── Core API (5 instances) │
│ ├── Telemetry API (3 instances) │
│ └── ElastiCache Redis │
│ │
│ AWS RDS: restartix-shared-cluster │
│ ├── Primary (db.r6g.xlarge: 4 vCPU, 32GB RAM) │
│ ├── Read Replica 1 │
│ └── Read Replica 2 │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ ENTERPRISE TIER (10 organizations, dedicated services) │
├─────────────────────────────────────────────────────────────┤
│ AWS App Runner: restartix-org-101 │
│ ├── Core API (2 instances) │
│ ├── Telemetry API (1 instance) │
│ └── ElastiCache Redis (dedicated) │
│ │
│ AWS RDS: restartix-org-101 │
│ └── Primary (db.r6g.large: 2 vCPU, 16GB RAM) │
│ │
│ [Repeat for each enterprise org...] │
└─────────────────────────────────────────────────────────────┘

Tenant Assignment Logic
Shared Tier (default):
- Organizations with < 50k patients
- No SLA guarantees (best-effort)
- Pricing: $99-199/month per org
Enterprise Tier (opt-in):
- Organizations with > 50k patients OR
- Organizations requiring custom SLA OR
- Regulatory requirement for data isolation (rare)
- Pricing: $999-2,999/month per org
Tenant Routing Table
-- Stored in a separate "routing database" (tiny PostgreSQL instance)
CREATE TABLE tenant_shards (
organization_id BIGINT PRIMARY KEY,
shard_type TEXT NOT NULL, -- 'shared' or 'enterprise'
shard_id TEXT NOT NULL, -- 'shared' or 'org-101', 'org-102', etc.
-- Connection details
database_primary_url TEXT NOT NULL,
database_replica_urls TEXT[], -- Array of replica URLs
-- Metadata
max_patients INT,
sla_tier TEXT, -- 'standard', 'enterprise'
billing_tier TEXT, -- 'shared-99', 'enterprise-999', etc.
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- Example data
INSERT INTO tenant_shards (organization_id, shard_type, shard_id, database_primary_url) VALUES
(1, 'shared', 'shared', 'postgres://...restartix-shared-cluster...'),
(2, 'shared', 'shared', 'postgres://...restartix-shared-cluster...'),
-- ... 90 orgs on shared
(101, 'enterprise', 'org-101', 'postgres://...restartix-org-101...'),
(102, 'enterprise', 'org-102', 'postgres://...restartix-org-102...');

Routing Middleware
// internal/middleware/tenant_router.go
type TenantRouter struct {
routingDB *pgxpool.Pool // Tiny DB with tenant_shards table
redis *redis.Client // Cache for shard assignments
pools map[string]*DatabaseRouter // shard_id → pool(s)
mu sync.RWMutex
}
func (tr *TenantRouter) GetPool(ctx context.Context, orgID int64) (*DatabaseRouter, error) {
// 1. Look up the shard assignment (cached in Redis, TTL 5 min)
cacheKey := fmt.Sprintf("shard:%d", orgID)
if cached, err := tr.redis.Get(ctx, cacheKey).Result(); err == nil {
tr.mu.RLock()
pool, ok := tr.pools[cached]
tr.mu.RUnlock()
if ok {
return pool, nil
}
// Cache hit but no pool yet (e.g. after restart) — fall through
}
// 2. Query the routing database
var shard struct {
ShardID string
DatabasePrimaryURL string
DatabaseReplicaURLs []string
}
err := tr.routingDB.QueryRow(ctx, `
SELECT shard_id, database_primary_url, database_replica_urls
FROM tenant_shards
WHERE organization_id = $1
`, orgID).Scan(&shard.ShardID, &shard.DatabasePrimaryURL, &shard.DatabaseReplicaURLs)
if err != nil {
return nil, fmt.Errorf("route org %d: %w", orgID, err)
}
// 3. Cache the assignment for 5 minutes
tr.redis.Set(ctx, cacheKey, shard.ShardID, 5*time.Minute)
// 4. Return the existing pool or create a new one
tr.mu.RLock()
pool, exists := tr.pools[shard.ShardID]
tr.mu.RUnlock()
if !exists {
tr.mu.Lock()
// Re-check under the write lock: a concurrent request may have
// created the pool while we were waiting
if pool, exists = tr.pools[shard.ShardID]; !exists {
pool = tr.createPoolForShard(shard)
tr.pools[shard.ShardID] = pool
}
tr.mu.Unlock()
}
return pool, nil
}
// Updated OrganizationContext middleware
func OrganizationContext(router *TenantRouter) func(http.Handler) http.Handler {
return func(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
user := UserFromContext(r.Context())
orgID := user.CurrentOrganizationID
// Get the correct database pool for this org
dbRouter, err := router.GetPool(r.Context(), orgID)
if err != nil {
slog.Error("failed to route tenant", "org_id", orgID, "error", err)
httputil.InternalError(w)
return
}
// Acquire connection from the correct shard
conn, err := dbRouter.AcquireConnection(r.Context(), r)
if err != nil {
slog.Error("failed to acquire connection", "error", err)
httputil.InternalError(w)
return
}
defer conn.Release()
// Set RLS context as before
database.SetRLSContext(r.Context(), conn, user.ID, orgID, user.Role)
ctx := context.WithValue(r.Context(), dbConnKey, conn)
next.ServeHTTP(w, r.WithContext(ctx))
})
}
}

Automated Enterprise Provisioning
// internal/jobs/provision_enterprise.go
// Simplified sketch: AWS SDK client setup and error handling omitted for brevity.
func ProvisionEnterpriseOrg(ctx context.Context, orgID int64) error {
// 1. Provision AWS RDS instance
rdsInstance := rds.CreateDBInstance(&rds.CreateDBInstanceInput{
DBInstanceIdentifier: aws.String(fmt.Sprintf("restartix-org-%d", orgID)),
DBInstanceClass: aws.String("db.r6g.large"),
Engine: aws.String("postgres"),
EngineVersion: aws.String("16.2"),
AllocatedStorage: aws.Int64(250),
StorageEncrypted: aws.Bool(true),
MultiAZ: aws.Bool(true),
})
// 2. Wait for RDS to be available
rds.WaitUntilDBInstanceAvailable(/* ... */)
// 3. Run migrations
runMigrations(rdsInstance.Endpoint)
// 4. Store secrets in AWS Secrets Manager
secretsManager.CreateSecret(&secretsmanager.CreateSecretInput{
Name: aws.String(fmt.Sprintf("restartix-org-%d/env", orgID)),
SecretString: aws.String(fmt.Sprintf(`{"ORGANIZATION_ID":"%d","DATABASE_URL":"%s"}`, orgID, rdsInstance.Endpoint.Address)),
})
// 5. Update tenant_shards table
routingDB.Exec(ctx, `
INSERT INTO tenant_shards (organization_id, shard_type, shard_id, database_primary_url)
VALUES ($1, 'enterprise', $2, $3)
`, orgID, fmt.Sprintf("org-%d", orgID), rdsInstance.Endpoint.Address)
// 6. Create App Runner services for the Core API + Telemetry
appRunner.CreateService(&apprunner.CreateServiceInput{
ServiceName: aws.String(fmt.Sprintf("restartix-org-%d-api", orgID)),
// ... ECR image, VPC connector, etc.
})
return nil
}

Cost Estimate
Shared Tier:
- App Runner (Core API + Telemetry): $120/mo
- AWS RDS (db.r6g.xlarge + 2 replicas): $1,200/mo
- ElastiCache Redis: $30/mo
Shared Total: $1,350/mo
Revenue: 90 orgs × $150/mo = $13,500/mo
Margin: $12,150/mo (90%)
Enterprise Tier (10 orgs):
- App Runner per org: $50/mo
- AWS RDS per org: $200/mo
- ElastiCache Redis per org: $15/mo
Per-org cost: $265/mo
Total: 10 × $265 = $2,650/mo
Revenue: 10 orgs × $1,500/mo = $15,000/mo
Margin: $12,350/mo (82%)
COMBINED TOTAL: $4,000/mo infrastructure cost
COMBINED REVENUE: $28,500/mo
GROSS MARGIN: $24,500/mo (86%)

When to Graduate to Phase 4
Hard limits:
- ✅ More than 100 organizations on shared tier
- ✅ Shared tier database size > 1TB
- ✅ More than 20 enterprise organizations (manual provisioning too slow)
Soft signals:
- ✅ Enterprise provisioning takes > 2 hours (needs automation)
- ✅ Support burden from shared tier "noisy neighbor" issues
- ✅ Regulatory requirement for geographic data residency (EU vs US shards)
Phase 4: Multi-Shard with Geographic Distribution (100-1000 Organizations)
Target Profile
- Shared tiers: 5-10 regional shards (20-50 orgs each)
- Enterprise tier: 50-100 dedicated projects
- Total patients: 1M-10M
- Geographic regions: US, EU, possibly APAC
- Timeline: Months 36+
Architecture: Geographic Sharding
┌──────────────────────────────────────────────────────────┐
│ US REGION │
├──────────────────────────────────────────────────────────┤
│ Shared Shard: US-1 (50 orgs) │
│ Shared Shard: US-2 (50 orgs) │
│ Enterprise: 30 dedicated projects │
└──────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│ EU REGION (GDPR data residency) │
├──────────────────────────────────────────────────────────┤
│ Shared Shard: EU-1 (30 orgs) │
│ Enterprise: 10 dedicated projects │
└──────────────────────────────────────────────────────────┘
Global Routing Table (DynamoDB or managed PostgreSQL)
├── Organization → Shard assignment
├── Shard → Database endpoints
└── Cached in Redis (1 min TTL)

Shard Assignment Strategy
Factors:
- Geographic preference (GDPR compliance)
- Organization size (large orgs → dedicated)
- Load balancing (distribute evenly across shared shards)
-- Enhanced tenant_shards table
ALTER TABLE tenant_shards ADD COLUMN region TEXT; -- 'us-east-1', 'eu-west-1'
ALTER TABLE tenant_shards ADD COLUMN shard_load_score INT; -- For load balancing
-- Shard assignment algorithm
CREATE OR REPLACE FUNCTION assign_shard(
p_org_id BIGINT,
p_region TEXT,
p_estimated_patients INT
) RETURNS TEXT AS $$
DECLARE
v_shard_id TEXT;
BEGIN
-- Enterprise tier if > 50k patients
IF p_estimated_patients > 50000 THEN
v_shard_id := 'org-' || p_org_id;
-- Trigger provisioning job
PERFORM pg_notify('provision_enterprise', p_org_id::text);
RETURN v_shard_id;
END IF;
-- Shared tier: find least-loaded shard in region
SELECT shard_id INTO v_shard_id
FROM shard_capacity
WHERE region = p_region
AND shard_type = 'shared'
AND current_orgs < max_orgs
AND current_patients + p_estimated_patients <= max_patients
ORDER BY current_orgs ASC
LIMIT 1;
-- If no capacity, create new shared shard
IF v_shard_id IS NULL THEN
v_shard_id := p_region || '-' || (
SELECT COUNT(*) + 1 FROM tenant_shards
WHERE region = p_region AND shard_type = 'shared'
);
PERFORM pg_notify('provision_shared_shard', v_shard_id);
END IF;
RETURN v_shard_id;
END;
$$ LANGUAGE plpgsql;

Cross-Shard Operations
Problem: Some queries need to aggregate across shards (analytics, superadmin dashboards).
Solutions:
Telemetry Service Aggregation
- Telemetry collects audit events from all shards
- ClickHouse aggregates across organizations
- Superadmin dashboard queries Telemetry, not individual shards
Federated Queries (when needed)
// internal/superadmin/cross_shard_query.go
func GetAllOrganizations(ctx context.Context) ([]*Organization, error) {
shards := getAllShards() // Query routing DB
results := make([]*Organization, 0)
errChan := make(chan error, len(shards))
resultChan := make(chan []*Organization, len(shards))
// Fan-out: query all shards in parallel
for _, shard := range shards {
go func(s Shard) {
rows, err := s.Pool.Query(ctx, "SELECT * FROM organizations")
if err != nil {
errChan <- err
return
}
orgs, err := pgx.CollectRows(rows, pgx.RowToAddrOfStructByName[Organization])
if err != nil {
errChan <- err
return
}
resultChan <- orgs
}(shard)
}
// Fan-in: collect results
for i := 0; i < len(shards); i++ {
select {
case orgs := <-resultChan:
results = append(results, orgs...)
case err := <-errChan:
return nil, err
}
}
return results, nil
}

Materialized Views in Routing DB
- Nightly ETL job aggregates key metrics from all shards
- Routing DB has an organization_summary table
- Fast reads, eventual consistency (good enough for dashboards)
Cost Estimate (100 orgs shared, 50 enterprise)
Shared Tier (5 shards × 20 orgs each):
- App Runner (5 service sets): 5 × $120 = $600/mo
- AWS RDS (5 × db.r6g.xlarge): 5 × $400 = $2,000/mo
- ElastiCache Redis (5 clusters): 5 × $30 = $150/mo
Shared Total: $2,750/mo
Revenue: 100 orgs × $150/mo = $15,000/mo
Enterprise Tier (50 dedicated service sets):
- Infrastructure: 50 × $265 = $13,250/mo
Revenue: 50 orgs × $1,500/mo = $75,000/mo
COMBINED TOTAL: $16,000/mo infrastructure cost
COMBINED REVENUE: $90,000/mo
GROSS MARGIN: $74,000/mo (82%)

Cross-Shard Patient Identity
The problem
patient_persons is the patient's portable profile — it has no organization_id. When every shard is scoped to organizations, where does this table live? And when a patient registered at Clinic A (Shard 1) walks into Clinic B (Shard 2), how does Shard 2 find them?
Architecture: Global Patient Registry + Home Shard
The solution splits into two parts:
- Patient data stays in a "home shard" — whichever shard the patient first registered in
- A lightweight registry in the routing DB enables cross-shard discovery
┌──────────────────────────────────────────────────────────────────────┐
│ ROUTING / GLOBAL DB │
│ (already exists for tenant_shards) │
├──────────────────────────────────────────────────────────────────────┤
│ │
│ tenant_shards patient_person_registry │
│ ┌──────────────────────┐ ┌───────────────────────────────┐ │
│ │ organization_id (PK) │ │ patient_person_id (PK) │ │
│ │ shard_id │ │ user_id (unique, nullable) │ │
│ │ database_primary_url │ │ phone_hash (indexed) │ │
│ │ ... │ │ email_hash (indexed, nullable)│ │
│ └──────────────────────┘ │ home_shard_id │ │
│ │ created_at │ │
│ └───────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────┐ ┌─────────────────────────────────┐
│ SHARD 1 │ │ SHARD 2 │
├─────────────────────────────┤ ├─────────────────────────────────┤
│ │ │ │
│ patient_persons │ │ patient_persons │
│ ┌───────────────────────┐ │ │ ┌───────────────────────┐ │
│ │ id: 99 │ │ │ │ (empty — patient 99 │ │
│ │ name: "Maria López" │ │ │ │ lives in Shard 1) │ │
│ │ dob, blood_type, ... │ │ │ └───────────────────────┘ │
│ └───────────────────────┘ │ │ │
│ │ │ patients (org-link) │
│ patients (org-link) │ │ ┌───────────────────────┐ │
│ ┌───────────────────────┐ │ │ │ org_id: Clinic B │ │
│ │ org_id: Clinic A │ │ │ │ patient_person_id: 99 │ │
│ │ patient_person_id: 99 │ │ │ │ profile_shared: false │ │
│ │ profile_shared: true │ │ │ └───────────────────────┘ │
│ └───────────────────────┘ │ │ │
│ │ │ appointments, forms, files... │
│ appointments, forms, ... │ │ (Clinic B's clinical records) │
│ (Clinic A's clinical data) │ │ │
└─────────────────────────────┘ └─────────────────────────────────┘

Key principle: The full patient_persons row stays in the home shard. Other shards reference it by ID and fetch on demand. Clinical records (appointments, forms, files) always stay local to the shard where they were created.
Global patient registry schema
-- Stored in the routing database (same DB as tenant_shards)
-- This is a lightweight lookup table, NOT the full patient profile
CREATE TABLE patient_person_registry (
patient_person_id BIGINT PRIMARY KEY,
user_id BIGINT UNIQUE, -- for login-based lookup
phone_hash TEXT, -- SHA-256 of normalized phone
email_hash TEXT, -- SHA-256 of email (if available)
home_shard_id TEXT NOT NULL, -- shard where patient_persons row lives
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_patient_registry_phone ON patient_person_registry (phone_hash);
CREATE INDEX idx_patient_registry_email ON patient_person_registry (email_hash)
WHERE email_hash IS NOT NULL;
CREATE INDEX idx_patient_registry_user ON patient_person_registry (user_id)
WHERE user_id IS NOT NULL;

This table stores hashes only — no PII. It answers one question: "Does this patient exist, and which shard holds their profile?"
Cross-shard registration flow
When a patient registered at Clinic A (Shard 1) registers at Clinic B (Shard 2):
Patient walks into Clinic B, gives phone number
│
├── 1. DISCOVER: Query routing DB
│ SELECT home_shard_id, patient_person_id
│ FROM patient_person_registry
│ WHERE phone_hash = sha256(normalize(phone))
│ → Found: patient_person_id=99, home_shard=shard_1
│
├── 2. FETCH: Read portable profile from home shard
│ Shard 1 → SELECT name, date_of_birth, blood_type, ...
│ FROM patient_persons WHERE id = 99
│ → Single row by PK (sub-millisecond)
│
├── 3. LINK: Create org-patient link in Shard 2
│ INSERT INTO patients (organization_id, patient_person_id, profile_shared)
│ VALUES (clinic_b_id, 99, false)
│ → Consent pending, Clinic B sees name only
│
└── 4. CONSENT: Normal consent flow
Patient signs consent form at Clinic B
→ profile_shared = true
→ Clinic B staff now sees full portable profile

Implementation
// internal/patients/cross_shard.go
// DiscoverPatient checks the global registry for an existing patient identity.
// Returns the patient_person_id and home shard, or empty if new patient.
func DiscoverPatient(ctx context.Context, routingDB *pgxpool.Pool, phone string) (*PatientRef, error) {
phoneHash := sha256Hex(normalizePhone(phone))
var ref PatientRef
err := routingDB.QueryRow(ctx, `
SELECT patient_person_id, home_shard_id
FROM patient_person_registry
WHERE phone_hash = $1
`, phoneHash).Scan(&ref.PatientPersonID, &ref.HomeShardID)
if errors.Is(err, pgx.ErrNoRows) {
return nil, nil // New patient — no existing identity
}
return &ref, err
}
// FetchPortableProfile reads the patient_persons row from the home shard.
// Used when a patient registers at a clinic on a different shard.
func FetchPortableProfile(ctx context.Context, shards *TenantRouter, ref *PatientRef) (*PatientPerson, error) {
pool, err := shards.GetPoolByShard(ctx, ref.HomeShardID)
if err != nil {
return nil, fmt.Errorf("resolve home shard %s: %w", ref.HomeShardID, err)
}
var p PatientPerson
err = pool.QueryRow(ctx, `
SELECT id, name, date_of_birth, sex, blood_type, allergies, chronic_conditions,
emergency_contact_name, insurance_entries
FROM patient_persons
WHERE id = $1
`, ref.PatientPersonID).Scan(
&p.ID, &p.Name, &p.DateOfBirth, &p.Sex, &p.BloodType, &p.Allergies,
&p.ChronicConditions, &p.EmergencyContactName, &p.InsuranceEntries,
)
return &p, err
}
// RegisterPatientAtOrg creates the org-patient link in the current shard
// and registers in the global registry if this is a new patient.
func RegisterPatientAtOrg(ctx context.Context, routingDB *pgxpool.Pool, localConn *pgxpool.Conn, req RegisterRequest) error {
ref, err := DiscoverPatient(ctx, routingDB, req.Phone)
if err != nil {
return err
}
if ref == nil {
// New patient — create patient_persons in current shard + register globally
var personID int64
err = localConn.QueryRow(ctx, `
INSERT INTO patient_persons (name, date_of_birth, sex, phone_encrypted)
VALUES ($1, $2, $3, $4)
RETURNING id
`, req.Name, req.DOB, req.Sex, encrypt(req.Phone)).Scan(&personID)
if err != nil {
return err
}
// Register in global lookup
_, err = routingDB.Exec(ctx, `
INSERT INTO patient_person_registry (patient_person_id, phone_hash, home_shard_id)
VALUES ($1, $2, $3)
`, personID, sha256Hex(normalizePhone(req.Phone)), req.CurrentShardID)
if err != nil {
return err
}
ref = &PatientRef{PatientPersonID: personID, HomeShardID: req.CurrentShardID}
}
// Create org-patient link in current shard
_, err = localConn.Exec(ctx, `
INSERT INTO patients (organization_id, patient_person_id, profile_shared)
VALUES ($1, $2, false)
`, req.OrgID, ref.PatientPersonID)
return err
}

When to introduce this
| Phase | What changes |
|---|---|
| 1-2 | Nothing. Single database, patient_persons and patients are in the same DB. The two-table split already exists for the consent model. |
| 3 | Add patient_person_registry to the routing DB. Populate it from existing patient_persons rows (one-time migration). Cross-shard reads only needed for enterprise-tier orgs on separate shards. |
| 4 | Same pattern, more shards. Add Redis cache (5 min TTL) for registry lookups and fetched profiles if latency matters. |
Profile reads at the application layer
When Clinic B (Shard 2) needs to display a patient whose profile lives in Shard 1, the read path is:
GET /v1/patients/:id
│
├── 1. Local query (Shard 2): patients + patient_person_id
│
├── 2. Is patient_person_id in this shard's patient_persons table?
│ YES → local join (fast, normal path)
│ NO → cross-shard fetch via home_shard_id from registry
│
└── 3. Apply profile_shared gate as usual
→ profile_shared=false: return name only
→ profile_shared=true: return full profile

The cross-shard fetch is a single row by primary key — sub-millisecond on the remote shard. For frequently accessed profiles, cache in Redis with a short TTL (5 minutes). Invalidate on patient profile update via a pub/sub event from the home shard.
What does NOT cross shards
Clinical records never leave their shard. This is enforced architecturally, not just by policy:
- Appointments created at Clinic B → stored in Shard 2, scoped to Clinic B's
organization_id - Forms filled at Clinic B → stored in Shard 2
- Files uploaded at Clinic B → stored in Shard 2's S3 prefix
- Treatment plans from Clinic B → stored in Shard 2
Shard 1 has no mechanism to query Shard 2's clinical records, and vice versa. The only cross-shard data flow is the portable profile read described above.
Database Optimization Strategies
1. Audit Log Partitioning (Critical at Scale)
Problem: audit_log table grows to billions of rows. Queries slow down, backups take hours.
Solution: Monthly partitions with automated archival.
-- Migration: Convert audit_log to partitioned table
CREATE TABLE audit_log_new (
id BIGSERIAL,
organization_id BIGINT,
user_id BIGINT,
action TEXT,
entity_type TEXT,
entity_id BIGINT,
changes JSONB,
ip_address INET,
user_agent TEXT,
request_path TEXT,
status_code INT,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
-- Partitioning columns
action_context TEXT,
break_glass_id BIGINT,
impersonation_id BIGINT,
PRIMARY KEY (id, created_at) -- Include partition key in PK
) PARTITION BY RANGE (created_at);
-- Create partitions for current + next 3 months
CREATE TABLE audit_log_2026_02 PARTITION OF audit_log_new
FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');
CREATE TABLE audit_log_2026_03 PARTITION OF audit_log_new
FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');
-- Automated partition management
CREATE OR REPLACE FUNCTION create_audit_log_partition()
RETURNS void AS $$
DECLARE
v_next_month DATE := date_trunc('month', CURRENT_DATE + interval '1 month');
v_partition_name TEXT := 'audit_log_' || to_char(v_next_month, 'YYYY_MM');
v_start_date TEXT := v_next_month::text;
v_end_date TEXT := (v_next_month + interval '1 month')::text;
BEGIN
EXECUTE format(
'CREATE TABLE IF NOT EXISTS %I PARTITION OF audit_log_new
FOR VALUES FROM (%L) TO (%L)',
v_partition_name, v_start_date, v_end_date
);
END;
$$ LANGUAGE plpgsql;
-- Run monthly via cron
SELECT cron.schedule('create-audit-partition', '0 0 1 * *', 'SELECT create_audit_log_partition()');

2. Index Optimization
-- Composite indexes for common query patterns
CREATE INDEX idx_audit_org_created ON audit_log_new (organization_id, created_at DESC);
CREATE INDEX idx_audit_entity ON audit_log_new (entity_type, entity_id, created_at DESC);
CREATE INDEX idx_audit_user ON audit_log_new (user_id, created_at DESC);
-- Partial indexes for break-glass and impersonation (rare events)
CREATE INDEX idx_audit_break_glass ON audit_log_new (break_glass_id)
WHERE break_glass_id IS NOT NULL;
-- GIN index for JSONB changes column (if querying change contents)
CREATE INDEX idx_audit_changes ON audit_log_new USING GIN (changes);

3. Connection Pool Tuning
// internal/config/config.go
type PoolConfig struct {
	DatabaseURL string
	// Phase 1: Conservative
	MinConns int `default:"2"`
	MaxConns int `default:"20"`
	// Phase 2+: Aggressive (with read replicas)
	// MinConns int `default:"5"`
	// MaxConns int `default:"50"`
	MaxConnLifetime time.Duration `default:"1h"`
	MaxConnIdleTime time.Duration `default:"30m"`
	// Health check interval for pruning idle/expired connections
	HealthCheckPeriod time.Duration `default:"1m"`
}
// internal/database/postgres.go
func NewPool(ctx context.Context, cfg *config.PoolConfig) (*pgxpool.Pool, error) {
	poolCfg, err := pgxpool.ParseConfig(cfg.DatabaseURL)
	if err != nil {
		return nil, err
	}
	poolCfg.MinConns = int32(cfg.MinConns)
	poolCfg.MaxConns = int32(cfg.MaxConns)
	poolCfg.MaxConnLifetime = cfg.MaxConnLifetime
	poolCfg.MaxConnIdleTime = cfg.MaxConnIdleTime
	poolCfg.HealthCheckPeriod = cfg.HealthCheckPeriod
	// Log each new connection; pool-level metrics come from pool.Stat()
	poolCfg.AfterConnect = func(ctx context.Context, conn *pgx.Conn) error {
		slog.Debug("connection established", "remote_addr", conn.PgConn().Conn().RemoteAddr())
		return nil
	}
	poolCfg.AfterRelease = func(conn *pgx.Conn) bool {
		// Return true to keep the connection in the pool; returning false
		// would destroy it. Idle and lifetime pruning is handled by the
		// pool itself via MaxConnIdleTime / MaxConnLifetime.
		return true
	}
	return pgxpool.NewWithConfig(ctx, poolCfg)
}

4. Query Performance Monitoring
// internal/middleware/query_logger.go
// Note: a pgx tracer must be registered on the pool config at startup
// (pgxpool.Config.ConnConfig.Tracer); mutating conn.Config() after a
// connection is established has no effect. Per-query timing therefore
// lives in the tracer below; this middleware times whole requests.
func QueryPerformanceMonitor(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		duration := time.Since(start)
		if duration > 1*time.Second {
			slog.Warn("slow request",
				"path", r.URL.Path,
				"method", r.Method,
				"duration_ms", duration.Milliseconds(),
				"request_id", middleware.RequestIDFromContext(r.Context()),
			)
		}
	})
}

type ctxKey string

const queryStartKey ctxKey = "query_start"

type queryTracer struct{}

func (qt *queryTracer) TraceQueryStart(ctx context.Context, conn *pgx.Conn, data pgx.TraceQueryStartData) context.Context {
	return context.WithValue(ctx, queryStartKey, time.Now())
}

func (qt *queryTracer) TraceQueryEnd(ctx context.Context, conn *pgx.Conn, data pgx.TraceQueryEndData) {
	start, ok := ctx.Value(queryStartKey).(time.Time)
	if !ok {
		return
	}
	if duration := time.Since(start); duration > 500*time.Millisecond {
		slog.Warn("slow query",
			"request_id", middleware.RequestIDFromContext(ctx), // query ctx descends from the request ctx
			"duration_ms", duration.Milliseconds(),
			"sql", data.SQL[:min(200, len(data.SQL))], // truncate long queries (min builtin: Go 1.21+)
		)
	}
}

Monitoring & Observability
Critical Metrics to Track
| Metric | Source | Alert Threshold | Phase |
|---|---|---|---|
| Connection pool waits | pool.Stat().EmptyAcquireCount() | increasing for 1+ min | All |
| Active DB connections | pool.Stat().AcquiredConns() | > 80% of max | All |
| Query latency (p99) | Query tracer | > 500ms | All |
| Request latency (p99) | Middleware | > 1s | All |
| Database size | pg_database_size() | > 80% of allocated | All |
| Replication lag | pg_last_xact_replay_timestamp() | > 5s | Phase 2+ |
| Error rate (5xx) | HTTP middleware | > 1% | All |
| Shard imbalance | Routing DB | Max shard load > 2× min | Phase 3+ |
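The two pool thresholds in the table can be evaluated directly from a pgxpool stat snapshot. A minimal sketch (the `PoolStats` struct, `PoolAlerts` helper, and thresholds are illustrative, not existing RestartiX code; in production the fields would be read from `pool.Stat()`):

```go
package main

import "fmt"

// PoolStats mirrors the fields read from pgxpool's pool.Stat() snapshot.
type PoolStats struct {
	AcquiredConns     int32
	MaxConns          int32
	EmptyAcquireCount int64 // cumulative count of acquires that had to wait
}

// PoolAlerts applies the table's thresholds: utilization above 80% of max,
// and any growth in the cumulative wait counter since the previous scrape.
func PoolAlerts(cur PoolStats, prevEmptyAcquires int64) (highUtilization, poolWaits bool) {
	utilization := float64(cur.AcquiredConns) / float64(cur.MaxConns)
	return utilization > 0.80, cur.EmptyAcquireCount > prevEmptyAcquires
}

func main() {
	// 17 of 20 connections busy, 3 new waits since the last scrape: both alerts fire
	hi, waits := PoolAlerts(PoolStats{AcquiredConns: 17, MaxConns: 20, EmptyAcquireCount: 45}, 42)
	fmt.Println(hi, waits) // true true
}
```

Because `EmptyAcquireCount` is cumulative, the alert compares against the previous scrape rather than testing for zero.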
Observability Stack
Application Logs (slog JSON)
↓
CloudWatch Logs
↓
CloudWatch / Datadog / Grafana Loki
↓
Dashboards + Alerts

Example Datadog Monitors
# datadog-monitors.yaml
- name: "PostgreSQL Connection Pool Exhaustion"
type: metric alert
query: "avg(last_5m):avg:postgres.pool.wait_count{env:prod} > 10"
message: |
Connection pool is experiencing waits. Immediate action required.
Troubleshooting:
1. Check active connections: SELECT count(*) FROM pg_stat_activity
2. Check long-running queries: SELECT * FROM pg_stat_activity WHERE state != 'idle' AND query_start < now() - interval '30 seconds'
3. Consider scaling DB replicas or increasing pool size
escalation:
- "@pagerduty-engineering"
- "@slack-alerts"
- name: "Slow Query Detection"
type: log alert
query: 'logs("slow query").rollup("count").last("5m") > 100'
message: |
High volume of slow queries detected (> 500ms).
Check query tracer logs for offending queries.
May need to add indexes or optimize query patterns.
- name: "Database Size Growth"
type: metric alert
query: "avg(last_1h):avg:postgres.database_size{env:prod} > 800000000000" # 800GB
message: |
Database approaching 1TB. Plan migration to larger instance or sharding.

Migration Checklists
Phase 1 → Phase 2 (Add Read Replicas)
Pre-migration:
- [ ] Verify RDS primary instance health and backups
- [ ] Estimate downtime window (typically minimal for adding replicas)
- [ ] Schedule maintenance window (communicate to customers)
Migration:
- [ ] Create read replicas: aws rds create-db-instance-read-replica
- [ ] Wait for replication to catch up
- [ ] Update AWS Secrets Manager with primary + replica URLs
- [ ] Deploy the Core API + Telemetry with read replica support (git push origin main)
- [ ] Verify queries routing to replicas: check RDS connection count
- [ ] Monitor for 24 hours
Rollback plan:
- [ ] If issues detected, revert environment variables to single primary URL
- [ ] Re-deploy without read replica code
Post-migration:
- [ ] Verify replication lag < 1s
- [ ] Update disaster recovery runbook
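One common way the "read replica support" deploy step is implemented: hand out the request's dedicated RLS connection from a replica pool for read-only requests and from the primary pool for everything else. A sketch under that assumption (`PoolForRequest` and the pool names are hypothetical, not RestartiX's actual routing):

```go
package main

import (
	"fmt"
	"strings"
)

// PoolForRequest decides which pool a request's dedicated RLS connection
// should come from. Reads (GET/HEAD) can be served by a replica; anything
// that may write must hit the primary. The RLS SET commands are
// session-local state, so they work on a read-only standby too.
func PoolForRequest(method string) string {
	switch strings.ToUpper(method) {
	case "GET", "HEAD":
		return "replica"
	default:
		return "primary"
	}
}

func main() {
	fmt.Println(PoolForRequest("GET"))  // replica
	fmt.Println(PoolForRequest("POST")) // primary
}
```

Caveat: replica routing trades read-your-writes consistency for capacity; endpoints that read immediately after a write may need to be pinned to the primary, and the replication-lag alert in the metrics table is the guardrail.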
Phase 2 → Phase 3 (Add Enterprise Tier)
Pre-migration:
- [ ] Implement tenant routing middleware
- [ ] Create routing database (separate PostgreSQL instance)
- [ ] Test enterprise provisioning script (dry-run)
- [ ] Identify first enterprise customer (largest org)
Migration:
- [ ] Provision first enterprise project (manually)
- [ ] Migrate customer data: pg_dump --schema=public --table=appointments --table=patients ... | psql
- [ ] Update tenant_shards routing table
- [ ] Deploy tenant routing middleware to production
- [ ] Verify customer routed to enterprise shard
- [ ] Monitor for 48 hours
- [ ] Automate provisioning script
- [ ] Migrate remaining enterprise customers (1 per week)
Post-migration:
- [ ] Document enterprise provisioning runbook
- [ ] Train support team on tier differences
- [ ] Update billing system to track shard assignments
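The routing middleware's core lookup can be sketched as an in-memory mirror of the tenant_shards table: organizations with an entry route to their enterprise shard, everyone else stays on the shared shard. All names here are illustrative assumptions, not the actual RestartiX implementation:

```go
package main

import "fmt"

// ShardRouter is a hypothetical in-memory mirror of the tenant_shards
// routing table, refreshed periodically from the routing database.
type ShardRouter struct {
	shards       map[int64]string // organization_id → shard identifier/DSN key
	defaultShard string
}

// ShardFor returns the shard for an organization, falling back to the
// shared (default) shard when no enterprise assignment exists.
func (r *ShardRouter) ShardFor(orgID int64) string {
	if s, ok := r.shards[orgID]; ok {
		return s
	}
	return r.defaultShard
}

func main() {
	router := &ShardRouter{
		shards:       map[int64]string{42: "enterprise-shard-1"},
		defaultShard: "shared-shard",
	}
	fmt.Println(router.ShardFor(42)) // enterprise-shard-1
	fmt.Println(router.ShardFor(7))  // shared-shard
}
```

The default-shard fallback is what makes the migration incremental: orgs can be moved one at a time (the checklist's "1 per week") by inserting a routing row, with no change for unmigrated tenants.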
Cost Optimization Strategies
1. Reserved Instances (20-30% savings)
# AWS RDS Reserved Instances (1-year or 3-year commitment)
# Example: db.r6g.large, 1-year, all upfront
# On-demand: $200/mo → Reserved: $140/mo (30% savings)
aws rds purchase-reserved-db-instances-offering \
--reserved-db-instances-offering-id <offering-id> \
--reserved-db-instance-id restartix-reserved-1 \
--db-instance-count 3

2. Storage Tiering
# Hot data (0-12 months): gp3 SSD (fast, expensive)
# Warm/cold data (1-6 years): S3 JSONL archives, transitioned to Glacier (very cheap)
# After 6 years (2190 days): objects expire (end of retention window)
# S3 Lifecycle Policy
{
  "Rules": [
    {
      "Id": "Archive audit logs",
      "Prefix": "audit-archive/",
      "Status": "Enabled",
      "Transitions": [
        { "Days": 365, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 2190 }
    }
  ]
}

3. Autoscaling Core API Replicas
# App Runner autoscaling config (set via AWS CLI or console)
aws apprunner update-service \
--service-arn <service-arn> \
--auto-scaling-configuration-arn <config-arn>
# Create autoscaling configuration
aws apprunner create-auto-scaling-configuration \
--auto-scaling-configuration-name restartix-platform-scaling \
--max-concurrency 100 \
--min-size 2 \
--max-size 10

4. Database Connection Pooler (External)
Not recommended for RLS-heavy apps, but worth mentioning:
- PgBouncer in session mode (preserves RLS session variables)
- Deploy as sidecar container or use RDS Proxy
- Reduces actual PostgreSQL connections by 50-70%
- Tradeoff: Adds 1-5ms latency per query
Core API (100 connections) → PgBouncer (50 connections) → PostgreSQL

Our verdict: Skip for now. RLS requires connection-scoped state, and PgBouncer session mode doesn't offer enough savings to justify the complexity.
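Pooler or not, the budget arithmetic behind this section is the same: the summed worst-case pool maxima across all service instances must stay under PostgreSQL's max_connections, with headroom for superuser and maintenance sessions. A sketch using the Phase 1 numbers from this document (the helper name and the headroom value of 20 are illustrative assumptions):

```go
package main

import "fmt"

// ServicePool describes one service's worst case:
// peak instance count × its DB_POOL_MAX.
type ServicePool struct {
	Name      string
	Instances int
	PoolMax   int
}

// FitsMaxConnections sums worst-case demand and checks it against
// max_connections minus headroom reserved for superuser/maintenance.
func FitsMaxConnections(services []ServicePool, maxConnections, headroom int) (demand int, ok bool) {
	for _, s := range services {
		demand += s.Instances * s.PoolMax
	}
	return demand, demand <= maxConnections-headroom
}

func main() {
	// Phase 1 worst case: 3 Core API instances × 30 + 2 Telemetry instances × 20
	demand, ok := FitsMaxConnections([]ServicePool{
		{Name: "core-api", Instances: 3, PoolMax: 30},
		{Name: "telemetry", Instances: 2, PoolMax: 20},
	}, 200, 20)
	fmt.Println(demand, ok) // 130 true
}
```

At Phase 1 limits (130 ≤ 180) the budget fits comfortably, which is part of why the pooler is deferred; re-run the math before raising instance counts or DB_POOL_MAX.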
Disaster Recovery
Backup Strategy
| Tier | Method | Frequency | Retention | RTO | RPO |
|---|---|---|---|---|---|
| Phase 1 (RDS) | RDS automated backups | Continuous WAL | 7 days PITR | 15 min | 5 min |
| Phase 2+ (RDS) | RDS automated backups | Continuous WAL | 7 days PITR | 15 min | 5 min |
| All phases | Manual pre-migration backups | Before each deploy | Permanent (S3) | 2 hours | 0 (planned) |
Recovery Procedures
Scenario 1: Accidental data deletion
# RDS Point-in-Time Recovery (AWS CLI)
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier restartix-prod \
--target-db-instance-identifier restartix-prod-recovered \
--restore-time 2026-02-13T14:30:00Z

Scenario 2: Database corruption
# Restore from latest snapshot
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier restartix-prod-restored \
--db-snapshot-identifier rds:restartix-prod-2026-02-13-06-00
# Update DNS or connection string to point to restored instance

Scenario 3: Complete region failure
- Cross-region RDS read replica (Phase 4)
- Promote replica to primary
- Update Cloudflare DNS
- RTO: 30 minutes, RPO: 5 minutes
Summary: Scaling Decision Tree
START: How many organizations?
├─ 1-10 orgs
│ └─ Phase 1: Single App Runner instance ✅
│
├─ 10-50 orgs
│ ├─ Is DB size > 100GB? YES → Phase 2 (Read replicas)
│ ├─ Are connections maxing out? YES → Phase 2
│ └─ Otherwise → Stay in Phase 1, monitor closely
│
├─ 50-100 orgs
│ ├─ Any org > 50k patients? YES → Phase 3 (Hybrid model)
│ ├─ Enterprise SLA required? YES → Phase 3
│ └─ Otherwise → Phase 2 with larger RDS instance
│
└─ 100+ orgs
└─ Phase 4: Multi-shard + geographic distribution
Special cases:
├─ GDPR data residency required? → Phase 4 (regional shards)
├─ Customer willing to pay $1000+/mo? → Enterprise tier (any phase)
└─ Regulatory requirement for isolation? → Enterprise tier

Next Steps
Immediate (Phase 1 - this week)
- ✅ Implement connection pool monitoring (see next document)
- ✅ Add query timeout middleware
- ✅ Set up Datadog/CloudWatch alerts for connection pool metrics
- ✅ Document current connection pool configuration
Short-term (Phase 1 - next 3 months)
- Monitor metrics weekly
- Tune DB_POOL_MAX based on actual usage
- Plan Phase 2 migration when hitting trigger thresholds
- Set up automated audit log partitioning
Long-term (Phase 2+)
- Scale RDS when > 10 orgs or > 100GB
- Implement tenant routing for enterprise tier
- Build automated provisioning scripts
- Plan geographic expansion (EU shard)
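For capacity reviews, the scaling decision tree in the summary above can be encoded as a small helper so the trigger thresholds are checked mechanically rather than by eyeballing dashboards. The function name, signature, and phase encoding are illustrative; the thresholds mirror the tree:

```go
package main

import "fmt"

// RecommendPhase encodes the scaling decision tree: org count first,
// then the per-bracket triggers (DB size / connection pressure for
// Phase 2, large orgs or enterprise SLAs for Phase 3).
func RecommendPhase(orgs, dbSizeGB int, connectionsMaxed bool, largestOrgPatients int, enterpriseSLA bool) int {
	switch {
	case orgs > 100:
		return 4 // multi-shard + geographic distribution
	case orgs > 50 && (largestOrgPatients > 50_000 || enterpriseSLA):
		return 3 // hybrid model with enterprise tier
	case orgs > 10 && (dbSizeGB > 100 || connectionsMaxed):
		return 2 // read replicas
	default:
		return 1 // single shared instance; keep monitoring
	}
}

func main() {
	fmt.Println(RecommendPhase(5, 30, false, 8_000, false))    // 1
	fmt.Println(RecommendPhase(30, 150, false, 20_000, false)) // 2
	fmt.Println(RecommendPhase(60, 200, true, 80_000, false))  // 3
	fmt.Println(RecommendPhase(150, 900, true, 120_000, true)) // 4
}
```

The special cases in the tree (GDPR residency, isolation mandates, $1000+/mo customers) sit outside this function on purpose: they are contractual triggers, not load triggers.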