
Monitoring & Observability

Production-grade observability strategy for the RestartiX platform on AWS ECS Fargate + Cloudflare. The launch stack uses CloudWatch (logs + metrics + alarms) + Sentry (error tracking) + Cloudflare analytics; Datadog and similar APMs are deferred — see aws-infrastructure.md for the broader infrastructure context.

SQL is illustrative

SQL fragments in this document are examples meant to convey shape and intent — they're not authoritative reproductions of the production schema. The real migrations live in services/api/migrations/core/.

Why Monitoring Matters for This Architecture

RestartiX Platform uses Row-Level Security (RLS), which means every request holds a database connection for its entire duration. Unlike traditional connection pooling where connections are briefly borrowed and returned, our architecture requires careful monitoring to prevent connection pool exhaustion.
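
To make that concrete, here is a minimal sketch of the per-request pattern, assuming pgx v5; the GUC name app.current_organization is illustrative, not the production setting:

go
import (
    "context"

    "github.com/jackc/pgx/v5"
    "github.com/jackc/pgx/v5/pgxpool"
)

// Sketch only: shows why RLS pins one pooled connection per request. The org
// ID is attached as a transaction-local setting, so every query in the request
// must run on this same connection until commit/rollback.
func withOrgTx(ctx context.Context, pool *pgxpool.Pool, orgID string, fn func(pgx.Tx) error) error {
    tx, err := pool.Begin(ctx) // pool slot acquired here...
    if err != nil {
        return err
    }
    defer tx.Rollback(ctx) // no-op if Commit succeeds

    // set_config(..., true) scopes the value to this transaction; RLS policies
    // read it back via current_setting('app.current_organization').
    if _, err := tx.Exec(ctx,
        "SELECT set_config('app.current_organization', $1, true)", orgID); err != nil {
        return err
    }
    if err := fn(tx); err != nil {
        return err
    }
    return tx.Commit(ctx) // ...and released only here
}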

Critical risks without monitoring:

  • Connection pool exhaustion → cascading failures
  • Slow queries → blocked connections → no capacity for new requests
  • Replication lag (Phase 2+) → stale data reads
  • Silent performance degradation → poor user experience

Critical Metrics

Database Connection Pool

The most critical metric for this architecture. Connection pool exhaustion is the primary scaling bottleneck.

| Metric | Target | Alert Threshold | Action |
| --- | --- | --- | --- |
| Connection pool wait count | 0 | > 0 for 1+ min | Immediate: Investigate long-running queries, consider scaling |
| Active DB connections | < 80% of max | > 80% of max | Warning: Monitor closely, plan capacity increase |
| Idle connections | > 20% of pool | < 10% of pool | Pool undersized or requests not releasing connections |
| Connection pool utilization | < 70% | > 80% | Plan for read replicas (Phase 2) or connection tuning |

Why this matters:

  • RLS requires holding a connection for the entire request
  • Pool exhaustion = all new requests fail immediately
  • No connection = no query = no response = 503 errors

Pool size by phase:

| Phase | Infrastructure | Max Connections | Reserved for App |
| --- | --- | --- | --- |
| Phase 1 | AWS RDS PostgreSQL | 100 | 90 (10 for monitoring/admin) |
| Phase 2 | AWS RDS (primary) | 200 | 180 (20 reserved) |
| Phase 2 | AWS RDS (replicas, each) | 200 | 190 (10 reserved) |
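
A hedged sketch of wiring these caps up at startup; DB_POOL_MAX is the same knob the runbooks below adjust via Secrets Manager, while the other config names are assumptions:

go
import (
    "context"
    "os"
    "strconv"
    "time"

    "github.com/jackc/pgx/v5/pgxpool"
)

func newPool(ctx context.Context) (*pgxpool.Pool, error) {
    cfg, err := pgxpool.ParseConfig(os.Getenv("DATABASE_URL"))
    if err != nil {
        return nil, err
    }
    // Phase 1: RDS max_connections = 100, of which 90 are reserved for the app.
    if n, err := strconv.Atoi(os.Getenv("DB_POOL_MAX")); err == nil {
        cfg.MaxConns = int32(n)
    }
    cfg.MaxConnLifetime = 30 * time.Minute // recycle connections periodically
    return pgxpool.NewWithConfig(ctx, cfg)
}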

Query Performance

| Metric | Target | Alert Threshold | Phase |
| --- | --- | --- | --- |
| Query latency (p50) | < 50ms | > 200ms | All |
| Query latency (p95) | < 200ms | > 500ms | All |
| Query latency (p99) | < 500ms | > 1s | All |
| Slow query count | 0 | > 10/min | All |
| Query timeout rate | 0% | > 0.1% | All |

Slow query definition: Any query taking > 500ms

What to investigate:

  • Missing indexes on organization_id (required for RLS; see the index sketch after this list)
  • N+1 queries (multiple queries in loops)
  • Full table scans on large tables
  • Complex joins without proper indexing
  • Segment rule evaluation on large patient sets
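
Per the note at the top of this document, the SQL below is illustrative only; table and column names are assumptions, not the production schema:

sql
-- Illustrative: composite index so RLS-filtered lookups avoid full table scans.
-- Real migrations live in services/api/migrations/core/.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_appointments_org_start
    ON appointments (organization_id, start_time);

-- Confirm the planner actually uses it:
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, start_time
FROM appointments
WHERE organization_id = 'a0000000-0000-0000-0000-000000000001'
  AND start_time >= now() - interval '7 days';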

Request Performance

| Metric | Target | Alert Threshold | Phase |
| --- | --- | --- | --- |
| Request latency (p50) | < 100ms | > 300ms | All |
| Request latency (p95) | < 300ms | > 500ms | All |
| Request latency (p99) | < 500ms | > 1s | All |
| Error rate (4xx) | < 5% | > 10% | All |
| Error rate (5xx) | < 0.1% | > 1% | All |
| Timeout rate | < 0.1% | > 1% | All |

Database Health

| Metric | Target | Alert Threshold | Phase |
| --- | --- | --- | --- |
| Database size | N/A | > 80% of allocated | All |
| Replication lag | < 1s | > 5s | Phase 2+ |
| Replica health | All healthy | Any replica down | Phase 2+ |
| Dead tuples | < 5% | > 10% | All |

Application Health

| Metric | Target | Alert Threshold | Phase |
| --- | --- | --- | --- |
| CPU usage | < 60% | > 80% | All |
| Memory usage | < 70% | > 85% | All |
| Goroutines | < 1000 | > 5000 | All |
| Heap allocations | < 500MB | > 1GB | All |
| GC pause time (p99) | < 10ms | > 50ms | All |

Feature-Specific Metrics

| Feature | Metric | Target | Alert Threshold |
| --- | --- | --- | --- |
| Segments | Rule evaluation time | < 100ms | > 500ms |
| Segments | Patient count calculation | < 200ms | > 1s |
| Webhooks | Delivery success rate | > 95% | < 90% |
| Webhooks | Delivery latency (p95) | < 5s | > 30s |
| Webhooks | Retry queue depth | < 100 | > 500 |
| Forms | PDF generation time | < 2s | > 10s |
| Documents | S3 upload time (p95) | < 1s | > 5s |
| Scheduling | Availability calculation | < 200ms | > 1s |
| Scheduling | Hold expiration accuracy | 100% | < 99.9% |
| Audit (local) | Write throughput (inserts to audit_log) | > 1000/sec | < 100/sec |
| Audit (local) | Write latency (p95) | < 5ms | > 20ms |
| Audit (forwarding) | Telemetry forwarding lag (received_at - created_at) | < 30s | > 5min |
| Audit (forwarding) | Queue depth (Redis audit queue) | < 100 | > 500 |
| Audit (forwarding) | Forward error rate | < 0.1% | > 1% |
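
The forwarding-lag rows can be spot-checked with a query of this shape (illustrative; the table name is an assumption, while received_at and created_at are the columns named in the metric above):

sql
-- Illustrative: p95 Telemetry forwarding lag over the last hour, in seconds.
SELECT percentile_cont(0.95) WITHIN GROUP (
    ORDER BY EXTRACT(EPOCH FROM received_at - created_at)
) AS p95_lag_seconds
FROM audit_events
WHERE received_at > now() - interval '1 hour';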

Service Level Objectives (SLOs)

API Availability

| Tier | SLO | Monthly Downtime Allowance |
| --- | --- | --- |
| All tiers | 99.5% | 3h 37min |

API Latency

| Endpoint Category | p95 Target | p99 Target |
| --- | --- | --- |
| Read (simple) | < 100ms | < 200ms |
| Read (complex) | < 300ms | < 500ms |
| Write (simple) | < 200ms | < 400ms |
| Write (complex) | < 500ms | < 1s |
| Export/Report | < 5s | < 10s |

Endpoint categories:

  • Simple read: GET single resource by ID
  • Complex read: List with filtering, joins, or segment evaluation
  • Simple write: Create/update single resource
  • Complex write: Multi-step operations (appointment booking, form submission with file uploads)
  • Export/Report: PDF generation, CSV export, analytics queries

Data Durability

| Data Type | RPO (Recovery Point Objective) | RTO (Recovery Time Objective) |
| --- | --- | --- |
| Database (primary) | 5 minutes | 1 hour |
| Database (replica failover) | 0 (synchronous replication) | 5 minutes |
| File storage (S3) | 0 (11 9's durability) | Immediate |
| Audit logs | 0 (dual write: local + Telemetry) | Immediate |

Observability Stack

Architecture

The launch stack is CloudWatch + Sentry + Cloudflare. Datadog and similar APMs are deferred — see aws-infrastructure.md → What we don't use and why for the rationale (CloudWatch alarms + Sentry error tracking is enough until traffic and team size justify the per-host APM bill).

Application Layer (Core API + Telemetry API + Next.js apps)
├─ Go slog / Node logger (structured JSON)
├─ pgx connection pool stats
├─ HTTP middleware metrics (request count, duration, errors)
└─ Custom business metrics (via CloudWatch PutMetricData)

CloudWatch (logs + metrics + alarms)
├─ Log groups per service (retention configured per env)
├─ Custom metrics namespace: RestartiX/{service}
└─ Alarms → SNS → AWS Chatbot → Slack / email

Sentry (error tracking)
├─ Go and Next.js error capture
├─ Release tracking (image SHA = release tag)
└─ Alert routing for unhandled errors

Cloudflare (edge observability)
├─ Request volume, cache hit ratio, geo distribution
├─ WAF block events, bot mitigation triggers
└─ Cloudflare for SaaS hostname health (custom-domain TLS status)

Future / deferred:

  • Datadog (or similar APM) — distributed tracing, anomaly detection. Trigger to add: when CloudWatch's basic alarms stop being sufficient or when team size makes the per-engineer APM cost justifiable.
  • AWS X-Ray — distributed tracing native to AWS. Cheaper than Datadog but requires SDK integration in every service.
  • Grafana / Prometheus self-hosted stack — explicitly rejected; running an observability platform is operational debt for a small team.

Implementation: Connection Pool Monitoring

File: internal/observability/pool_metrics.go

See immediate-actions.md for full implementation.

Key capabilities:

  • Logs pool stats every 30 seconds
  • Automatic alerts on utilization > 80%
  • Error logs when wait count > 0 (immediate attention required)
  • Exports metrics to CloudWatch via the cloudwatch.PutMetricData AWS SDK call

Usage:

go
// cmd/api/main.go
poolMetrics := observability.NewPoolMetrics(db, "primary")
poolMetrics.Start()
defer poolMetrics.Stop()

// For read replicas (Phase 2+)
for i, replica := range readReplicas {
    metrics := observability.NewPoolMetrics(replica, fmt.Sprintf("replica-%d", i+1))
    metrics.Start()
    defer metrics.Stop()
}
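
The export step itself is small. A sketch, assuming aws-sdk-go-v2; the namespace and metric name match the custom metrics listed under ECS Fargate & CloudWatch Monitoring below:

go
import (
    "context"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/cloudwatch"
    "github.com/aws/aws-sdk-go-v2/service/cloudwatch/types"
    "github.com/jackc/pgx/v5/pgxpool"
)

// publishPoolStats pushes one utilization datapoint for a named pool.
func publishPoolStats(ctx context.Context, cw *cloudwatch.Client, pool *pgxpool.Pool, name string) error {
    st := pool.Stat()
    utilization := float64(st.AcquiredConns()) / float64(st.MaxConns()) * 100

    _, err := cw.PutMetricData(ctx, &cloudwatch.PutMetricDataInput{
        Namespace: aws.String("RestartiX/CoreAPI"),
        MetricData: []types.MetricDatum{{
            MetricName: aws.String("PoolUtilization"),
            Unit:       types.StandardUnitPercent,
            Value:      aws.Float64(utilization),
            Dimensions: []types.Dimension{
                {Name: aws.String("Pool"), Value: aws.String(name)},
            },
        }},
    })
    return err
}

In production you would batch several datums per PutMetricData call on the 30-second ticker rather than making one call per metric.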

Implementation: Query Performance Tracing

File: internal/middleware/query_tracer.go

See immediate-actions.md for full implementation.

Key capabilities:

  • Logs all queries > 500ms with SQL preview
  • Critical alerts for queries > 5s
  • Tracks request context (request ID, path, user)
  • Integrates with pgx tracer hooks
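
A minimal sketch of the tracer shape pgx v5 expects; the real middleware/query_tracer.go adds request context (request ID, path, user), and the 200-character preview here is illustrative:

go
import (
    "context"
    "log/slog"
    "time"

    "github.com/jackc/pgx/v5"
)

type ctxKey struct{}

type queryStart struct {
    at  time.Time
    sql string
}

// slowQueryTracer implements pgx.QueryTracer. Wire it up with
// cfg.ConnConfig.Tracer = &slowQueryTracer{threshold: 500 * time.Millisecond}.
type slowQueryTracer struct {
    threshold time.Duration
}

func (t *slowQueryTracer) TraceQueryStart(ctx context.Context, _ *pgx.Conn, data pgx.TraceQueryStartData) context.Context {
    return context.WithValue(ctx, ctxKey{}, queryStart{at: time.Now(), sql: data.SQL})
}

func (t *slowQueryTracer) TraceQueryEnd(ctx context.Context, _ *pgx.Conn, data pgx.TraceQueryEndData) {
    start, ok := ctx.Value(ctxKey{}).(queryStart)
    if !ok {
        return
    }
    if d := time.Since(start.at); d > t.threshold {
        preview := start.sql
        if len(preview) > 200 {
            preview = preview[:200] // SQL preview only; parameter values are never logged
        }
        slog.Warn("slow_query", "duration_ms", d.Milliseconds(), "sql", preview, "err", data.Err)
    }
}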

Implementation: Request Timeout Middleware

File: internal/middleware/query_timeout.go

See immediate-actions.md for full implementation.

Timeout values:

  • Default: 30 seconds (all requests)
  • Long operations: 2 minutes (exports, PDF generation, complex reports)
  • Query-level: 5 seconds (individual queries)
  • Long queries: 30 seconds (analytics, aggregations)

Why this matters:

  • Prevents runaway queries from exhausting connection pool
  • Ensures requests fail fast instead of hanging indefinitely
  • Provides clear error messages to clients
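
The middleware shape is small; a minimal stdlib sketch (per-route selection of the 30s vs 2m budget omitted):

go
import (
    "context"
    "net/http"
    "time"
)

// withTimeout cancels the request context after d; pgx queries run with that
// context, so a hung query aborts and its connection returns to the pool.
func withTimeout(d time.Duration, next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        ctx, cancel := context.WithTimeout(r.Context(), d)
        defer cancel()
        next.ServeHTTP(w, r.WithContext(ctx))
    })
}

Because pgx honors context cancellation, a request that exceeds its budget also aborts its in-flight query. The stdlib http.TimeoutHandler is an alternative that additionally writes the 503 response body for you.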

Implementation: Health Checks

File: internal/health/handler.go

See immediate-actions.md for full implementation.

Health check endpoint: GET /health

Response format:

json
{
    "status": "healthy", // or "degraded", "unhealthy"
    "checks": {
        "postgresql": {
            "status": "healthy",
            "metrics": {
                "total_conns": 45,
                "acquired_conns": 30,
                "idle_conns": 15,
                "max_conns": 90,
                "utilization_pct": 50.0
            }
        },
        "redis": {
            "status": "healthy"
        }
    },
    "uptime_seconds": 86400
}
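
A sketch of a handler that produces this shape from live pool stats (pgxpool assumed; the Redis check is omitted, and the degraded threshold mirrors the 80% utilization alert):

go
import (
    "encoding/json"
    "net/http"
    "time"

    "github.com/jackc/pgx/v5/pgxpool"
)

func healthHandler(pool *pgxpool.Pool, startedAt time.Time) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        st := pool.Stat()
        util := float64(st.AcquiredConns()) / float64(st.MaxConns()) * 100

        status := "healthy"
        if util > 80 {
            status = "degraded" // mirrors the pool utilization warning threshold
        }

        w.Header().Set("Content-Type", "application/json")
        json.NewEncoder(w).Encode(map[string]any{
            "status": status,
            "checks": map[string]any{
                "postgresql": map[string]any{
                    "status": status,
                    "metrics": map[string]any{
                        "total_conns":     st.TotalConns(),
                        "acquired_conns":  st.AcquiredConns(),
                        "idle_conns":      st.IdleConns(),
                        "max_conns":       st.MaxConns(),
                        "utilization_pct": util,
                    },
                },
            },
            "uptime_seconds": int(time.Since(startedAt).Seconds()),
        })
    }
}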

ALB target group health check configuration:

Each ECS service has its own ALB target group. Health checks are defined in Terraform on the target group:

hcl
resource "aws_lb_target_group" "core_api" {
  name        = "restartix-core-api-${var.env}"
  port        = 9000
  protocol    = "HTTP"
  target_type = "ip"
  vpc_id      = var.vpc_id

  health_check {
    enabled             = true
    path                = "/health"
    protocol            = "HTTP"
    interval            = 10
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
    matcher             = "200"
  }
}

Health check is intentionally shallow: /health confirms the process is alive and the connection pool has at least one healthy connection. A deeper /health/ready endpoint that pings every dependency is available for monitoring tools (CloudWatch synthetic checks) but is not used by the ALB, because flaky dependencies would unnecessarily roll otherwise-healthy tasks.


Datadog Dashboards (deferred — for reference)

Status: deferred. The launch stack uses CloudWatch alarms + Sentry — see Observability Stack → Architecture. The dashboards documented in this section are kept as an implementation reference for the day Datadog (or a similar APM) is added — the metric names and widget shapes carry over directly. Until then, the equivalent CloudWatch dashboards in the next section are the source of truth.

Dashboard 1: Database Performance

Widgets:

  1. Connection Pool Utilization (time series)

    • Metric: postgres.pool.utilization_pct
    • Alert line at 80%
    • By pool name (primary, replica-1, replica-2)
  2. Connection Pool Wait Count (time series)

    • Metric: postgres.pool.wait_count
    • Alert line at 0 (any waits = problem)
  3. Active vs Idle Connections (stacked area)

    • Metrics: postgres.pool.acquired_conns, postgres.pool.idle_conns
    • Shows pool usage distribution
  4. Query Latency Distribution (heatmap)

    • Metric: query.duration_ms
    • Buckets: 0-50ms, 50-100ms, 100-200ms, 200-500ms, 500ms-1s, 1s+
  5. Slow Query Count (time series)

    • Metric: query.slow_count (queries > 500ms)
    • Alert line at 10/min
  6. Replication Lag (time series, Phase 2+)

    • Metric: postgres.replication_lag_seconds
    • Alert line at 5s
  7. Database Size Growth (line)

    • Metric: postgres.database_size_bytes
    • Projected to 80% threshold

Dashboard 2: API Performance

Widgets:

  1. Request Latency (p50, p95, p99) (multi-line time series)

    • Metric: http.request.duration_ms
    • Percentiles: 50th, 95th, 99th
    • Alert lines at targets
  2. Request Rate (time series)

    • Metric: http.request.count
    • By method (GET, POST, PATCH, DELETE)
  3. Error Rates (stacked area)

    • Metrics: http.response.4xx, http.response.5xx
    • Alert line at 1% for 5xx
  4. Endpoint Performance (top list)

    • Metric: http.request.duration_ms (p95)
    • Grouped by endpoint
    • Sorted by slowest
  5. Timeout Rate (time series)

    • Metric: http.request.timeout_count
    • Alert line at 0.1%
  6. Active Requests (gauge)

    • Metric: http.request.active
    • Shows current load

Dashboard 3: Application Health

Widgets:

  1. CPU Usage (time series)

    • Metric: system.cpu.usage_pct
    • Alert line at 80%
  2. Memory Usage (time series)

    • Metric: system.memory.usage_pct
    • Alert line at 85%
  3. Goroutine Count (time series)

    • Metric: go.goroutines
    • Alert line at 5000
  4. Heap Allocations (time series)

    • Metric: go.heap.alloc_bytes
    • Alert line at 1GB
  5. GC Pause Time (p99) (time series)

    • Metric: go.gc.pause_ns
    • Alert line at 50ms
  6. Error Log Rate (time series)

    • Metric: log.error.count
    • Shows application errors

Dashboard 4: Feature-Specific Metrics

Widgets:

  1. Segment Evaluation Performance (histogram)

    • Metric: segment.evaluation.duration_ms
    • Target: < 100ms
  2. Webhook Delivery Success Rate (gauge)

    • Metric: webhook.delivery.success_rate
    • Target: > 95%
  3. Webhook Retry Queue Depth (time series)

    • Metric: webhook.retry.queue_depth
    • Alert line at 500
  4. Form PDF Generation Time (histogram)

    • Metric: form.pdf.generation_ms
    • Target: < 2s
  5. S3 Upload Performance (time series, p95)

    • Metric: s3.upload.duration_ms
    • Target: < 1s
  6. Audit Log Write Throughput (time series)

    • Metric: audit_log.write.count
    • Shows writes/sec
  7. Scheduling Availability Calculation (histogram)

    • Metric: scheduling.availability.duration_ms
    • Target: < 200ms

CloudWatch Dashboards

CloudWatch is the primary monitoring solution for our AWS infrastructure. See AWS Infrastructure for the full setup.

Custom Metrics Namespace

RestartiX/CoreAPI

Metrics to push:

  • Connection pool stats (via CloudWatch PutMetricData API)
  • Query performance (from query tracer)
  • Request metrics (from HTTP middleware)
  • Application metrics (goroutines, memory, GC)

CloudWatch Logs Insights Queries

Find slow queries:

fields @timestamp, path, method, duration_ms, sql
| filter duration_ms > 500
| sort duration_ms desc
| limit 50

Connection pool exhaustion events:

fields @timestamp, pool, utilization_pct, wait_count
| filter wait_count > 0
| sort @timestamp desc

Error rate by endpoint:

fields @timestamp, path, status_code
| filter status_code >= 500
| stats count() by path
| sort count desc

Alert Configuration

Critical Alerts (PagerDuty)

These require immediate action, typically within 15 minutes.

1. Connection Pool Exhaustion

yaml
name: "PostgreSQL Connection Pool Exhaustion"
type: metric alert
query: "avg(last_5m):avg:postgres.pool.wait_count{env:prod} > 0"
severity: critical
notification:
    - "@pagerduty-engineering"
    - "@slack-alerts"
message: |
    CRITICAL: Connection pool experiencing waits. Requests are being delayed.

    Immediate actions:
    1. Check active connections: SELECT count(*) FROM pg_stat_activity
    2. Identify long-running queries: SELECT * FROM pg_stat_activity WHERE state != 'idle' AND query_start < now() - interval '30 seconds'
    3. Consider emergency connection pool increase or read replica routing

    Runbook: https://docs.restartix.internal/runbooks/connection-pool-exhaustion

2. Database Unreachable

yaml
name: "Database Connection Failed"
type: service check
query: "postgres.can_connect"
severity: critical
notification:
    - "@pagerduty-on-call"
    - "@slack-critical"

3. High Error Rate

yaml
name: "API Error Rate > 5%"
type: metric alert
query: "sum(last_5m):sum:http.response.5xx{env:prod}.as_count() / sum:http.response.total{env:prod}.as_count() > 0.05"
severity: critical
notification:
    - "@pagerduty-engineering"

Warning Alerts (Slack)

These require attention but not immediate action.

1. Connection Pool High Utilization

yaml
name: "Connection Pool > 80%"
type: metric alert
query: "avg(last_10m):avg:postgres.pool.utilization_pct{env:prod} > 80"
severity: warning
notification:
    - "@slack-alerts"
message: |
    WARNING: Connection pool utilization high. Monitor for potential exhaustion.

    Next steps:
    1. Review query performance dashboard for slow queries
    2. Check request latency trends
    3. Plan for scaling if trend continues (read replicas or pool size increase)

2. Slow Query Volume

yaml
name: "High Volume of Slow Queries"
type: log alert
query: 'logs("slow_query").rollup("count").last("10m") > 50'
severity: warning
notification:
    - "@slack-engineering"
message: |
    High volume of slow queries (> 500ms) detected.

    Actions:
    1. Review query tracer logs for patterns
    2. Check for missing indexes
    3. Consider query optimization or caching

3. Replication Lag (Phase 2+)

yaml
name: "Replication Lag > 5 seconds"
type: metric alert
query: "max(last_5m):max:postgres.replication_lag_seconds{env:prod} > 5"
severity: warning
notification:
    - "@slack-alerts"
message: |
    Replication lag exceeding 5 seconds. Read replicas may serve stale data.

    Actions:
    1. Check primary database load
    2. Verify network connectivity between primary and replicas
    3. Consider temporary routing of reads to primary if lag persists

Info Alerts (Slack)

1. Database Size Growth

yaml
name: "Database Approaching 80% Capacity"
type: metric alert
query: "avg(last_1h):avg:postgres.database_size_bytes{env:prod} > 800000000000" # 800GB
severity: info
notification:
    - "@slack-engineering"
message: |
    Database size approaching 80% of allocated capacity (1TB).

    Planning required:
    1. Review data retention policies
    2. Plan migration to larger instance
    3. Consider audit log partitioning/archival

ECS Fargate & CloudWatch Monitoring

Every Fargate service ships container metrics + application logs to CloudWatch automatically. Alarms are defined in Terraform; alerts route through SNS → AWS Chatbot → Slack.

Metrics Available

Per-service ECS metrics (automatic, no SDK calls needed):

  • CPUUtilization (%) — used by auto-scaling target tracking
  • MemoryUtilization (%)
  • Task count (running / desired / pending)
  • Service deployment status (rolling update progress)

ALB metrics (per target group):

  • RequestCount (per target group)
  • TargetResponseTime (p50, p90, p95, p99)
  • HTTPCode_Target_2XX_Count / 4XX_Count / 5XX_Count
  • UnHealthyHostCount — used for the "service unhealthy" alarm
  • RejectedConnectionCount — early signal of capacity exhaustion

RDS metrics:

  • DatabaseConnections — used for the "DB connections > 80%" alarm
  • CPUUtilization, FreeableMemory, FreeStorageSpace
  • ReadIOPS, WriteIOPS — gp3 IOPS usage
  • ReplicaLag (Phase 2+ when read replicas exist)

Custom application metrics (pushed via cloudwatch.PutMetricData):

  • RestartiX/CoreAPI/PoolUtilization
  • RestartiX/CoreAPI/SlowQueryCount
  • RestartiX/CoreAPI/AuditLogWriteFailures
  • Feature-specific metrics (added per F-tier feature)

Via AWS CLI:

bash
# Tail logs in real-time
aws logs tail /ecs/restartix-core-api --follow

# View ECS service status
aws ecs describe-services \
  --cluster restartix-production \
  --services restartix-core-api

# Pull a metric
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name CPUUtilization \
  --dimensions Name=ClusterName,Value=restartix-production Name=ServiceName,Value=restartix-core-api \
  --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Average

CloudWatch Alarms (defined in Terraform)

Alarms live in infra/modules/monitoring and apply to both staging and prod (with thresholds parameterized per environment). Example:

hcl
resource "aws_cloudwatch_metric_alarm" "core_api_5xx" {
  alarm_name          = "restartix-${var.env}-core-api-5xx"
  metric_name         = "HTTPCode_Target_5XX_Count"
  namespace           = "AWS/ApplicationELB"
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = var.env == "production" ? 10 : 50
  comparison_operator = "GreaterThanThreshold"
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    TargetGroup  = aws_lb_target_group.core_api.arn_suffix
    LoadBalancer = aws_lb.main.arn_suffix
  }
}

External health-check probes (UptimeRobot, Pingdom, BetterStack) can be added if you want a third-party "is the public URL up?" perspective — useful for catching ALB / Cloudflare-level outages that CloudWatch wouldn't see. Optional, not required at launch.


Cloudflare observability

Cloudflare's analytics dashboard is the primary view for edge traffic. The platform-relevant metrics:

| Metric | Where | What it tells you |
| --- | --- | --- |
| Total requests, cache hit ratio | Analytics → Traffic | Overall traffic shape, CDN effectiveness |
| Bandwidth saved by cache | Analytics → Traffic | How much origin egress cost is avoided |
| Geographic distribution | Analytics → Traffic | Where traffic comes from — useful for capacity planning |
| WAF block events | Security → Events | Bot mitigation, OWASP rule triggers |
| Bot fight mode mitigations | Security → Bots | Volume of bot traffic blocked at edge |
| Custom hostname status (Cloudflare for SaaS) | SSL/TLS → Custom Hostnames | Per-clinic custom-domain TLS health |
| SSL/TLS errors | SSL/TLS → Edge Certificates | Cert renewal failures, validation errors |

For CI / scripting, Cloudflare's GraphQL Analytics API can pull these metrics:

bash
# Example: pull yesterday's request count for the zone
curl -X POST https://api.cloudflare.com/client/v4/graphql \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "query": "query { viewer { zones(filter: {zoneTag: \"$CF_ZONE_ID\"}) { httpRequests1dGroups(limit: 1, filter: {date_geq: \"2026-05-06\"}) { sum { requests cachedRequests } } } } }"
}
EOF

Custom-hostname health monitoring. A scheduled ECS task (cmd/check-cf-hostnames) is on the roadmap: it will query the Cloudflare for SaaS API for the status of every custom hostname registered to clinics in the organization_domains table and surface validation failures back to the Console admin UI. This task ships alongside the first F-tier custom-domain consumer.
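
Until that task exists, the same check can be done by hand. The custom-hostnames endpoint is Cloudflare's real API; the jq filter is illustrative:

bash
# List custom hostnames that are not fully active (pagination omitted)
curl -s "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/custom_hostnames?per_page=50" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  | jq -r '.result[]
           | select(.status != "active" or .ssl.status != "active")
           | "\(.hostname)\t\(.status)\t\(.ssl.status)"'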


Incident Response Procedures

Runbook: Connection Pool Exhaustion

Symptoms:

  • postgres.pool.wait_count > 0
  • Requests timing out (504 Gateway Timeout)
  • Health check returning "degraded" or "unhealthy"

Investigation:

  1. Check active connections:
sql
SELECT count(*), state
FROM pg_stat_activity
WHERE datname = 'restartix_platform'
GROUP BY state;
  2. Find long-running queries:
sql
SELECT pid, usename, state, query_start, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle'
  AND query_start < now() - interval '30 seconds'
ORDER BY duration DESC;
  3. Check query wait events:
sql
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event IS NOT NULL;

Resolution:

Immediate (< 5 minutes):

  • Kill long-running queries if identified as non-critical:
    sql
    SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE pid = <PID>;
  • Temporarily increase connection pool size (if headroom available):
    bash
    # Update DB_POOL_MAX in restartix/production/database secret, then force a service restart
    aws secretsmanager update-secret --secret-id restartix/production/database ...
    aws ecs update-service \
      --cluster restartix-production \
      --service restartix-core-api \
      --force-new-deployment
    Production deploys via the regular pipeline require manual approval (see deployment.md). For an in-incident emergency change, the operations IAM role is allowed to call update-service directly.

Short-term (< 1 hour):

  • Add query timeouts if not already configured (see immediate-actions.md)
  • Optimize identified slow queries (add indexes, rewrite queries)
  • Enable read replica routing for read-heavy endpoints (Phase 2+)

Long-term (< 1 week):

  • Plan read replica routing for read-heavy endpoints (Phase 2)
  • Right-size the connection pool and RDS instance based on observed utilization trends
  • Add the triggering scenario to the monthly load tests so the regression is caught before it recurs

Runbook: High Error Rate (5xx)

Symptoms:

  • HTTPCode_Target_5XX_Count spike on the ALB
  • User complaints about "Something went wrong" errors
  • CloudWatch alarm restartix-{env}-core-api-5xx fired

Investigation:

  1. Check error logs:
bash
aws logs filter-log-events \
  --log-group-name /ecs/restartix-core-api \
  --filter-pattern "?error ?fatal ?ERROR ?FATAL" \
  --start-time $(date -d '1 hour ago' +%s000)
  2. Group errors by type:
fields @timestamp, error, path, user_id
| filter level = "ERROR"
| stats count() by error
| sort count desc
  3. Check database connectivity:
bash
curl https://api.restartix.com/health

Resolution:

Database connection errors:

  • Check RDS database status
  • Verify connection string in environment variables
  • Check for network issues (security groups, VPC config)

Application errors:

  • Review recent deployments (rollback if needed)
  • Check for panic/crash logs
  • Verify external service availability (Clerk, Daily.co, S3)

Capacity issues:

  • Check CPU/memory usage (may need vertical scaling)
  • Review goroutine count (potential goroutine leak)
  • Check connection pool status

Runbook: Slow Response Times

Symptoms:

  • http.request.duration_ms (p95/p99) elevated
  • User complaints about "slow" pages
  • Timeout alerts

Investigation:

  1. Identify slow endpoints:
fields @timestamp, path, method, duration_ms
| filter duration_ms > 1000
| stats avg(duration_ms), count() by path
| sort avg desc
  2. Check for slow queries:
fields @timestamp, sql, duration_ms
| filter duration_ms > 500
| sort duration_ms desc
| limit 20
  3. Check database performance:
  • Connection pool utilization
  • Replication lag (Phase 2+)
  • Database CPU/memory usage

Resolution:

Slow queries:

  • Add missing indexes (see database-schema.md)
  • Rewrite inefficient queries
  • Add query result caching (Redis)

High load:

  • Scale horizontally (add Core API instances)
  • Scale database vertically (larger RDS instance)
  • Enable read replica routing (Phase 2+)

External service latency:

  • Check Daily.co API response times
  • Check S3 upload/download performance
  • Implement circuit breakers for external calls
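
For the last point, a minimal hand-rolled circuit breaker sketch; in practice a library such as sony/gobreaker is the more common choice, and the thresholds here are illustrative:

go
import (
    "errors"
    "sync"
    "time"
)

// breaker trips open after consecutive failures, then rejects calls until the
// cooldown passes, so a sick dependency can't pin request goroutines.
type breaker struct {
    mu        sync.Mutex
    failures  int
    openUntil time.Time
}

var errCircuitOpen = errors.New("circuit open: external call skipped")

func (b *breaker) Call(fn func() error) error {
    b.mu.Lock()
    if time.Now().Before(b.openUntil) {
        b.mu.Unlock()
        return errCircuitOpen // fail fast instead of waiting on the dependency
    }
    b.mu.Unlock()

    err := fn()

    b.mu.Lock()
    defer b.mu.Unlock()
    if err != nil {
        if b.failures++; b.failures >= 5 { // trip after 5 consecutive failures
            b.openUntil = time.Now().Add(30 * time.Second)
            b.failures = 0
        }
        return err
    }
    b.failures = 0
    return nil
}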

Monitoring Best Practices

1. Structured Logging

Use Go's slog package for all logging:

go
import "log/slog"

// Good: Structured with context
slog.Info("appointment created",
    "appointment_id", appt.ID,
    "organization_id", appt.OrganizationID,
    "specialist_id", appt.SpecialistID,
    "duration_ms", time.Since(start).Milliseconds(),
)

// Bad: Unstructured string interpolation
log.Printf("Created appointment %d for org %d", appt.ID, appt.OrganizationID)

2. Request Context Propagation

Always pass request context through the call stack:

go
// Good: Context propagation
func (s *Service) Create(ctx context.Context, req *CreateRequest) (*Appointment, error) {
    conn := database.TxFromContext(ctx)
    // ... use conn with context ...
}

// Bad: No context
func (s *Service) Create(req *CreateRequest) (*Appointment, error) {
    // How do you timeout? How do you trace?
}

3. Metric Naming Conventions

Use consistent metric naming:

<namespace>.<entity>.<metric>_<unit>

Examples:
- postgres.pool.utilization_pct
- http.request.duration_ms
- segment.evaluation.duration_ms
- webhook.delivery.success_rate

4. Alert Fatigue Prevention

Good alert characteristics:

  • Actionable (clear next step)
  • Specific (not "something is wrong")
  • Contextual (includes relevant data)
  • Rare (< 1/week for warnings, < 1/month for info)

Bad alerts:

  • "CPU usage > 50%" (too frequent, not actionable)
  • "Error occurred" (too vague)
  • "Database size growing" (without threshold context)

5. Dashboard Organization

Operational dashboards (for on-call):

  • Real-time metrics (5-second refresh)
  • Focus on SLOs and critical alerts
  • Clear visual indicators (red/yellow/green)

Strategic dashboards (for planning):

  • Longer time ranges (7-day, 30-day trends)
  • Capacity planning metrics
  • Cost analysis

Feature-specific dashboards (for developers):

  • Deep-dive into specific subsystems
  • Correlated metrics (e.g., webhook delivery + retry queue)
  • A/B test results, feature flag rollout metrics

Cost Optimization

Datadog Cost Management

Datadog pricing is based on:

  • Hosts (per instance)
  • Custom metrics (number of unique metric names)
  • Log ingestion (GB/month)

Optimization strategies:

  1. Reduce log volume:

    • Sample debug logs (e.g., 10% sampling)
    • Exclude health check logs
    • Use log patterns instead of storing every log
  2. Consolidate metrics:

    • Use tags instead of separate metrics
    • Example: http.request.duration_ms{endpoint:/appointments} instead of http.request.appointments.duration_ms
  3. Use metric rollups:

    • Keep high-resolution data for 7 days
    • Aggregate to 1-minute resolution after 7 days
    • Aggregate to 1-hour resolution after 30 days

Estimated Datadog cost by phase:

| Phase | Hosts | Custom Metrics | Logs (GB/mo) | Monthly Cost |
| --- | --- | --- | --- | --- |
| Phase 1 | 2 (Core API + Telemetry) | 50 | 10 GB | $31/mo |
| Phase 2 | 2 + RDS monitoring | 100 | 50 GB | $200/mo |

Adding Datadog later (optional, deferred)

The launch stack uses CloudWatch + Sentry. Adding Datadog (or a comparable APM) is a future option, not a planned migration. If/when a trigger fires (typically: distributed tracing across the Core API + Telemetry API + Next.js apps becomes load-bearing for debugging, OR the team grows past the point where CloudWatch dashboards scale operationally):

Pre-add:

  • [ ] Sign up for Datadog account, create API key, evaluate tier (DPA/SCC implications since Datadog is a US sub-processor)
  • [ ] Test Datadog integration in staging
  • [ ] Decide which signals stay in CloudWatch vs migrate to Datadog (alarms can stay AWS-native; dashboards + tracing migrate)

Add:

  • Datadog agent runs as an ECS sidecar in each task definition
  • API key from Secrets Manager → injected at task startup
  • Existing cloudwatch.PutMetricData calls stay; Datadog forwarder picks them up via the AWS integration

No need to remove CloudWatch. They coexist — CloudWatch keeps the AWS-native alarms cheap and the audit trail clean; Datadog adds tracing + advanced dashboards on top.


Testing & Validation

Load Testing

Use k6 or hey to simulate production load:

javascript
// load-test.js (k6)
import http from "k6/http";
import { check, sleep } from "k6";

export let options = {
    stages: [
        { duration: "2m", target: 100 }, // Ramp up to 100 users
        { duration: "5m", target: 100 }, // Stay at 100 users
        { duration: "2m", target: 200 }, // Ramp up to 200 users
        { duration: "5m", target: 200 }, // Stay at 200 users
        { duration: "2m", target: 0 }, // Ramp down
    ],
    thresholds: {
        http_req_duration: ["p(95)<500", "p(99)<1000"], // 95% < 500ms, 99% < 1s
        http_req_failed: ["rate<0.01"], // Error rate < 1%
    },
};

export default function () {
    const res = http.get("https://api.restartix.com/v1/appointments", {
        headers: { Authorization: "Bearer " + __ENV.API_TOKEN },
    });

    check(res, {
        "status is 200": (r) => r.status === 200,
        "response time < 500ms": (r) => r.timings.duration < 500,
    });

    sleep(1);
}

Run load test:

bash
k6 run --vus 200 --duration 10m load-test.js

What to watch during load testing:

  • Connection pool utilization (should not exceed 80%)
  • Query latency (p95, p99)
  • Error rate
  • CPU/memory usage
  • Response time degradation

Chaos Testing

Simulate failures to validate monitoring and alerting:

1. Connection pool exhaustion:

bash
# Temporarily reduce pool size in Secrets Manager (staging only!)
aws secretsmanager update-secret \
  --secret-id restartix/staging/database \
  --secret-string '{"DB_POOL_MAX":"10",...}'

# Force the staging Core API service to restart and pick up the new value
aws ecs update-service \
  --cluster restartix-staging \
  --service restartix-core-api \
  --force-new-deployment

# Run load test
k6 run --vus 50 --duration 5m load-test.js

# Verify alerts fired in Slack
# Verify /health shows degraded status

2. Database unavailability:

bash
# Break database connection (staging only!)
aws secretsmanager update-secret \
  --secret-id restartix/staging/database \
  --secret-string '{"DATABASE_URL":"postgresql://invalid:invalid@localhost/invalid",...}'

aws ecs update-service \
  --cluster restartix-staging \
  --service restartix-core-api \
  --force-new-deployment

# Verify critical alerts
# Verify graceful degradation (503 responses, not crashes)

3. Slow queries:

sql
-- Inject artificial delay (staging only!)
SELECT pg_sleep(10);

Next Steps

  1. Week 1-2: Implement critical monitoring

    • [ ] Connection pool metrics (observability/pool_metrics.go)
    • [ ] Query performance tracer (middleware/query_tracer.go)
    • [ ] Request timeout middleware (middleware/query_timeout.go)
    • [ ] Health checks with metrics (health/handler.go)
  2. Week 3-4: Set up dashboards and alerts

    • [ ] Build CloudWatch dashboards in Terraform (database, API, application)
    • [ ] Configure critical alarms with SNS → AWS Chatbot → Slack routing
    • [ ] Configure Cloudflare alerts (WAF anomaly, custom-hostname status)
    • [ ] Wire Sentry into Core API + Next.js apps for error capture
    • [ ] Test alert delivery end-to-end
  3. Week 5-6: Load testing and optimization

    • [ ] Run load tests to baseline performance
    • [ ] Identify and fix bottlenecks
    • [ ] Tune connection pool sizing
    • [ ] Validate SLO targets
  4. Ongoing:

    • [ ] Weekly dashboard reviews
    • [ ] Monthly load testing (regression detection)
    • [ ] Quarterly alert tuning (reduce noise)
    • [ ] Feature-specific metric additions as features launch

References