Monitoring & Observability
Production-grade observability strategy for the RestartiX platform on AWS ECS Fargate + Cloudflare. The launch stack uses CloudWatch (logs + metrics + alarms) + Sentry (error tracking) + Cloudflare analytics; Datadog and similar APMs are deferred — see aws-infrastructure.md for the broader infrastructure context.
SQL is illustrative
SQL fragments in this document are examples meant to convey shape and intent — they're not authoritative reproductions of the production schema. The real migrations live in services/api/migrations/core/.
Why Monitoring Matters for This Architecture
RestartiX Platform uses Row-Level Security (RLS), which means every request holds a database connection for its entire duration. Unlike traditional connection pooling where connections are briefly borrowed and returned, our architecture requires careful monitoring to prevent connection pool exhaustion.
Critical risks without monitoring:
- Connection pool exhaustion → cascading failures
- Slow queries → blocked connections → no capacity for new requests
- Replication lag (Phase 2+) → stale data reads
- Silent performance degradation → poor user experience
Critical Metrics
Database Connection Pool
The most critical metric for this architecture. Connection pool exhaustion is the primary scaling bottleneck.
| Metric | Target | Alert Threshold | Action |
|---|---|---|---|
| Connection pool wait count | 0 | > 0 for 1+ min | Immediate: Investigate long-running queries, consider scaling |
| Active DB connections | < 80% of max | > 80% of max | Warning: Monitor closely, plan capacity increase |
| Idle connections | > 20% of pool | < 10% of pool | Pool undersized or requests not releasing connections |
| Connection pool utilization | < 70% | > 80% | Plan for read replicas (Phase 2) or connection tuning |
Why this matters:
- RLS requires holding a connection for the entire request
- Pool exhaustion = all new requests fail immediately
- No connection = no query = no response = 503 errors
Pool size by phase:
| Phase | Infrastructure | Max Connections | Reserved for App |
|---|---|---|---|
| Phase 1 | AWS RDS PostgreSQL | 100 | 90 (10 for monitoring/admin) |
| Phase 2 | AWS RDS (primary) | 200 | 180 (20 reserved) |
| Phase 2 | AWS RDS (replicas, each) | 200 | 190 (10 reserved) |
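The thresholds above can be expressed as a small classifier. The sketch below is illustrative only: `poolStatus` is a hypothetical helper, not the production `pool_metrics.go` API, and its inputs mirror what `pgxpool.Stat()` exposes (acquired and max connections; waits would be derived from the change in `EmptyAcquireCount` between samples).

```go
package main

import "fmt"

// poolStatus maps pool stats to the alert levels in the table above:
// any wait is a critical condition, > 80% utilization is a warning.
// All names here are illustrative, not the production API.
func poolStatus(acquired, max int32, waits int64) (utilizationPct float64, level string) {
	utilizationPct = float64(acquired) / float64(max) * 100
	switch {
	case waits > 0:
		level = "critical" // requests are queuing for connections
	case utilizationPct > 80:
		level = "warning"
	default:
		level = "ok"
	}
	return utilizationPct, level
}

func main() {
	pct, level := poolStatus(81, 90, 0)
	fmt.Printf("utilization=%.0f%% level=%s\n", pct, level) // utilization=90% level=warning
}
```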
Query Performance
| Metric | Target | Alert Threshold | Phase |
|---|---|---|---|
| Query latency (p50) | < 50ms | > 200ms | All |
| Query latency (p95) | < 200ms | > 500ms | All |
| Query latency (p99) | < 500ms | > 1s | All |
| Slow query count | 0 | > 10/min | All |
| Query timeout rate | 0% | > 0.1% | All |
Slow query definition: Any query taking > 500ms
What to investigate:
- Missing indexes on `organization_id` (required for RLS)
- N+1 queries (multiple queries in loops)
- Full table scans on large tables
- Complex joins without proper indexing
- Segment rule evaluation on large patient sets
Request Performance
| Metric | Target | Alert Threshold | Phase |
|---|---|---|---|
| Request latency (p50) | < 100ms | > 300ms | All |
| Request latency (p95) | < 300ms | > 500ms | All |
| Request latency (p99) | < 500ms | > 1s | All |
| Error rate (4xx) | < 5% | > 10% | All |
| Error rate (5xx) | < 0.1% | > 1% | All |
| Timeout rate | < 0.1% | > 1% | All |
Database Health
| Metric | Target | Alert Threshold | Phase |
|---|---|---|---|
| Database size | N/A | > 80% of allocated | All |
| Replication lag | < 1s | > 5s | Phase 2+ |
| Replica health | All healthy | Any replica down | Phase 2+ |
| Dead tuples | < 5% | > 10% | All |
Application Health
| Metric | Target | Alert Threshold | Phase |
|---|---|---|---|
| CPU usage | < 60% | > 80% | All |
| Memory usage | < 70% | > 85% | All |
| Goroutines | < 1000 | > 5000 | All |
| Heap allocations | < 500MB | > 1GB | All |
| GC pause time (p99) | < 10ms | > 50ms | All |
Feature-Specific Metrics
| Feature | Metric | Target | Alert Threshold |
|---|---|---|---|
| Segments | Rule evaluation time | < 100ms | > 500ms |
| Segments | Patient count calculation | < 200ms | > 1s |
| Webhooks | Delivery success rate | > 95% | < 90% |
| Webhooks | Delivery latency (p95) | < 5s | > 30s |
| Webhooks | Retry queue depth | < 100 | > 500 |
| Forms | PDF generation time | < 2s | > 10s |
| Documents | S3 upload time (p95) | < 1s | > 5s |
| Scheduling | Availability calculation | < 200ms | > 1s |
| Scheduling | Hold expiration accuracy | 100% | < 99.9% |
| Audit (local) | Write throughput (inserts to audit_log) | > 1000/sec | < 100/sec |
| Audit (local) | Write latency (p95) | < 5ms | > 20ms |
| Audit (forwarding) | Telemetry forwarding lag (received_at - created_at) | < 30s | > 5min |
| Audit (forwarding) | Queue depth (Redis audit queue) | < 100 | > 500 |
| Audit (forwarding) | Forward error rate | < 0.1% | > 1% |
Service Level Objectives (SLOs)
API Availability
| Tier | SLO | Monthly Downtime Allowance |
|---|---|---|
| All tiers | 99.5% | 3h 37min |
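The downtime allowance follows directly from the SLO. As a quick sanity check (hypothetical helper, not project code): 99.5% availability allows 216 minutes (3h36m) over a 30-day month and about 219 minutes (3h39m) over an average 30.44-day month; the table's figure sits in that range.

```go
package main

import "fmt"

// allowanceMinutes returns the monthly downtime budget implied by an
// availability SLO over a month of the given length in days.
func allowanceMinutes(slo, days float64) float64 {
	return (1 - slo) * days * 24 * 60
}

func main() {
	fmt.Printf("%.0f minutes\n", allowanceMinutes(0.995, 30)) // 216 minutes
}
```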
API Latency
| Endpoint Category | p95 Target | p99 Target |
|---|---|---|
| Read (simple) | < 100ms | < 200ms |
| Read (complex) | < 300ms | < 500ms |
| Write (simple) | < 200ms | < 400ms |
| Write (complex) | < 500ms | < 1s |
| Export/Report | < 5s | < 10s |
Endpoint categories:
- Simple read: GET single resource by ID
- Complex read: List with filtering, joins, or segment evaluation
- Simple write: Create/update single resource
- Complex write: Multi-step operations (appointment booking, form submission with file uploads)
- Export/Report: PDF generation, CSV export, analytics queries
Data Durability
| Data Type | RPO (Recovery Point Objective) | RTO (Recovery Time Objective) |
|---|---|---|
| Database (primary) | 5 minutes | 1 hour |
| Database (replica failover) | 0 (synchronous replication) | 5 minutes |
| File storage (S3) | 0 (11 9's durability) | Immediate |
| Audit logs | 0 (dual write: local + Telemetry) | Immediate |
Observability Stack
Architecture
The launch stack is CloudWatch + Sentry + Cloudflare. Datadog and similar APMs are deferred — see aws-infrastructure.md → What we don't use and why for the rationale (CloudWatch alarms + Sentry error tracking is enough until traffic and team size justify the per-host APM bill).
```
Application Layer (Core API + Telemetry API + Next.js apps)
├─ Go slog / Node logger (structured JSON)
├─ pgx connection pool stats
├─ HTTP middleware metrics (request count, duration, errors)
└─ Custom business metrics (via CloudWatch PutMetricData)
        ↓
CloudWatch (logs + metrics + alarms)
├─ Log groups per service (retention configured per env)
├─ Custom metrics namespace: RestartiX/{service}
└─ Alarms → SNS → AWS Chatbot → Slack / email
        ↓
Sentry (error tracking)
├─ Go and Next.js error capture
├─ Release tracking (image SHA = release tag)
└─ Alert routing for unhandled errors
        ↓
Cloudflare (edge observability)
├─ Request volume, cache hit ratio, geo distribution
├─ WAF block events, bot mitigation triggers
└─ Cloudflare for SaaS hostname health (custom-domain TLS status)
```

Future / deferred:
- Datadog (or similar APM) — distributed tracing, anomaly detection. Trigger to add: when CloudWatch's basic alarms stop being sufficient or when team size makes the per-engineer APM cost justifiable.
- AWS X-Ray — distributed tracing native to AWS. Cheaper than Datadog but requires SDK integration in every service.
- Grafana / Prometheus self-hosted stack — explicitly rejected; running an observability platform is operational debt for a small team.
Implementation: Connection Pool Monitoring
File: internal/observability/pool_metrics.go
See immediate-actions.md for full implementation.
Key capabilities:
- Logs pool stats every 30 seconds
- Automatic alerts on utilization > 80%
- Error logs when wait count > 0 (immediate attention required)
- Exports metrics to CloudWatch via the `cloudwatch.PutMetricData` AWS SDK call
Usage:
```go
// cmd/api/main.go
poolMetrics := observability.NewPoolMetrics(db, "primary")
poolMetrics.Start()
defer poolMetrics.Stop()

// For read replicas (Phase 2+)
for i, replica := range readReplicas {
	metrics := observability.NewPoolMetrics(replica, fmt.Sprintf("replica-%d", i+1))
	metrics.Start()
	defer metrics.Stop()
}
```

Implementation: Query Performance Tracing
File: internal/middleware/query_tracer.go
See immediate-actions.md for full implementation.
Key capabilities:
- Logs all queries > 500ms with SQL preview
- Critical alerts for queries > 5s
- Tracks request context (request ID, path, user)
- Integrates with pgx tracer hooks
Implementation: Request Timeout Middleware
File: internal/middleware/query_timeout.go
See immediate-actions.md for full implementation.
Timeout values:
- Default: 30 seconds (all requests)
- Long operations: 2 minutes (exports, PDF generation, complex reports)
- Query-level: 5 seconds (individual queries)
- Long queries: 30 seconds (analytics, aggregations)
Why this matters:
- Prevents runaway queries from exhausting connection pool
- Ensures requests fail fast instead of hanging indefinitely
- Provides clear error messages to clients
Implementation: Health Checks
File: internal/health/handler.go
See immediate-actions.md for full implementation.
Health check endpoint: GET /health
Response format:
```json
{
  "status": "healthy", // or "degraded", "unhealthy"
  "checks": {
    "postgresql": {
      "status": "healthy",
      "metrics": {
        "total_conns": 45,
        "acquired_conns": 30,
        "idle_conns": 15,
        "max_conns": 90,
        "utilization_pct": 50.0
      }
    },
    "redis": {
      "status": "healthy"
    }
  },
  "uptime_seconds": 86400
}
```

ALB target group health check configuration:
Each ECS service has its own ALB target group. Health checks are defined in Terraform on the target group:
```hcl
resource "aws_lb_target_group" "core_api" {
  name        = "restartix-core-api-${var.env}"
  port        = 9000
  protocol    = "HTTP"
  target_type = "ip"
  vpc_id      = var.vpc_id

  health_check {
    enabled             = true
    path                = "/health"
    protocol            = "HTTP"
    interval            = 10
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
    matcher             = "200"
  }
}
```

Health check is intentionally shallow — /health confirms the process is alive and the connection pool has at least one healthy connection. A deeper /health/ready endpoint that pings every dependency is available for monitoring tools (CloudWatch synthetic checks) but is not used by the ALB, because flaky dependencies would unnecessarily roll otherwise-healthy tasks.
Datadog Dashboards (deferred — for reference)
Status: deferred. The launch stack uses CloudWatch alarms + Sentry — see Observability Stack → Architecture. The dashboards documented in this section are kept as an implementation reference for the day Datadog (or a similar APM) is added — the metric names and widget shapes carry over directly. Until then, the equivalent CloudWatch dashboards in the next section are the source of truth.
Dashboard 1: Database Performance
Widgets:
1. Connection Pool Utilization (time series)
   - Metric: `postgres.pool.utilization_pct`
   - Alert line at 80%
   - By pool name (primary, replica-1, replica-2)
2. Connection Pool Wait Count (time series)
   - Metric: `postgres.pool.wait_count`
   - Alert line at 0 (any waits = problem)
3. Active vs Idle Connections (stacked area)
   - Metrics: `postgres.pool.acquired_conns`, `postgres.pool.idle_conns`
   - Shows pool usage distribution
4. Query Latency Distribution (heatmap)
   - Metric: `query.duration_ms`
   - Buckets: 0-50ms, 50-100ms, 100-200ms, 200-500ms, 500ms-1s, 1s+
5. Slow Query Count (time series)
   - Metric: `query.slow_count` (queries > 500ms)
   - Alert line at 10/min
6. Replication Lag (time series, Phase 2+)
   - Metric: `postgres.replication_lag_seconds`
   - Alert line at 5s
7. Database Size Growth (line)
   - Metric: `postgres.database_size_bytes`
   - Projected to 80% threshold
Dashboard 2: API Performance
Widgets:
1. Request Latency (p50, p95, p99) (multi-line time series)
   - Metric: `http.request.duration_ms`
   - Percentiles: 50th, 95th, 99th
   - Alert lines at targets
2. Request Rate (time series)
   - Metric: `http.request.count`
   - By method (GET, POST, PATCH, DELETE)
3. Error Rates (stacked area)
   - Metrics: `http.response.4xx`, `http.response.5xx`
   - Alert line at 1% for 5xx
4. Endpoint Performance (top list)
   - Metric: `http.request.duration_ms` (p95)
   - Grouped by endpoint
   - Sorted by slowest
5. Timeout Rate (time series)
   - Metric: `http.request.timeout_count`
   - Alert line at 0.1%
6. Active Requests (gauge)
   - Metric: `http.request.active`
   - Shows current load
Dashboard 3: Application Health
Widgets:
1. CPU Usage (time series)
   - Metric: `system.cpu.usage_pct`
   - Alert line at 80%
2. Memory Usage (time series)
   - Metric: `system.memory.usage_pct`
   - Alert line at 85%
3. Goroutine Count (time series)
   - Metric: `go.goroutines`
   - Alert line at 5000
4. Heap Allocations (time series)
   - Metric: `go.heap.alloc_bytes`
   - Alert line at 1GB
5. GC Pause Time (p99) (time series)
   - Metric: `go.gc.pause_ns`
   - Alert line at 50ms
6. Error Log Rate (time series)
   - Metric: `log.error.count`
   - Shows application errors
Dashboard 4: Feature-Specific Metrics
Widgets:
1. Segment Evaluation Performance (histogram)
   - Metric: `segment.evaluation.duration_ms`
   - Target: < 100ms
2. Webhook Delivery Success Rate (gauge)
   - Metric: `webhook.delivery.success_rate`
   - Target: > 95%
3. Webhook Retry Queue Depth (time series)
   - Metric: `webhook.retry.queue_depth`
   - Alert line at 500
4. Form PDF Generation Time (histogram)
   - Metric: `form.pdf.generation_ms`
   - Target: < 2s
5. S3 Upload Performance (time series, p95)
   - Metric: `s3.upload.duration_ms`
   - Target: < 1s
6. Audit Log Write Throughput (time series)
   - Metric: `audit_log.write.count`
   - Shows writes/sec
7. Scheduling Availability Calculation (histogram)
   - Metric: `scheduling.availability.duration_ms`
   - Target: < 200ms
CloudWatch Dashboards
CloudWatch is the primary monitoring solution for our AWS infrastructure. See AWS Infrastructure for the full setup.
Custom Metrics Namespace
RestartiX/CoreAPI
Metrics to push:
- Connection pool stats (via CloudWatch PutMetricData API)
- Query performance (from query tracer)
- Request metrics (from HTTP middleware)
- Application metrics (goroutines, memory, GC)
CloudWatch Logs Insights Queries
Find slow queries:
```
fields @timestamp, path, method, duration_ms, sql
| filter duration_ms > 500
| sort duration_ms desc
| limit 50
```

Connection pool exhaustion events:

```
fields @timestamp, pool, utilization_pct, wait_count
| filter wait_count > 0
| sort @timestamp desc
```

Error rate by endpoint:

```
fields @timestamp, path, status_code
| filter status_code >= 500
| stats count() by path
| sort count desc
```

Alert Configuration
Critical Alerts (PagerDuty)
These require immediate action, typically within 15 minutes.
1. Connection Pool Exhaustion
```yaml
name: "PostgreSQL Connection Pool Exhaustion"
type: metric alert
query: "avg(last_5m):avg:postgres.pool.wait_count{env:prod} > 0"
severity: critical
notification:
  - "@pagerduty-engineering"
  - "@slack-alerts"
message: |
  CRITICAL: Connection pool experiencing waits. Requests are being delayed.
  Immediate actions:
  1. Check active connections: SELECT count(*) FROM pg_stat_activity
  2. Identify long-running queries: SELECT * FROM pg_stat_activity WHERE state != 'idle' AND query_start < now() - interval '30 seconds'
  3. Consider emergency connection pool increase or read replica routing
  Runbook: https://docs.restartix.internal/runbooks/connection-pool-exhaustion
```

2. Database Unreachable
```yaml
name: "Database Connection Failed"
type: service check
query: "postgres.can_connect"
severity: critical
notification:
  - "@pagerduty-on-call"
  - "@slack-critical"
```

3. High Error Rate
```yaml
name: "API Error Rate > 5%"
type: metric alert
query: "sum(last_5m):sum:http.response.5xx{env:prod}.as_count() / sum:http.response.total{env:prod}.as_count() > 0.05"
severity: critical
notification:
  - "@pagerduty-engineering"
```

Warning Alerts (Slack)
These require attention but not immediate action.
1. Connection Pool High Utilization
```yaml
name: "Connection Pool > 80%"
type: metric alert
query: "avg(last_10m):avg:postgres.pool.utilization_pct{env:prod} > 80"
severity: warning
notification:
  - "@slack-alerts"
message: |
  WARNING: Connection pool utilization high. Monitor for potential exhaustion.
  Next steps:
  1. Review query performance dashboard for slow queries
  2. Check request latency trends
  3. Plan for scaling if trend continues (read replicas or pool size increase)
```

2. Slow Query Volume
```yaml
name: "High Volume of Slow Queries"
type: log alert
query: 'logs("slow_query").rollup("count").last("10m") > 50'
severity: warning
notification:
  - "@slack-engineering"
message: |
  High volume of slow queries (> 500ms) detected.
  Actions:
  1. Review query tracer logs for patterns
  2. Check for missing indexes
  3. Consider query optimization or caching
```

3. Replication Lag (Phase 2+)
```yaml
name: "Replication Lag > 5 seconds"
type: metric alert
query: "max(last_5m):max:postgres.replication_lag_seconds{env:prod} > 5"
severity: warning
notification:
  - "@slack-alerts"
message: |
  Replication lag exceeding 5 seconds. Read replicas may serve stale data.
  Actions:
  1. Check primary database load
  2. Verify network connectivity between primary and replicas
  3. Consider temporary routing of reads to primary if lag persists
```

Info Alerts (Slack)
1. Database Size Growth
```yaml
name: "Database Approaching 80% Capacity"
type: metric alert
query: "avg(last_1h):avg:postgres.database_size_bytes{env:prod} > 800000000000" # 800GB
severity: info
notification:
  - "@slack-engineering"
message: |
  Database size approaching 80% of allocated capacity (1TB).
  Planning required:
  1. Review data retention policies
  2. Plan migration to larger instance
  3. Consider audit log partitioning/archival
```

ECS Fargate & CloudWatch Monitoring
Every Fargate service ships container metrics + application logs to CloudWatch automatically. Alarms are defined in Terraform; alerts route through SNS → AWS Chatbot → Slack.
Metrics Available
Per-service ECS metrics (automatic, no SDK calls needed):
- `CPUUtilization` (%) — used by auto-scaling target tracking
- `MemoryUtilization` (%)
- Task count (running / desired / pending)
- Service deployment status (rolling update progress)
ALB metrics (per target group):
- `RequestCount` (per target group)
- `TargetResponseTime` (p50, p90, p95, p99)
- `HTTPCode_Target_2XX_Count` / `4XX_Count` / `5XX_Count`
- `UnHealthyHostCount` — used for the "service unhealthy" alarm
- `RejectedConnectionCount` — early signal of capacity exhaustion
RDS metrics:
- `DatabaseConnections` — used for the "DB connections > 80%" alarm
- `CPUUtilization`, `FreeableMemory`, `FreeStorageSpace`
- `ReadIOPS`, `WriteIOPS` — gp3 IOPS usage
- `ReplicaLag` (Phase 2+ when read replicas exist)
Custom application metrics (pushed via cloudwatch.PutMetricData):
- `RestartiX/CoreAPI/PoolUtilization`
- `RestartiX/CoreAPI/SlowQueryCount`
- `RestartiX/CoreAPI/AuditLogWriteFailures`
- Feature-specific metrics (added per F-tier feature)
Via AWS CLI:
```bash
# Tail logs in real-time
aws logs tail /ecs/restartix-core-api --follow

# View ECS service status
aws ecs describe-services \
  --cluster restartix-production \
  --services restartix-core-api

# Pull a metric
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name CPUUtilization \
  --dimensions Name=ClusterName,Value=restartix-production Name=ServiceName,Value=restartix-core-api \
  --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Average
```

CloudWatch Alarms (defined in Terraform)
Alarms live in infra/modules/monitoring and apply to both staging and prod (with thresholds parameterized per environment). Example:
```hcl
resource "aws_cloudwatch_metric_alarm" "core_api_5xx" {
  alarm_name          = "restartix-${var.env}-core-api-5xx"
  metric_name         = "HTTPCode_Target_5XX_Count"
  namespace           = "AWS/ApplicationELB"
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = var.env == "production" ? 10 : 50
  comparison_operator = "GreaterThanThreshold"
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    TargetGroup  = aws_lb_target_group.core_api.arn_suffix
    LoadBalancer = aws_lb.main.arn_suffix
  }
}
```

External health-check probes (UptimeRobot, Pingdom, BetterStack) can be added if you want a third-party "is the public URL up?" perspective — useful for catching ALB / Cloudflare-level outages that CloudWatch wouldn't see. Optional, not required at launch.
Cloudflare observability
Cloudflare's analytics dashboard is the primary view for edge traffic. The platform-relevant metrics:
| Metric | Where | What it tells you |
|---|---|---|
| Total requests, cache hit ratio | Analytics → Traffic | Overall traffic shape, CDN effectiveness |
| Bandwidth saved by cache | Analytics → Traffic | How much origin egress cost is avoided |
| Geographic distribution | Analytics → Traffic | Where traffic comes from — useful for capacity planning |
| WAF block events | Security → Events | Bot mitigation, OWASP rule triggers |
| Bot fight mode mitigations | Security → Bots | Volume of bot traffic blocked at edge |
| Custom hostname status (Cloudflare for SaaS) | SSL/TLS → Custom Hostnames | Per-clinic custom-domain TLS health |
| SSL/TLS errors | SSL/TLS → Edge Certificates | Cert renewal failures, validation errors |
For CI / scripting, Cloudflare's GraphQL Analytics API can pull these metrics:
```bash
# Example: pull yesterday's request count for the zone
curl -X POST https://api.cloudflare.com/client/v4/graphql \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "query": "query { viewer { zones(filter: {zoneTag: \"$CF_ZONE_ID\"}) { httpRequests1dGroups(limit: 1, filter: {date_geq: \"2026-05-06\"}) { sum { requests cachedRequests } } } } }"
}
EOF
```

Custom-hostname health monitoring. A scheduled ECS task (cmd/check-cf-hostnames) is on the roadmap — it queries the Cloudflare for SaaS API for the status of every custom hostname registered to clinics in the organization_domains table and surfaces validation failures back to the Console admin UI. This closes alongside the first F-tier custom-domain consumer.
Incident Response Procedures
Runbook: Connection Pool Exhaustion
Symptoms:
- `postgres.pool.wait_count` > 0
- Requests timing out (504 Gateway Timeout)
- Health check returning "degraded" or "unhealthy"
Investigation:
- Check active connections:

```sql
SELECT count(*), state
FROM pg_stat_activity
WHERE datname = 'restartix_platform'
GROUP BY state;
```

- Find long-running queries:

```sql
SELECT pid, usename, state, query_start, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle'
  AND query_start < now() - interval '30 seconds'
ORDER BY duration DESC;
```

- Check query wait events:

```sql
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event IS NOT NULL;
```

Resolution:
Immediate (< 5 minutes):

- Kill long-running queries if identified as non-critical:

```sql
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE pid = <PID>;
```

- Temporarily increase connection pool size (if headroom available):

```bash
# Update DB_POOL_MAX in restartix/production/database secret, then force a service restart
aws secretsmanager update-secret --secret-id restartix/production/database ...
aws ecs update-service \
  --cluster restartix-production \
  --service restartix-core-api \
  --force-new-deployment
```

Production deploys via the regular pipeline require manual approval (see deployment.md). For an in-incident emergency change, the operations IAM role is allowed to call `update-service` directly.
Short-term (< 1 hour):
- Add query timeouts if not already configured (see immediate-actions.md)
- Optimize identified slow queries (add indexes, rewrite queries)
- Enable read replica routing for read-heavy endpoints (Phase 2+)
Long-term (< 1 week):
- Add Phase 2 read replicas (see scaling-architecture.md → Lever 5)
- Implement aggressive query result caching (Redis cache-aside per P45)
- Review endpoint patterns for N+1 queries and optimize
Runbook: High Error Rate (5xx)
Symptoms:
- `HTTPCode_Target_5XX_Count` spike on the ALB
- User complaints about "Something went wrong" errors
- CloudWatch alarm `restartix-{env}-core-api-5xx` fired
Investigation:
- Check error logs:

```bash
aws logs filter-log-events \
  --log-group-name /ecs/restartix-core-api \
  --filter-pattern "?error ?fatal ?ERROR ?FATAL" \
  --start-time $(date -d '1 hour ago' +%s000)
```

- Group errors by type:

```
fields @timestamp, error, path, user_id
| filter level = "ERROR"
| stats count() by error
| sort count desc
```

- Check database connectivity:

```bash
curl https://api.restartix.com/health
```

Resolution:
Database connection errors:
- Check RDS database status
- Verify connection string in environment variables
- Check for network issues (security groups, VPC config)
Application errors:
- Review recent deployments (rollback if needed)
- Check for panic/crash logs
- Verify external service availability (Clerk, Daily.co, S3)
Capacity issues:
- Check CPU/memory usage (may need vertical scaling)
- Review goroutine count (potential goroutine leak)
- Check connection pool status
Runbook: Slow Response Times
Symptoms:
- `http.request.duration_ms` (p95/p99) elevated
- User complaints about "slow" pages
- Timeout alerts
Investigation:
- Identify slow endpoints:

```
fields @timestamp, path, method, duration_ms
| filter duration_ms > 1000
| stats avg(duration_ms), count() by path
| sort avg desc
```

- Check for slow queries:

```
fields @timestamp, sql, duration_ms
| filter duration_ms > 500
| sort duration_ms desc
| limit 20
```

- Check database performance:
- Connection pool utilization
- Replication lag (Phase 2+)
- Database CPU/memory usage
Resolution:
Slow queries:
- Add missing indexes (see database-schema.md)
- Rewrite inefficient queries
- Add query result caching (Redis)
High load:
- Scale horizontally (add Core API instances)
- Scale database vertically (larger RDS instance)
- Enable read replica routing (Phase 2+)
External service latency:
- Check Daily.co API response times
- Check S3 upload/download performance
- Implement circuit breakers for external calls
Monitoring Best Practices
1. Structured Logging
Use Go's slog package for all logging:
```go
import "log/slog"

// Good: Structured with context
slog.Info("appointment created",
	"appointment_id", appt.ID,
	"organization_id", appt.OrganizationID,
	"specialist_id", appt.SpecialistID,
	"duration_ms", time.Since(start).Milliseconds(),
)

// Bad: Unstructured string interpolation
log.Printf("Created appointment %d for org %d", appt.ID, appt.OrganizationID)
```

2. Request Context Propagation
Always pass request context through the call stack:
```go
// Good: Context propagation
func (s *Service) Create(ctx context.Context, req *CreateRequest) (*Appointment, error) {
	conn := database.TxFromContext(ctx)
	// ... use conn with context ...
}

// Bad: No context
func (s *Service) Create(req *CreateRequest) (*Appointment, error) {
	// How do you timeout? How do you trace?
}
```

3. Metric Naming Conventions
Use consistent metric naming:
```
<namespace>.<entity>.<metric>.<unit>
```

Examples:
- `postgres.pool.utilization_pct`
- `http.request.duration_ms`
- `segment.evaluation.duration_ms`
- `webhook.delivery.success_rate`

4. Alert Fatigue Prevention
Good alert characteristics:
- Actionable (clear next step)
- Specific (not "something is wrong")
- Contextual (includes relevant data)
- Rare (< 1/week for warnings, < 1/month for info)
Bad alerts:
- "CPU usage > 50%" (too frequent, not actionable)
- "Error occurred" (too vague)
- "Database size growing" (without threshold context)
5. Dashboard Organization
Operational dashboards (for on-call):
- Real-time metrics (5-second refresh)
- Focus on SLOs and critical alerts
- Clear visual indicators (red/yellow/green)
Strategic dashboards (for planning):
- Longer time ranges (7-day, 30-day trends)
- Capacity planning metrics
- Cost analysis
Feature-specific dashboards (for developers):
- Deep-dive into specific subsystems
- Correlated metrics (e.g., webhook delivery + retry queue)
- A/B test results, feature flag rollout metrics
Cost Optimization
Datadog Cost Management
Datadog pricing is based on:
- Hosts (per instance)
- Custom metrics (number of unique metric names)
- Log ingestion (GB/month)
Optimization strategies:
Reduce log volume:
- Sample debug logs (e.g., 10% sampling)
- Exclude health check logs
- Use log patterns instead of storing every log
Consolidate metrics:
- Use tags instead of separate metrics
- Example: `http.request.duration_ms{endpoint:/appointments}` instead of `http.request.appointments.duration_ms`
Use metric rollups:
- Keep high-resolution data for 7 days
- Aggregate to 1-minute resolution after 7 days
- Aggregate to 1-hour resolution after 30 days
Estimated Datadog cost by phase:
| Phase | Hosts | Custom Metrics | Logs (GB/mo) | Monthly Cost |
|---|---|---|---|---|
| Phase 1 | 2 (Core API + Telemetry) | 50 | 10 GB | $31/mo |
| Phase 2 | 2 + RDS monitoring | 100 | 50 GB | $200/mo |
Adding Datadog later (optional, deferred)
The launch stack uses CloudWatch + Sentry. Adding Datadog (or a comparable APM) is a future option, not a planned migration. If/when a trigger fires (typically: distributed tracing across the Core API + Telemetry API + Next.js apps becomes load-bearing for debugging, OR the team grows past the point where CloudWatch dashboards scale operationally):
Pre-add:
- [ ] Sign up for Datadog account, create API key, evaluate tier (DPA/SCC implications since Datadog is a US sub-processor)
- [ ] Test Datadog integration in staging
- [ ] Decide which signals stay in CloudWatch vs migrate to Datadog (alarms can stay AWS-native; dashboards + tracing migrate)
Add:
- Datadog agent runs as an ECS sidecar in each task definition
- API key from Secrets Manager → injected at task startup
- Existing `cloudwatch.PutMetricData` calls stay; Datadog forwarder picks them up via the AWS integration
No need to remove CloudWatch. They coexist — CloudWatch keeps the AWS-native alarms cheap and the audit trail clean; Datadog adds tracing + advanced dashboards on top.
Testing & Validation
Load Testing
Use k6 or hey to simulate production load:
```javascript
// load-test.js (k6)
import http from "k6/http";
import { check, sleep } from "k6";

export let options = {
  stages: [
    { duration: "2m", target: 100 }, // Ramp up to 100 users
    { duration: "5m", target: 100 }, // Stay at 100 users
    { duration: "2m", target: 200 }, // Ramp up to 200 users
    { duration: "5m", target: 200 }, // Stay at 200 users
    { duration: "2m", target: 0 },   // Ramp down
  ],
  thresholds: {
    http_req_duration: ["p(95)<500", "p(99)<1000"], // 95% < 500ms, 99% < 1s
    http_req_failed: ["rate<0.01"], // Error rate < 1%
  },
};

export default function () {
  const res = http.get("https://api.restartix.com/v1/appointments", {
    headers: { Authorization: "Bearer " + __ENV.API_TOKEN },
  });
  check(res, {
    "status is 200": (r) => r.status === 200,
    "response time < 500ms": (r) => r.timings.duration < 500,
  });
  sleep(1);
}
```

Run load test:

```bash
# The stages in options define the load profile; passing --vus/--duration would override them
k6 run load-test.js
```

What to watch during load testing:
- Connection pool utilization (should not exceed 80%)
- Query latency (p95, p99)
- Error rate
- CPU/memory usage
- Response time degradation
Chaos Testing
Simulate failures to validate monitoring and alerting:
1. Connection pool exhaustion:
```bash
# Temporarily reduce pool size in Secrets Manager (staging only!)
aws secretsmanager update-secret \
  --secret-id restartix/staging/database \
  --secret-string '{"DB_POOL_MAX":"10",...}'

# Force the staging Core API service to restart and pick up the new value
aws ecs update-service \
  --cluster restartix-staging \
  --service restartix-core-api \
  --force-new-deployment

# Run load test
k6 run --vus 50 --duration 5m load-test.js

# Verify alerts fired in Slack
# Verify /health shows degraded status
```

2. Database unavailability:
```bash
# Break database connection (staging only!)
aws secretsmanager update-secret \
  --secret-id restartix/staging/database \
  --secret-string '{"DATABASE_URL":"postgresql://invalid:invalid@localhost/invalid",...}'

aws ecs update-service \
  --cluster restartix-staging \
  --service restartix-core-api \
  --force-new-deployment

# Verify critical alerts
# Verify graceful degradation (503 responses, not crashes)
```

3. Slow queries:
```sql
-- Inject artificial delay (staging only!)
SELECT pg_sleep(10);
```

Next Steps
Week 1-2: Implement critical monitoring
- [ ] Connection pool metrics (observability/pool_metrics.go)
- [ ] Query performance tracer (middleware/query_tracer.go)
- [ ] Request timeout middleware (middleware/query_timeout.go)
- [ ] Health checks with metrics (health/handler.go)
Week 3-4: Set up dashboards and alerts
- [ ] Build CloudWatch dashboards in Terraform (database, API, application)
- [ ] Configure critical alarms with SNS → AWS Chatbot → Slack routing
- [ ] Configure Cloudflare alerts (WAF anomaly, custom-hostname status)
- [ ] Wire Sentry into Core API + Next.js apps for error capture
- [ ] Test alert delivery end-to-end
Week 5-6: Load testing and optimization
- [ ] Run load tests to baseline performance
- [ ] Identify and fix bottlenecks
- [ ] Tune connection pool sizing
- [ ] Validate SLO targets
Ongoing:
- [ ] Weekly dashboard reviews
- [ ] Monthly load testing (regression detection)
- [ ] Quarterly alert tuning (reduce noise)
- [ ] Feature-specific metric additions as features launch
References
- immediate-actions.md - Critical monitoring implementation
- scaling-architecture.md - Infrastructure scaling and capacity planning
- features/webhooks/index.md - Webhook delivery monitoring (planned, Layer 8)
- features/segments/index.md, features/forms/index.md, features/custom-fields/index.md - Segment evaluation, forms, custom fields (Layer 4 / 9)
- reference/rbac-permissions.md, reference/rls-policies.md, reference/encryption.md - Security monitoring requirements