Monitoring & Observability
Production-grade observability strategy for the RestartiX platform on AWS ECS Fargate + Cloudflare. The launch stack uses CloudWatch (logs + metrics + alarms) + Sentry (error tracking) + Cloudflare analytics; Datadog and similar APMs are deferred — see aws-infrastructure.md for the broader infrastructure context.
SQL is illustrative
SQL fragments in this document are examples meant to convey shape and intent — they're not authoritative reproductions of the production schema. The real migrations live in services/api/migrations/core/.
Why Monitoring Matters for This Architecture
RestartiX Platform uses Row-Level Security (RLS), which means every request holds a database connection for its entire duration. Unlike traditional connection pooling where connections are briefly borrowed and returned, our architecture requires careful monitoring to prevent connection pool exhaustion.
Critical risks without monitoring:
- Connection pool exhaustion → cascading failures
- Slow queries → blocked connections → no capacity for new requests
- Replication lag (Phase 2+) → stale data reads
- Silent performance degradation → poor user experience
Critical Metrics
Database Connection Pool
The most critical metric for this architecture. Connection pool exhaustion is the primary scaling bottleneck.
| Metric | Target | Alert Threshold | Action |
|---|---|---|---|
| Connection pool wait count | 0 | > 0 for 1+ min | Immediate: Investigate long-running queries, consider scaling |
| Active DB connections | < 80% of max | > 80% of max | Warning: Monitor closely, plan capacity increase |
| Idle connections | > 20% of pool | < 10% of pool | Pool undersized or requests not releasing connections |
| Connection pool utilization | < 70% | > 80% | Plan for read replicas (Phase 2) or connection tuning |
Why this matters:
- RLS requires holding a connection for the entire request
- Pool exhaustion = all new requests fail immediately
- No connection = no query = no response = 503 errors
Pool size by phase:
| Phase | Infrastructure | Max Connections | Reserved for App |
|---|---|---|---|
| Phase 1 | AWS RDS PostgreSQL | 100 | 90 (10 for monitoring/admin) |
| Phase 2 | AWS RDS (primary) | 200 | 180 (20 reserved) |
| Phase 2 | AWS RDS (replicas, each) | 200 | 190 (10 reserved) |
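The thresholds above can be expressed as a small classifier. The sketch below is illustrative only: `poolStatus` is a hypothetical helper, not the production `pool_metrics.go` API, and its inputs mirror what `pgxpool.Stat()` exposes (acquired and max connections; waits would be derived from the change in `EmptyAcquireCount` between samples).

```go
package main

import "fmt"

// poolStatus maps pool stats to the alert levels in the table above:
// any wait is a critical condition, > 80% utilization is a warning.
// All names here are illustrative, not the production API.
func poolStatus(acquired, max int32, waits int64) (utilizationPct float64, level string) {
	utilizationPct = float64(acquired) / float64(max) * 100
	switch {
	case waits > 0:
		level = "critical" // requests are queuing for connections
	case utilizationPct > 80:
		level = "warning"
	default:
		level = "ok"
	}
	return utilizationPct, level
}

func main() {
	pct, level := poolStatus(81, 90, 0)
	fmt.Printf("utilization=%.0f%% level=%s\n", pct, level) // utilization=90% level=warning
}
```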
Query Performance
| Metric | Target | Alert Threshold | Phase |
|---|---|---|---|
| Query latency (p50) | < 50ms | > 200ms | All |
| Query latency (p95) | < 200ms | > 500ms | All |
| Query latency (p99) | < 500ms | > 1s | All |
| Slow query count | 0 | > 10/min | All |
| Query timeout rate | 0% | > 0.1% | All |
Slow query definition: Any query taking > 500ms
What to investigate:
- Missing indexes on `organization_id` (required for RLS)
- N+1 queries (multiple queries in loops)
- Full table scans on large tables
- Complex joins without proper indexing
- Segment rule evaluation on large patient sets
Request Performance
| Metric | Target | Alert Threshold | Phase |
|---|---|---|---|
| Request latency (p50) | < 100ms | > 300ms | All |
| Request latency (p95) | < 300ms | > 500ms | All |
| Request latency (p99) | < 500ms | > 1s | All |
| Error rate (4xx) | < 5% | > 10% | All |
| Error rate (5xx) | < 0.1% | > 1% | All |
| Timeout rate | < 0.1% | > 1% | All |
Database Health
| Metric | Target | Alert Threshold | Phase |
|---|---|---|---|
| Database size | N/A | > 80% of allocated | All |
| Replication lag | < 1s | > 5s | Phase 2+ |
| Replica health | All healthy | Any replica down | Phase 2+ |
| Dead tuples | < 5% | > 10% | All |
Application Health
| Metric | Target | Alert Threshold | Phase |
|---|---|---|---|
| CPU usage | < 60% | > 80% | All |
| Memory usage | < 70% | > 85% | All |
| Goroutines | < 1000 | > 5000 | All |
| Heap allocations | < 500MB | > 1GB | All |
| GC pause time (p99) | < 10ms | > 50ms | All |
Feature-Specific Metrics
| Feature | Metric | Target | Alert Threshold |
|---|---|---|---|
| Segments | Rule evaluation time | < 100ms | > 500ms |
| Segments | Patient count calculation | < 200ms | > 1s |
| Webhooks | Delivery success rate | > 95% | < 90% |
| Webhooks | Delivery latency (p95) | < 5s | > 30s |
| Webhooks | Retry queue depth | < 100 | > 500 |
| Forms | PDF generation time | < 2s | > 10s |
| Documents | S3 upload time (p95) | < 1s | > 5s |
| Scheduling | Availability calculation | < 200ms | > 1s |
| Scheduling | Hold expiration accuracy | 100% | < 99.9% |
| Audit (local) | Write throughput (inserts to audit_log) | > 1000/sec | < 100/sec |
| Audit (local) | Write latency (p95) | < 5ms | > 20ms |
| Audit (forwarding) | Telemetry forwarding lag (received_at - created_at) | < 30s | > 5min |
| Audit (forwarding) | Queue depth (Redis audit queue) | < 100 | > 500 |
| Audit (forwarding) | Forward error rate | < 0.1% | > 1% |
Service Level Objectives (SLOs)
API Availability
| Tier | SLO | Monthly Downtime Allowance |
|---|---|---|
| All tiers | 99.5% | 3h 37min |
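The downtime allowance follows directly from the SLO. As a quick sanity check (hypothetical helper, not project code): 99.5% availability allows 216 minutes (3h36m) over a 30-day month and about 219 minutes (3h39m) over an average 30.44-day month; the table's figure sits in that range.

```go
package main

import "fmt"

// allowanceMinutes returns the monthly downtime budget implied by an
// availability SLO over a month of the given length in days.
func allowanceMinutes(slo, days float64) float64 {
	return (1 - slo) * days * 24 * 60
}

func main() {
	fmt.Printf("%.0f minutes\n", allowanceMinutes(0.995, 30)) // 216 minutes
}
```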
API Latency
| Endpoint Category | p95 Target | p99 Target |
|---|---|---|
| Read (simple) | < 100ms | < 200ms |
| Read (complex) | < 300ms | < 500ms |
| Write (simple) | < 200ms | < 400ms |
| Write (complex) | < 500ms | < 1s |
| Export/Report | < 5s | < 10s |
Endpoint categories:
- Simple read: GET single resource by ID
- Complex read: List with filtering, joins, or segment evaluation
- Simple write: Create/update single resource
- Complex write: Multi-step operations (appointment booking, form submission with file uploads)
- Export/Report: PDF generation, CSV export, analytics queries
Data Durability
| Data Type | RPO (Recovery Point Objective) | RTO (Recovery Time Objective) |
|---|---|---|
| Database (primary) | 5 minutes | 1 hour |
| Database (replica failover) | 0 (synchronous replication) | 5 minutes |
| File storage (S3) | 0 (11 9's durability) | Immediate |
| Audit logs | 0 (dual write: local + Telemetry) | Immediate |
Observability Stack
Architecture
The launch stack is CloudWatch + Sentry + Cloudflare. Datadog and similar APMs are deferred — see aws-infrastructure.md → What we don't use and why for the rationale (CloudWatch alarms + Sentry error tracking is enough until traffic and team size justify the per-host APM bill).
```
Application Layer (Core API + Telemetry API + Next.js apps)
├─ Go slog / Node logger (structured JSON)
├─ pgx connection pool stats
├─ HTTP middleware metrics (request count, duration, errors)
└─ Custom business metrics (via CloudWatch PutMetricData)
        ↓
CloudWatch (logs + metrics + alarms)
├─ Log groups per service (retention configured per env)
├─ Custom metrics namespace: RestartiX/{service}
└─ Alarms → SNS → AWS Chatbot → Slack / email
        ↓
Sentry (error tracking)
├─ Go and Next.js error capture
├─ Release tracking (image SHA = release tag)
└─ Alert routing for unhandled errors
        ↓
Cloudflare (edge observability)
├─ Request volume, cache hit ratio, geo distribution
├─ WAF block events, bot mitigation triggers
└─ Cloudflare for SaaS hostname health (custom-domain TLS status)
```

Future / deferred:
- Datadog (or similar APM) — distributed tracing, anomaly detection. Trigger to add: when CloudWatch's basic alarms stop being sufficient or when team size makes the per-engineer APM cost justifiable.
- AWS X-Ray — distributed tracing native to AWS. Cheaper than Datadog but requires SDK integration in every service.
- Grafana / Prometheus self-hosted stack — explicitly rejected; running an observability platform is operational debt for a small team.
Implementation: Connection Pool Monitoring
File: internal/observability/pool_metrics.go
See immediate-actions.md for full implementation.
Key capabilities:
- Logs pool stats every 30 seconds
- Automatic alerts on utilization > 80%
- Error logs when wait count > 0 (immediate attention required)
- Exports metrics to CloudWatch via the `cloudwatch.PutMetricData` AWS SDK call
Usage:
```go
// cmd/api/main.go
poolMetrics := observability.NewPoolMetrics(db, "primary")
poolMetrics.Start()
defer poolMetrics.Stop()

// For read replicas (Phase 2+)
for i, replica := range readReplicas {
	metrics := observability.NewPoolMetrics(replica, fmt.Sprintf("replica-%d", i+1))
	metrics.Start()
	defer metrics.Stop()
}
```

Implementation: Query Performance Tracing
File: internal/middleware/query_tracer.go
See immediate-actions.md for full implementation.
Key capabilities:
- Logs all queries > 500ms with SQL preview
- Critical alerts for queries > 5s
- Tracks request context (request ID, path, user)
- Integrates with pgx tracer hooks
Implementation: Request Timeout Middleware
File: internal/middleware/query_timeout.go
See immediate-actions.md for full implementation.
Timeout values:
- Default: 30 seconds (all requests)
- Long operations: 2 minutes (exports, PDF generation, complex reports)
- Query-level: 5 seconds (individual queries)
- Long queries: 30 seconds (analytics, aggregations)
Why this matters:
- Prevents runaway queries from exhausting connection pool
- Ensures requests fail fast instead of hanging indefinitely
- Provides clear error messages to clients
Implementation: Health Checks
File: internal/health/handler.go
See immediate-actions.md for full implementation.
Health check endpoint: GET /health
Response format:
```json
{
  "status": "healthy", // or "degraded", "unhealthy"
  "checks": {
    "postgresql": {
      "status": "healthy",
      "metrics": {
        "total_conns": 45,
        "acquired_conns": 30,
        "idle_conns": 15,
        "max_conns": 90,
        "utilization_pct": 50.0
      }
    },
    "redis": {
      "status": "healthy"
    }
  },
  "uptime_seconds": 86400
}
```

ALB target group health check configuration:
Each ECS service has its own ALB target group. Health checks are defined in Terraform on the target group:
```hcl
resource "aws_lb_target_group" "core_api" {
  name        = "restartix-core-api-${var.env}"
  port        = 9000
  protocol    = "HTTP"
  target_type = "ip"
  vpc_id      = var.vpc_id

  health_check {
    enabled             = true
    path                = "/health"
    protocol            = "HTTP"
    interval            = 10
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
    matcher             = "200"
  }
}
```

Health check is intentionally shallow — /health confirms the process is alive and the connection pool has at least one healthy connection. A deeper /health/ready endpoint that pings every dependency is available for monitoring tools (CloudWatch synthetic checks) but is not used by the ALB, because flaky dependencies would unnecessarily roll otherwise-healthy tasks.
Datadog Dashboards (deferred — for reference)
Status: deferred. The launch stack uses CloudWatch alarms + Sentry — see Observability Stack → Architecture. The dashboards documented in this section are kept as an implementation reference for the day Datadog (or a similar APM) is added — the metric names and widget shapes carry over directly. Until then, the equivalent CloudWatch dashboards in the next section are the source of truth.
Dashboard 1: Database Performance
Widgets:
1. Connection Pool Utilization (time series)
   - Metric: `postgres.pool.utilization_pct`
   - Alert line at 80%
   - By pool name (primary, replica-1, replica-2)
2. Connection Pool Wait Count (time series)
   - Metric: `postgres.pool.wait_count`
   - Alert line at 0 (any waits = problem)
3. Active vs Idle Connections (stacked area)
   - Metrics: `postgres.pool.acquired_conns`, `postgres.pool.idle_conns`
   - Shows pool usage distribution
4. Query Latency Distribution (heatmap)
   - Metric: `query.duration_ms`
   - Buckets: 0-50ms, 50-100ms, 100-200ms, 200-500ms, 500ms-1s, 1s+
5. Slow Query Count (time series)
   - Metric: `query.slow_count` (queries > 500ms)
   - Alert line at 10/min
6. Replication Lag (time series, Phase 2+)
   - Metric: `postgres.replication_lag_seconds`
   - Alert line at 5s
7. Database Size Growth (line)
   - Metric: `postgres.database_size_bytes`
   - Projected to 80% threshold
Dashboard 2: API Performance
Widgets:
1. Request Latency (p50, p95, p99) (multi-line time series)
   - Metric: `http.request.duration_ms`
   - Percentiles: 50th, 95th, 99th
   - Alert lines at targets
2. Request Rate (time series)
   - Metric: `http.request.count`
   - By method (GET, POST, PATCH, DELETE)
3. Error Rates (stacked area)
   - Metrics: `http.response.4xx`, `http.response.5xx`
   - Alert line at 1% for 5xx
4. Endpoint Performance (top list)
   - Metric: `http.request.duration_ms` (p95)
   - Grouped by endpoint
   - Sorted by slowest
5. Timeout Rate (time series)
   - Metric: `http.request.timeout_count`
   - Alert line at 0.1%
6. Active Requests (gauge)
   - Metric: `http.request.active`
   - Shows current load
Dashboard 3: Application Health
Widgets:
1. CPU Usage (time series)
   - Metric: `system.cpu.usage_pct`
   - Alert line at 80%
2. Memory Usage (time series)
   - Metric: `system.memory.usage_pct`
   - Alert line at 85%
3. Goroutine Count (time series)
   - Metric: `go.goroutines`
   - Alert line at 5000
4. Heap Allocations (time series)
   - Metric: `go.heap.alloc_bytes`
   - Alert line at 1GB
5. GC Pause Time (p99) (time series)
   - Metric: `go.gc.pause_ns`
   - Alert line at 50ms
6. Error Log Rate (time series)
   - Metric: `log.error.count`
   - Shows application errors
Dashboard 4: Feature-Specific Metrics
Widgets:
1. Segment Evaluation Performance (histogram)
   - Metric: `segment.evaluation.duration_ms`
   - Target: < 100ms
2. Webhook Delivery Success Rate (gauge)
   - Metric: `webhook.delivery.success_rate`
   - Target: > 95%
3. Webhook Retry Queue Depth (time series)
   - Metric: `webhook.retry.queue_depth`
   - Alert line at 500
4. Form PDF Generation Time (histogram)
   - Metric: `form.pdf.generation_ms`
   - Target: < 2s
5. S3 Upload Performance (time series, p95)
   - Metric: `s3.upload.duration_ms`
   - Target: < 1s
6. Audit Log Write Throughput (time series)
   - Metric: `audit_log.write.count`
   - Shows writes/sec
7. Scheduling Availability Calculation (histogram)
   - Metric: `scheduling.availability.duration_ms`
   - Target: < 200ms
CloudWatch Dashboards
CloudWatch is the primary monitoring solution for our AWS infrastructure. See AWS Infrastructure for the full setup.
Custom Metrics Namespace
RestartiX/CoreAPI
Metrics to push:
- Connection pool stats (via CloudWatch PutMetricData API)
- Query performance (from query tracer)
- Request metrics (from HTTP middleware)
- Application metrics (goroutines, memory, GC)
CloudWatch Logs Insights Queries
Find slow queries:
```
fields @timestamp, path, method, duration_ms, sql
| filter duration_ms > 500
| sort duration_ms desc
| limit 50
```

Connection pool exhaustion events:

```
fields @timestamp, pool, utilization_pct, wait_count
| filter wait_count > 0
| sort @timestamp desc
```

Error rate by endpoint:

```
fields @timestamp, path, status_code
| filter status_code >= 500
| stats count() by path
| sort count desc
```

Alert Configuration
Critical Alerts (PagerDuty)
These require immediate action, typically within 15 minutes.
1. Connection Pool Exhaustion
```yaml
name: "PostgreSQL Connection Pool Exhaustion"
type: metric alert
query: "avg(last_5m):avg:postgres.pool.wait_count{env:prod} > 0"
severity: critical
notification:
  - "@pagerduty-engineering"
  - "@slack-alerts"
message: |
  CRITICAL: Connection pool experiencing waits. Requests are being delayed.
  Immediate actions:
  1. Check active connections: SELECT count(*) FROM pg_stat_activity
  2. Identify long-running queries: SELECT * FROM pg_stat_activity WHERE state != 'idle' AND query_start < now() - interval '30 seconds'
  3. Consider emergency connection pool increase or read replica routing
  Runbook: https://docs.restartix.internal/runbooks/connection-pool-exhaustion
```

2. Database Unreachable
```yaml
name: "Database Connection Failed"
type: service check
query: "postgres.can_connect"
severity: critical
notification:
  - "@pagerduty-on-call"
  - "@slack-critical"
```

3. High Error Rate
```yaml
name: "API Error Rate > 5%"
type: metric alert
query: "sum(last_5m):sum:http.response.5xx{env:prod}.as_count() / sum:http.response.total{env:prod}.as_count() > 0.05"
severity: critical
notification:
  - "@pagerduty-engineering"
```

Warning Alerts (Slack)
These require attention but not immediate action.
1. Connection Pool High Utilization
```yaml
name: "Connection Pool > 80%"
type: metric alert
query: "avg(last_10m):avg:postgres.pool.utilization_pct{env:prod} > 80"
severity: warning
notification:
  - "@slack-alerts"
message: |
  WARNING: Connection pool utilization high. Monitor for potential exhaustion.
  Next steps:
  1. Review query performance dashboard for slow queries
  2. Check request latency trends
  3. Plan for scaling if trend continues (read replicas or pool size increase)
```

2. Slow Query Volume
```yaml
name: "High Volume of Slow Queries"
type: log alert
query: 'logs("slow_query").rollup("count").last("10m") > 50'
severity: warning
notification:
  - "@slack-engineering"
message: |
  High volume of slow queries (> 500ms) detected.
  Actions:
  1. Review query tracer logs for patterns
  2. Check for missing indexes
  3. Consider query optimization or caching
```

3. Replication Lag (Phase 2+)
```yaml
name: "Replication Lag > 5 seconds"
type: metric alert
query: "max(last_5m):max:postgres.replication_lag_seconds{env:prod} > 5"
severity: warning
notification:
  - "@slack-alerts"
message: |
  Replication lag exceeding 5 seconds. Read replicas may serve stale data.
  Actions:
  1. Check primary database load
  2. Verify network connectivity between primary and replicas
  3. Consider temporary routing of reads to primary if lag persists
```

Info Alerts (Slack)
1. Database Size Growth
```yaml
name: "Database Approaching 80% Capacity"
type: metric alert
query: "avg(last_1h):avg:postgres.database_size_bytes{env:prod} > 800000000000" # 800GB
severity: info
notification:
  - "@slack-engineering"
message: |
  Database size approaching 80% of allocated capacity (1TB).
  Planning required:
  1. Review data retention policies
  2. Plan migration to larger instance
  3. Consider audit log partitioning/archival
```

ECS Fargate & CloudWatch Monitoring
Every Fargate service ships container metrics + application logs to CloudWatch automatically. Alarms are defined in Terraform; alerts route through SNS → AWS Chatbot → Slack.
Metrics Available
Per-service ECS metrics (automatic, no SDK calls needed):
- `CPUUtilization` (%) — used by auto-scaling target tracking
- `MemoryUtilization` (%)
- Task count (running / desired / pending)
- Service deployment status (rolling update progress)
ALB metrics (per target group):
- `RequestCount` (per target group)
- `TargetResponseTime` (p50, p90, p95, p99)
- `HTTPCode_Target_2XX_Count` / `4XX_Count` / `5XX_Count`
- `UnHealthyHostCount` — used for the "service unhealthy" alarm
- `RejectedConnectionCount` — early signal of capacity exhaustion
RDS metrics:
- `DatabaseConnections` — used for the "DB connections > 80%" alarm
- `CPUUtilization`, `FreeableMemory`, `FreeStorageSpace`
- `ReadIOPS`, `WriteIOPS` — gp3 IOPS usage
- `ReplicaLag` (Phase 2+ when read replicas exist)
Custom application metrics (pushed via cloudwatch.PutMetricData):
- `RestartiX/CoreAPI/PoolUtilization`
- `RestartiX/CoreAPI/SlowQueryCount`
- `RestartiX/CoreAPI/AuditLogWriteFailures`
- Feature-specific metrics (added per F-tier feature)
Via AWS CLI:
```bash
# Tail logs in real-time
aws logs tail /ecs/restartix-core-api --follow

# View ECS service status
aws ecs describe-services \
  --cluster restartix-production \
  --services restartix-core-api

# Pull a metric
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name CPUUtilization \
  --dimensions Name=ClusterName,Value=restartix-production Name=ServiceName,Value=restartix-core-api \
  --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Average
```

CloudWatch Alarms (defined in Terraform)
Alarms live in infra/modules/monitoring and apply to both staging and prod (with thresholds parameterized per environment). Example:
```hcl
resource "aws_cloudwatch_metric_alarm" "core_api_5xx" {
  alarm_name          = "restartix-${var.env}-core-api-5xx"
  metric_name         = "HTTPCode_Target_5XX_Count"
  namespace           = "AWS/ApplicationELB"
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = var.env == "production" ? 10 : 50
  comparison_operator = "GreaterThanThreshold"
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    TargetGroup  = aws_lb_target_group.core_api.arn_suffix
    LoadBalancer = aws_lb.main.arn_suffix
  }
}
```

External health-check probes (UptimeRobot, Pingdom, BetterStack) can be added if you want a third-party "is the public URL up?" perspective — useful for catching ALB / Cloudflare-level outages that CloudWatch wouldn't see. Optional, not required at launch.
Cloudflare observability
Cloudflare's analytics dashboard is the primary view for edge traffic. The platform-relevant metrics:
| Metric | Where | What it tells you |
|---|---|---|
| Total requests, cache hit ratio | Analytics → Traffic | Overall traffic shape, CDN effectiveness |
| Bandwidth saved by cache | Analytics → Traffic | How much origin egress cost is avoided |
| Geographic distribution | Analytics → Traffic | Where traffic comes from — useful for capacity planning |
| WAF block events | Security → Events | Bot mitigation, OWASP rule triggers |
| Bot fight mode mitigations | Security → Bots | Volume of bot traffic blocked at edge |
| Custom hostname status (Cloudflare for SaaS) | SSL/TLS → Custom Hostnames | Per-clinic custom-domain TLS health |
| SSL/TLS errors | SSL/TLS → Edge Certificates | Cert renewal failures, validation errors |
For CI / scripting, Cloudflare's GraphQL Analytics API can pull these metrics:
```bash
# Example: pull yesterday's request count for the zone
curl -X POST https://api.cloudflare.com/client/v4/graphql \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "query": "query { viewer { zones(filter: {zoneTag: \"$CF_ZONE_ID\"}) { httpRequests1dGroups(limit: 1, filter: {date_geq: \"2026-05-06\"}) { sum { requests cachedRequests } } } } }"
}
EOF
```

Custom-hostname health monitoring. A scheduled ECS task (cmd/check-cf-hostnames) is on the roadmap — it queries the Cloudflare for SaaS API for the status of every custom hostname registered to clinics in the organization_domains table and surfaces validation failures back to the Console admin UI. This closes alongside the first F-tier custom-domain consumer.
Incident Response Procedures
Runbook: Connection Pool Exhaustion
Symptoms:
- `postgres.pool.wait_count` > 0
- Requests timing out (504 Gateway Timeout)
- Health check returning "degraded" or "unhealthy"
Investigation:
- Check active connections:

```sql
SELECT count(*), state
FROM pg_stat_activity
WHERE datname = 'restartix_platform'
GROUP BY state;
```

- Find long-running queries:

```sql
SELECT pid, usename, state, query_start, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle'
  AND query_start < now() - interval '30 seconds'
ORDER BY duration DESC;
```

- Check query wait events:

```sql
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event IS NOT NULL;
```

Resolution:
Immediate (< 5 minutes):

- Kill long-running queries if identified as non-critical:

```sql
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE pid = <PID>;
```

- Temporarily increase connection pool size (if headroom available):

```bash
# Update DB_POOL_MAX in restartix/production/database secret, then force a service restart
aws secretsmanager update-secret --secret-id restartix/production/database ...
aws ecs update-service \
  --cluster restartix-production \
  --service restartix-core-api \
  --force-new-deployment
```

Production deploys via the regular pipeline require manual approval (see deployment.md). For an in-incident emergency change, the operations IAM role is allowed to call `update-service` directly.
Short-term (< 1 hour):
- Add query timeouts if not already configured (see immediate-actions.md)
- Optimize identified slow queries (add indexes, rewrite queries)
- Enable read replica routing for read-heavy endpoints (Phase 2+)
Long-term (< 1 week):
- Add Phase 2 read replicas (see scaling-architecture.md → Lever 5)
- Implement aggressive query result caching (Redis cache-aside per P45)
- Review endpoint patterns for N+1 queries and optimize
Runbook: High Error Rate (5xx)
Symptoms:
- `HTTPCode_Target_5XX_Count` spike on the ALB
- User complaints about "Something went wrong" errors
- CloudWatch alarm `restartix-{env}-core-api-5xx` fired
Investigation:
- Check error logs:

```bash
aws logs filter-log-events \
  --log-group-name /ecs/restartix-core-api \
  --filter-pattern "?error ?fatal ?ERROR ?FATAL" \
  --start-time $(date -d '1 hour ago' +%s000)
```

- Group errors by type:

```
fields @timestamp, error, path, user_id
| filter level = "ERROR"
| stats count() by error
| sort count desc
```

- Check database connectivity:

```bash
curl https://api.restartix.com/health
```

Resolution:
Database connection errors:
- Check RDS database status
- Verify connection string in environment variables
- Check for network issues (security groups, VPC config)
Application errors:
- Review recent deployments (rollback if needed)
- Check for panic/crash logs
- Verify external service availability (Clerk, Daily.co, S3)
Capacity issues:
- Check CPU/memory usage (may need vertical scaling)
- Review goroutine count (potential goroutine leak)
- Check connection pool status
Runbook: Slow Response Times
Symptoms:
- `http.request.duration_ms` (p95/p99) elevated
- User complaints about "slow" pages
- Timeout alerts
Investigation:
- Identify slow endpoints:

```
fields @timestamp, path, method, duration_ms
| filter duration_ms > 1000
| stats avg(duration_ms), count() by path
| sort avg desc
```

- Check for slow queries:

```
fields @timestamp, sql, duration_ms
| filter duration_ms > 500
| sort duration_ms desc
| limit 20
```

- Check database performance:
- Connection pool utilization
- Replication lag (Phase 2+)
- Database CPU/memory usage
Resolution:
Slow queries:
- Add missing indexes (see database-schema.md)
- Rewrite inefficient queries
- Add query result caching (Redis)
High load:
- Scale horizontally (add Core API instances)
- Scale database vertically (larger RDS instance)
- Enable read replica routing (Phase 2+)
External service latency:
- Check Daily.co API response times
- Check S3 upload/download performance
- Implement circuit breakers for external calls
Monitoring Best Practices
1. Structured Logging
Use Go's slog package for all logging:
```go
import "log/slog"

// Good: Structured with context
slog.Info("appointment created",
	"appointment_id", appt.ID,
	"organization_id", appt.OrganizationID,
	"specialist_id", appt.SpecialistID,
	"duration_ms", time.Since(start).Milliseconds(),
)

// Bad: Unstructured string interpolation
log.Printf("Created appointment %d for org %d", appt.ID, appt.OrganizationID)
```

2. Request Context Propagation
Always pass request context through the call stack:
```go
// Good: Context propagation
func (s *Service) Create(ctx context.Context, req *CreateRequest) (*Appointment, error) {
	conn := database.TxFromContext(ctx)
	// ... use conn with context ...
}

// Bad: No context
func (s *Service) Create(req *CreateRequest) (*Appointment, error) {
	// How do you timeout? How do you trace?
}
```

3. Metric Naming Conventions
Use consistent metric naming:
```
<namespace>.<entity>.<metric>.<unit>
```

Examples:
- `postgres.pool.utilization_pct`
- `http.request.duration_ms`
- `segment.evaluation.duration_ms`
- `webhook.delivery.success_rate`

4. Alert Fatigue Prevention
Good alert characteristics:
- Actionable (clear next step)
- Specific (not "something is wrong")
- Contextual (includes relevant data)
- Rare (< 1/week for warnings, < 1/month for info)
Bad alerts:
- "CPU usage > 50%" (too frequent, not actionable)
- "Error occurred" (too vague)
- "Database size growing" (without threshold context)
5. Dashboard Organization
Operational dashboards (for on-call):
- Real-time metrics (5-second refresh)
- Focus on SLOs and critical alerts
- Clear visual indicators (red/yellow/green)
Strategic dashboards (for planning):
- Longer time ranges (7-day, 30-day trends)
- Capacity planning metrics
- Cost analysis
Feature-specific dashboards (for developers):
- Deep-dive into specific subsystems
- Correlated metrics (e.g., webhook delivery + retry queue)
- A/B test results, feature flag rollout metrics
Cost Optimization
Datadog Cost Management
Datadog pricing is based on:
- Hosts (per instance)
- Custom metrics (number of unique metric names)
- Log ingestion (GB/month)
Optimization strategies:
Reduce log volume:
- Sample debug logs (e.g., 10% sampling)
- Exclude health check logs
- Use log patterns instead of storing every log
Consolidate metrics:
- Use tags instead of separate metrics
- Example: `http.request.duration_ms{endpoint:/appointments}` instead of `http.request.appointments.duration_ms`
Use metric rollups:
- Keep high-resolution data for 7 days
- Aggregate to 1-minute resolution after 7 days
- Aggregate to 1-hour resolution after 30 days
Estimated Datadog cost by phase:
| Phase | Hosts | Custom Metrics | Logs (GB/mo) | Monthly Cost |
|---|---|---|---|---|
| Phase 1 | 2 (Core API + Telemetry) | 50 | 10 GB | $31/mo |
| Phase 2 | 2 + RDS monitoring | 100 | 50 GB | $200/mo |
Adding Datadog later (optional, deferred)
The launch stack uses CloudWatch + Sentry. Adding Datadog (or a comparable APM) is a future option, not a planned migration. If/when a trigger fires (typically: distributed tracing across the Core API + Telemetry API + Next.js apps becomes load-bearing for debugging, OR the team grows past the point where CloudWatch dashboards scale operationally):
Pre-add:
- [ ] Sign up for Datadog account, create API key, evaluate tier (DPA/SCC implications since Datadog is a US sub-processor)
- [ ] Test Datadog integration in staging
- [ ] Decide which signals stay in CloudWatch vs migrate to Datadog (alarms can stay AWS-native; dashboards + tracing migrate)
Add:
- Datadog agent runs as an ECS sidecar in each task definition
- API key from Secrets Manager → injected at task startup
- Existing `cloudwatch.PutMetricData` calls stay; Datadog forwarder picks them up via the AWS integration
No need to remove CloudWatch. They coexist — CloudWatch keeps the AWS-native alarms cheap and the audit trail clean; Datadog adds tracing + advanced dashboards on top.
Testing & Validation
Load Testing
Use k6 or hey to simulate production load:
```javascript
// load-test.js (k6)
import http from "k6/http";
import { check, sleep } from "k6";

export let options = {
  stages: [
    { duration: "2m", target: 100 }, // Ramp up to 100 users
    { duration: "5m", target: 100 }, // Stay at 100 users
    { duration: "2m", target: 200 }, // Ramp up to 200 users
    { duration: "5m", target: 200 }, // Stay at 200 users
    { duration: "2m", target: 0 },   // Ramp down
  ],
  thresholds: {
    http_req_duration: ["p(95)<500", "p(99)<1000"], // 95% < 500ms, 99% < 1s
    http_req_failed: ["rate<0.01"], // Error rate < 1%
  },
};

export default function () {
  const res = http.get("https://api.restartix.com/v1/appointments", {
    headers: { Authorization: "Bearer " + __ENV.API_TOKEN },
  });
  check(res, {
    "status is 200": (r) => r.status === 200,
    "response time < 500ms": (r) => r.timings.duration < 500,
  });
  sleep(1);
}
```

Run load test:

```bash
# The stages in options define the load profile; passing --vus/--duration would override them
k6 run load-test.js
```

What to watch during load testing:
- Connection pool utilization (should not exceed 80%)
- Query latency (p95, p99)
- Error rate
- CPU/memory usage
- Response time degradation
Chaos Testing
Simulate failures to validate monitoring and alerting:
1. Connection pool exhaustion:
```bash
# Temporarily reduce pool size in Secrets Manager (staging only!)
aws secretsmanager update-secret \
  --secret-id restartix/staging/database \
  --secret-string '{"DB_POOL_MAX":"10",...}'

# Force the staging Core API service to restart and pick up the new value
aws ecs update-service \
  --cluster restartix-staging \
  --service restartix-core-api \
  --force-new-deployment

# Run load test
k6 run --vus 50 --duration 5m load-test.js

# Verify alerts fired in Slack
# Verify /health shows degraded status
```

2. Database unavailability:
```bash
# Break database connection (staging only!)
aws secretsmanager update-secret \
  --secret-id restartix/staging/database \
  --secret-string '{"DATABASE_URL":"postgresql://invalid:invalid@localhost/invalid",...}'

aws ecs update-service \
  --cluster restartix-staging \
  --service restartix-core-api \
  --force-new-deployment

# Verify critical alerts
# Verify graceful degradation (503 responses, not crashes)
```

3. Slow queries:
```sql
-- Inject artificial delay (staging only!)
SELECT pg_sleep(10);
```

Next Steps
Week 1-2: Implement critical monitoring
- [ ] Connection pool metrics (observability/pool_metrics.go)
- [ ] Query performance tracer (middleware/query_tracer.go)
- [ ] Request timeout middleware (middleware/query_timeout.go)
- [ ] Health checks with metrics (health/handler.go)
Week 3-4: Set up dashboards and alerts
- [ ] Build CloudWatch dashboards in Terraform (database, API, application)
- [ ] Configure critical alarms with SNS → AWS Chatbot → Slack routing
- [ ] Configure Cloudflare alerts (WAF anomaly, custom-hostname status)
- [ ] Wire Sentry into Core API + Next.js apps for error capture
- [ ] Test alert delivery end-to-end
Week 5-6: Load testing and optimization
- [ ] Run load tests to baseline performance
- [ ] Identify and fix bottlenecks
- [ ] Tune connection pool sizing
- [ ] Validate SLO targets
Ongoing:
- [ ] Weekly dashboard reviews
- [ ] Monthly load testing (regression detection)
- [ ] Quarterly alert tuning (reduce noise)
- [ ] Feature-specific metric additions as features launch
References
- immediate-actions.md - Critical monitoring implementation
- scaling-architecture.md - Infrastructure scaling and capacity planning
- features/webhooks/index.md - Webhook delivery monitoring (planned, Layer 8)
- features/segments/index.md, features/forms/index.md, features/custom-fields/index.md - Segment evaluation, forms, custom fields (Layer 4 / 9)
- reference/rbac-permissions.md, reference/rls-policies.md, reference/encryption.md - Security monitoring requirements