
Monitoring & Observability

Production-grade observability strategy for RestartiX Platform API — from single-instance AWS App Runner deployment to multi-shard enterprise infrastructure.

Why Monitoring Matters for This Architecture

RestartiX Platform uses Row-Level Security (RLS), which means every request holds a database connection for its entire duration. Unlike traditional connection pooling, where connections are borrowed briefly and returned, this design pins one connection per in-flight request, so the pool must be watched closely to prevent exhaustion.

Critical risks without monitoring:

  • Connection pool exhaustion → cascading failures
  • Slow queries → blocked connections → no capacity for new requests
  • Replication lag (Phase 2+) → stale data reads
  • Silent performance degradation → poor user experience

Critical Metrics

Database Connection Pool

The most critical metric for this architecture. Connection pool exhaustion is the primary scaling bottleneck.

| Metric | Target | Alert Threshold | Action |
|---|---|---|---|
| Connection pool wait count | 0 | > 0 for 1+ min | Immediate: investigate long-running queries, consider scaling |
| Active DB connections | < 80% of max | > 80% of max | Warning: monitor closely, plan capacity increase |
| Idle connections | > 20% of pool | < 10% of pool | Pool undersized or requests not releasing connections |
| Connection pool utilization | < 70% | > 80% | Plan for read replicas (Phase 2) or connection tuning |

Why this matters:

  • RLS requires holding a connection for the entire request
  • Pool exhaustion = all new requests fail immediately
  • No connection = no query = no response = 503 errors
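The alert logic in the table above reduces to two checks: any waiter is critical, and utilization above 80% is a warning. A minimal sketch in Go (thresholds from the table; the pgxpool wiring that supplies the counters is assumed and not shown):

```go
package main

import "fmt"

// PoolStatus classifies connection pool health using the thresholds
// from the table above.
type PoolStatus string

const (
	StatusOK       PoolStatus = "ok"
	StatusWarning  PoolStatus = "warning"  // > 80% utilization
	StatusCritical PoolStatus = "critical" // any requests waiting for a connection
)

// utilizationPct returns acquired connections as a percentage of the pool max.
func utilizationPct(acquired, maxConns int) float64 {
	if maxConns == 0 {
		return 0
	}
	return float64(acquired) / float64(maxConns) * 100
}

// classify applies the alert thresholds: waits are always critical,
// utilization above 80% is a warning.
func classify(acquired, maxConns, waitCount int) PoolStatus {
	switch {
	case waitCount > 0:
		return StatusCritical
	case utilizationPct(acquired, maxConns) > 80:
		return StatusWarning
	default:
		return StatusOK
	}
}

func main() {
	fmt.Println(classify(45, 90, 0)) // ok: 50% utilization, no waits
	fmt.Println(classify(80, 90, 0)) // warning: ~89% utilization
	fmt.Println(classify(90, 90, 3)) // critical: requests waiting
}
```

In practice these counters come from `pgxpool.Stat()` inside the pool metrics loop described below.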

Pool size by phase:

| Phase | Infrastructure | Max Connections | Reserved for App |
|---|---|---|---|
| Phase 1 | AWS RDS PostgreSQL | 100 | 90 (10 for monitoring/admin) |
| Phase 2 | AWS RDS (primary) | 200 | 180 (20 reserved) |
| Phase 2 | AWS RDS (replicas, each) | 200 | 190 (10 reserved) |
| Phase 3+ | Enterprise shard | 500 | 480 (20 reserved) |

Query Performance

| Metric | Target | Alert Threshold | Phase |
|---|---|---|---|
| Query latency (p50) | < 50ms | > 200ms | All |
| Query latency (p95) | < 200ms | > 500ms | All |
| Query latency (p99) | < 500ms | > 1s | All |
| Slow query count | 0 | > 10/min | All |
| Query timeout rate | 0% | > 0.1% | All |

Slow query definition: Any query taking > 500ms

What to investigate:

  • Missing indexes on organization_id (required for RLS)
  • N+1 queries (multiple queries in loops)
  • Full table scans on large tables
  • Complex joins without proper indexing
  • Segment rule evaluation on large patient sets

Request Performance

| Metric | Target | Alert Threshold | Phase |
|---|---|---|---|
| Request latency (p50) | < 100ms | > 300ms | All |
| Request latency (p95) | < 300ms | > 500ms | All |
| Request latency (p99) | < 500ms | > 1s | All |
| Error rate (4xx) | < 5% | > 10% | All |
| Error rate (5xx) | < 0.1% | > 1% | All |
| Timeout rate | < 0.1% | > 1% | All |

Database Health

| Metric | Target | Alert Threshold | Phase |
|---|---|---|---|
| Database size | N/A | > 80% of allocated | All |
| Replication lag | < 1s | > 5s | Phase 2+ |
| Replica health | All healthy | Any replica down | Phase 2+ |
| Transaction rate | N/A | > 10k/sec | Phase 3+ |
| Dead tuples | < 5% | > 10% | All |

Application Health

| Metric | Target | Alert Threshold | Phase |
|---|---|---|---|
| CPU usage | < 60% | > 80% | All |
| Memory usage | < 70% | > 85% | All |
| Goroutines | < 1000 | > 5000 | All |
| Heap allocations | < 500MB | > 1GB | All |
| GC pause time (p99) | < 10ms | > 50ms | All |

Feature-Specific Metrics

| Feature | Metric | Target | Alert Threshold |
|---|---|---|---|
| Segments | Rule evaluation time | < 100ms | > 500ms |
| Segments | Patient count calculation | < 200ms | > 1s |
| Webhooks | Delivery success rate | > 95% | < 90% |
| Webhooks | Delivery latency (p95) | < 5s | > 30s |
| Webhooks | Retry queue depth | < 100 | > 500 |
| Forms | PDF generation time | < 2s | > 10s |
| Documents | S3 upload time (p95) | < 1s | > 5s |
| Scheduling | Availability calculation | < 200ms | > 1s |
| Scheduling | Hold expiration accuracy | 100% | < 99.9% |
| Audit (local) | Write throughput (inserts to audit_log) | > 1000/sec | < 100/sec |
| Audit (local) | Write latency (p95) | < 5ms | > 20ms |
| Audit (forwarding) | Telemetry forwarding lag (received_at - created_at) | < 30s | > 5min |
| Audit (forwarding) | Queue depth (Redis audit queue) | < 100 | > 500 |
| Audit (forwarding) | Forward error rate | < 0.1% | > 1% |

Service Level Objectives (SLOs)

API Availability

| Tier | SLO | Monthly Downtime Allowance |
|---|---|---|
| Shared (Phase 1-2) | 99.5% | 3h 39min |
| Shared (Phase 3) | 99.9% | 43min 50sec |
| Enterprise | 99.95% | 21min 55sec |

API Latency

| Endpoint Category | p95 Target | p99 Target |
|---|---|---|
| Read (simple) | < 100ms | < 200ms |
| Read (complex) | < 300ms | < 500ms |
| Write (simple) | < 200ms | < 400ms |
| Write (complex) | < 500ms | < 1s |
| Export/Report | < 5s | < 10s |

Endpoint categories:

  • Simple read: GET single resource by ID
  • Complex read: List with filtering, joins, or segment evaluation
  • Simple write: Create/update single resource
  • Complex write: Multi-step operations (appointment booking, form submission with file uploads)
  • Export/Report: PDF generation, CSV export, analytics queries

Data Durability

| Data Type | RPO (Recovery Point Objective) | RTO (Recovery Time Objective) |
|---|---|---|
| Database (primary) | 5 minutes | 1 hour |
| Database (replica failover) | 0 (synchronous replication) | 5 minutes |
| File storage (S3) | 0 (11 9's durability) | Immediate |
| Audit logs | 0 (dual write: local + Telemetry) | Immediate |

Observability Stack

Architecture

Application Layer
├─ Go slog (structured JSON logging)
├─ pgx connection pool stats
├─ HTTP middleware metrics
└─ Custom business metrics
        │
        ▼
CloudWatch Logs
        │
        ▼
Aggregation & Analysis
├─ CloudWatch (primary, AWS-native)
├─ Datadog (alternative for advanced features)
└─ Grafana Loki + Prometheus (self-hosted option)
        │
        ▼
Dashboards + Alerts
├─ Real-time dashboards
├─ Alert routing (PagerDuty, Slack)
└─ Incident tracking

Implementation: Connection Pool Monitoring

File: internal/observability/pool_metrics.go

See immediate-actions.md for full implementation.

Key capabilities:

  • Logs pool stats every 30 seconds
  • Automatic alerts on utilization > 80%
  • Error logs when wait count > 0 (immediate attention required)
  • Exports metrics to Datadog/CloudWatch via StatsD

Usage:

go
// cmd/api/main.go
poolMetrics := observability.NewPoolMetrics(db, "primary")
poolMetrics.Start()
defer poolMetrics.Stop()

// For read replicas (Phase 2+)
for i, replica := range readReplicas {
    metrics := observability.NewPoolMetrics(replica, fmt.Sprintf("replica-%d", i+1))
    metrics.Start()
    defer metrics.Stop()
}

Implementation: Query Performance Tracing

File: internal/middleware/query_tracer.go

See immediate-actions.md for full implementation.

Key capabilities:

  • Logs all queries > 500ms with SQL preview
  • Critical alerts for queries > 5s
  • Tracks request context (request ID, path, user)
  • Integrates with pgx tracer hooks

Implementation: Request Timeout Middleware

File: internal/middleware/query_timeout.go

See immediate-actions.md for full implementation.

Timeout values:

  • Default: 30 seconds (all requests)
  • Long operations: 2 minutes (exports, PDF generation, complex reports)
  • Query-level: 5 seconds (individual queries)
  • Long queries: 30 seconds (analytics, aggregations)

Why this matters:

  • Prevents runaway queries from exhausting connection pool
  • Ensures requests fail fast instead of hanging indefinitely
  • Provides clear error messages to clients

Implementation: Health Checks

File: internal/health/handler.go

See immediate-actions.md for full implementation.

Health check endpoint: GET /health

Response format:

json
{
    "status": "healthy", // or "degraded", "unhealthy"
    "checks": {
        "postgresql": {
            "status": "healthy",
            "metrics": {
                "total_conns": 45,
                "acquired_conns": 30,
                "idle_conns": 15,
                "max_conns": 90,
                "utilization_pct": 50.0
            }
        },
        "redis": {
            "status": "healthy"
        }
    },
    "uptime_seconds": 86400
}
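The top-level "status" field can be derived from the per-dependency checks. A sketch of one plausible policy (the actual rules live in health/handler.go; this assumes PostgreSQL is the only hard dependency):

```go
package main

import "fmt"

// overallStatus aggregates per-dependency check results into the top-level
// "status" field. Policy sketch: PostgreSQL down means unhealthy (no request
// can be served without a connection); any other failing check means degraded.
func overallStatus(checks map[string]string) string {
	if checks["postgresql"] != "healthy" {
		return "unhealthy"
	}
	for _, s := range checks {
		if s != "healthy" {
			return "degraded"
		}
	}
	return "healthy"
}

func main() {
	fmt.Println(overallStatus(map[string]string{"postgresql": "healthy", "redis": "healthy"}))   // healthy
	fmt.Println(overallStatus(map[string]string{"postgresql": "healthy", "redis": "unhealthy"})) // degraded
	fmt.Println(overallStatus(map[string]string{"postgresql": "unhealthy", "redis": "healthy"})) // unhealthy
}
```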

App Runner health check configuration:

App Runner health checks are configured in the service definition:

json
{
    "healthCheckConfiguration": {
        "protocol": "HTTP",
        "path": "/health",
        "interval": 10,
        "timeout": 5,
        "healthyThreshold": 1,
        "unhealthyThreshold": 3
    }
}

Datadog Dashboards

Dashboard 1: Database Performance

Widgets:

  1. Connection Pool Utilization (time series)

    • Metric: postgres.pool.utilization_pct
    • Alert line at 80%
    • By pool name (primary, replica-1, replica-2)
  2. Connection Pool Wait Count (time series)

    • Metric: postgres.pool.wait_count
    • Alert line at 0 (any waits = problem)
  3. Active vs Idle Connections (stacked area)

    • Metrics: postgres.pool.acquired_conns, postgres.pool.idle_conns
    • Shows pool usage distribution
  4. Query Latency Distribution (heatmap)

    • Metric: query.duration_ms
    • Buckets: 0-50ms, 50-100ms, 100-200ms, 200-500ms, 500ms-1s, 1s+
  5. Slow Query Count (time series)

    • Metric: query.slow_count (queries > 500ms)
    • Alert line at 10/min
  6. Replication Lag (time series, Phase 2+)

    • Metric: postgres.replication_lag_seconds
    • Alert line at 5s
  7. Database Size Growth (line)

    • Metric: postgres.database_size_bytes
    • Projected to 80% threshold

Dashboard 2: API Performance

Widgets:

  1. Request Latency (p50, p95, p99) (multi-line time series)

    • Metric: http.request.duration_ms
    • Percentiles: 50th, 95th, 99th
    • Alert lines at targets
  2. Request Rate (time series)

    • Metric: http.request.count
    • By method (GET, POST, PATCH, DELETE)
  3. Error Rates (stacked area)

    • Metrics: http.response.4xx, http.response.5xx
    • Alert line at 1% for 5xx
  4. Endpoint Performance (top list)

    • Metric: http.request.duration_ms (p95)
    • Grouped by endpoint
    • Sorted by slowest
  5. Timeout Rate (time series)

    • Metric: http.request.timeout_count
    • Alert line at 0.1%
  6. Active Requests (gauge)

    • Metric: http.request.active
    • Shows current load

Dashboard 3: Application Health

Widgets:

  1. CPU Usage (time series)

    • Metric: system.cpu.usage_pct
    • Alert line at 80%
  2. Memory Usage (time series)

    • Metric: system.memory.usage_pct
    • Alert line at 85%
  3. Goroutine Count (time series)

    • Metric: go.goroutines
    • Alert line at 5000
  4. Heap Allocations (time series)

    • Metric: go.heap.alloc_bytes
    • Alert line at 1GB
  5. GC Pause Time (p99) (time series)

    • Metric: go.gc.pause_ns
    • Alert line at 50ms
  6. Error Log Rate (time series)

    • Metric: log.error.count
    • Shows application errors

Dashboard 4: Feature-Specific Metrics

Widgets:

  1. Segment Evaluation Performance (histogram)

    • Metric: segment.evaluation.duration_ms
    • Target: < 100ms
  2. Webhook Delivery Success Rate (gauge)

    • Metric: webhook.delivery.success_rate
    • Target: > 95%
  3. Webhook Retry Queue Depth (time series)

    • Metric: webhook.retry.queue_depth
    • Alert line at 500
  4. Form PDF Generation Time (histogram)

    • Metric: form.pdf.generation_ms
    • Target: < 2s
  5. S3 Upload Performance (time series, p95)

    • Metric: s3.upload.duration_ms
    • Target: < 1s
  6. Audit Log Write Throughput (time series)

    • Metric: audit_log.write.count
    • Shows writes/sec
  7. Scheduling Availability Calculation (histogram)

    • Metric: scheduling.availability.duration_ms
    • Target: < 200ms

CloudWatch Dashboards

CloudWatch is the primary monitoring solution for our AWS infrastructure. See AWS Infrastructure for the full setup.

Custom Metrics Namespace

RestartiX/CoreAPI

Metrics to push:

  • Connection pool stats (via CloudWatch PutMetricData API)
  • Query performance (from query tracer)
  • Request metrics (from HTTP middleware)
  • Application metrics (goroutines, memory, GC)

CloudWatch Logs Insights Queries

Find slow queries:

fields @timestamp, path, method, duration_ms, sql
| filter duration_ms > 500
| sort duration_ms desc
| limit 50

Connection pool exhaustion events:

fields @timestamp, pool, utilization_pct, wait_count
| filter wait_count > 0
| sort @timestamp desc

Error rate by endpoint:

fields @timestamp, path, status_code
| filter status_code >= 500
| stats count(*) as error_count by path
| sort error_count desc

Alert Configuration

Critical Alerts (PagerDuty)

These require immediate action, typically within 15 minutes.

1. Connection Pool Exhaustion

yaml
name: "PostgreSQL Connection Pool Exhaustion"
type: metric alert
query: "avg(last_5m):avg:postgres.pool.wait_count{env:prod} > 0"
severity: critical
notification:
    - "@pagerduty-engineering"
    - "@slack-alerts"
message: |
    CRITICAL: Connection pool experiencing waits. Requests are being delayed.

    Immediate actions:
    1. Check active connections: SELECT count(*) FROM pg_stat_activity
    2. Identify long-running queries: SELECT * FROM pg_stat_activity WHERE state != 'idle' AND query_start < now() - interval '30 seconds'
    3. Consider emergency connection pool increase or read replica routing

    Runbook: https://docs.restartix.internal/runbooks/connection-pool-exhaustion

2. Database Unreachable

yaml
name: "Database Connection Failed"
type: service check
query: "postgres.can_connect"
severity: critical
notification:
    - "@pagerduty-on-call"
    - "@slack-critical"

3. High Error Rate

yaml
name: "API Error Rate > 5%"
type: metric alert
query: "sum(last_5m):sum:http.response.5xx{env:prod}.as_count() / sum:http.response.total{env:prod}.as_count() > 0.05"
severity: critical
notification:
    - "@pagerduty-engineering"

Warning Alerts (Slack)

These require attention but not immediate action.

1. Connection Pool High Utilization

yaml
name: "Connection Pool > 80%"
type: metric alert
query: "avg(last_10m):avg:postgres.pool.utilization_pct{env:prod} > 80"
severity: warning
notification:
    - "@slack-alerts"
message: |
    WARNING: Connection pool utilization high. Monitor for potential exhaustion.

    Next steps:
    1. Review query performance dashboard for slow queries
    2. Check request latency trends
    3. Plan for scaling if trend continues (read replicas or pool size increase)

2. Slow Query Volume

yaml
name: "High Volume of Slow Queries"
type: log alert
query: 'logs("slow_query").rollup("count").last("10m") > 50'
severity: warning
notification:
    - "@slack-engineering"
message: |
    High volume of slow queries (> 500ms) detected.

    Actions:
    1. Review query tracer logs for patterns
    2. Check for missing indexes
    3. Consider query optimization or caching

3. Replication Lag (Phase 2+)

yaml
name: "Replication Lag > 5 seconds"
type: metric alert
query: "max(last_5m):max:postgres.replication_lag_seconds{env:prod} > 5"
severity: warning
notification:
    - "@slack-alerts"
message: |
    Replication lag exceeding 5 seconds. Read replicas may serve stale data.

    Actions:
    1. Check primary database load
    2. Verify network connectivity between primary and replicas
    3. Consider temporary routing of reads to primary if lag persists

Info Alerts (Slack)

1. Database Size Growth

yaml
name: "Database Approaching 80% Capacity"
type: metric alert
query: "avg(last_1h):avg:postgres.database_size_bytes{env:prod} > 800000000000" # 800GB
severity: info
notification:
    - "@slack-engineering"
message: |
    Database size approaching 80% of allocated capacity (1TB).

    Planning required:
    1. Review data retention policies
    2. Plan migration to larger instance
    3. Consider audit log partitioning/archival
    4. Evaluate if Phase 3 (sharding) is needed

AWS App Runner & CloudWatch Monitoring

AWS App Runner and CloudWatch provide built-in monitoring for production deployments.

Metrics Available

Via AWS CloudWatch Console:

  • CPU usage (%)
  • Memory usage (MB)
  • Request count and latency
  • Active instances
  • HTTP 2xx/4xx/5xx response counts

Via AWS CLI:

bash
# View logs in real-time
aws logs tail /ecs/restartix-core-api --follow

# View App Runner service metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/AppRunner \
  --metric-name CPUUtilization \
  --dimensions Name=ServiceName,Value=restartix-platform \
  --start-time $(date -d '1 hour ago' -u +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Average

# Check service status
aws apprunner describe-service --service-arn <service-arn>

CloudWatch Alerts

CloudWatch provides native alerting via CloudWatch Alarms.

Option 1: CloudWatch Alarms (recommended)

bash
# Create alarm for high error rate
aws cloudwatch put-metric-alarm \
  --alarm-name restartix-platform-5xx-errors \
  --metric-name 5xxCount \
  --namespace AWS/AppRunner \
  --statistic Sum \
  --period 300 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --alarm-actions <sns-topic-arn>

Option 2: Health check monitoring (external)

  • Use UptimeRobot or Pingdom to monitor /health endpoint
  • Configure alerts for downtime or degraded status

Option 3: Custom monitoring script

bash
#!/bin/bash
# cloudwatch-monitor.sh - Run every 5 minutes via cron

HEALTH_URL="https://api.restartix.com/health"
SLACK_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

response=$(curl -s -w "%{http_code}" -o /tmp/health.json "$HEALTH_URL")

if [ "$response" != "200" ]; then
  curl -X POST "$SLACK_WEBHOOK" -H 'Content-Type: application/json' -d '{
    "text": "Health check failed: HTTP '$response'",
    "attachments": [{
      "color": "danger",
      "text": "'"$(cat /tmp/health.json)"'"
    }]
  }'
fi

Incident Response Procedures

Runbook: Connection Pool Exhaustion

Symptoms:

  • postgres.pool.wait_count > 0
  • Requests timing out (504 Gateway Timeout)
  • Health check returning "degraded" or "unhealthy"

Investigation:

  1. Check active connections:
sql
SELECT count(*), state
FROM pg_stat_activity
WHERE datname = 'restartix_platform'
GROUP BY state;
  2. Find long-running queries:
sql
SELECT pid, usename, state, query_start, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle'
  AND query_start < now() - interval '30 seconds'
ORDER BY duration DESC;
  3. Check query wait events:
sql
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event IS NOT NULL;

Resolution:

Immediate (< 5 minutes):

  • Kill long-running queries if identified as non-critical:
    sql
    SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE pid = <PID>;
  • Temporarily increase connection pool size (if headroom available):
    bash
    # Update DB_POOL_MAX_CONNS in AWS Secrets Manager, then redeploy
    aws secretsmanager update-secret --secret-id restartix-prod/env ...
    git push origin main  # GitHub Actions handles the rest

Short-term (< 1 hour):

  • Add query timeouts if not already configured (see immediate-actions.md)
  • Optimize identified slow queries (add indexes, rewrite queries)
  • Enable read replica routing for read-heavy endpoints (Phase 2+)

Long-term (< 1 week):

  • Migrate to Phase 2 infrastructure (AWS RDS with read replicas)
  • Implement aggressive query result caching (Redis)
  • Review endpoint patterns for N+1 queries and optimize

Runbook: High Error Rate (5xx)

Symptoms:

  • http.response.5xx spike
  • User complaints about "Something went wrong" errors
  • Datadog/CloudWatch alert

Investigation:

  1. Check error logs:
bash
aws logs filter-log-events \
  --log-group-name /ecs/restartix-core-api \
  --filter-pattern "?error ?fatal ?ERROR ?FATAL" \
  --start-time $(date -d '1 hour ago' +%s000)
  2. Group errors by type:
fields @timestamp, error, path, user_id
| filter level = "ERROR"
| stats count(*) as error_count by error
| sort error_count desc
  3. Check database connectivity:
bash
curl https://api.restartix.com/health

Resolution:

Database connection errors:

  • Check RDS database status
  • Verify connection string in environment variables
  • Check for network issues (security groups, VPC config)

Application errors:

  • Review recent deployments (rollback if needed)
  • Check for panic/crash logs
  • Verify external service availability (Clerk, Daily.co, S3)

Capacity issues:

  • Check CPU/memory usage (may need vertical scaling)
  • Review goroutine count (potential goroutine leak)
  • Check connection pool status

Runbook: Slow Response Times

Symptoms:

  • http.request.duration_ms (p95/p99) elevated
  • User complaints about "slow" pages
  • Timeout alerts

Investigation:

  1. Identify slow endpoints:
fields @timestamp, path, method, duration_ms
| filter duration_ms > 1000
| stats avg(duration_ms) as avg_ms, count(*) as request_count by path
| sort avg_ms desc
  2. Check for slow queries:
fields @timestamp, sql, duration_ms
| filter duration_ms > 500
| sort duration_ms desc
| limit 20
  3. Check database performance:
  • Connection pool utilization
  • Replication lag (Phase 2+)
  • Database CPU/memory usage

Resolution:

Slow queries:

  • Add missing indexes (see database-schema.md)
  • Rewrite inefficient queries
  • Add query result caching (Redis)

High load:

  • Scale horizontally (add more Core API instances)
  • Scale database vertically (larger RDS instance)
  • Enable read replica routing (Phase 2+)

External service latency:

  • Check Daily.co API response times
  • Check S3 upload/download performance
  • Implement circuit breakers for external calls

Monitoring Best Practices

1. Structured Logging

Use Go's slog package for all logging:

go
import "log/slog"

// Good: Structured with context
slog.Info("appointment created",
    "appointment_id", appt.ID,
    "organization_id", appt.OrganizationID,
    "specialist_id", appt.SpecialistID,
    "duration_ms", time.Since(start).Milliseconds(),
)

// Bad: Unstructured string interpolation
log.Printf("Created appointment %d for org %d", appt.ID, appt.OrganizationID)

2. Request Context Propagation

Always pass request context through the call stack:

go
// Good: Context propagation
func (s *Service) Create(ctx context.Context, req *CreateRequest) (*Appointment, error) {
    conn := database.ConnFromContext(ctx)
    // ... use conn with context ...
}

// Bad: No context
func (s *Service) Create(req *CreateRequest) (*Appointment, error) {
    // How do you timeout? How do you trace?
}

3. Metric Naming Conventions

Use consistent metric naming:

<namespace>.<entity>.<metric>_<unit>

Examples:
- postgres.pool.utilization_pct
- http.request.duration_ms
- segment.evaluation.duration_ms
- webhook.delivery.success_rate

4. Alert Fatigue Prevention

Good alert characteristics:

  • Actionable (clear next step)
  • Specific (not "something is wrong")
  • Contextual (includes relevant data)
  • Rare (< 1/week for warnings, < 1/month for info)

Bad alerts:

  • "CPU usage > 50%" (too frequent, not actionable)
  • "Error occurred" (too vague)
  • "Database size growing" (without threshold context)

5. Dashboard Organization

Operational dashboards (for on-call):

  • Real-time metrics (5-second refresh)
  • Focus on SLOs and critical alerts
  • Clear visual indicators (red/yellow/green)

Strategic dashboards (for planning):

  • Longer time ranges (7-day, 30-day trends)
  • Capacity planning metrics
  • Cost analysis

Feature-specific dashboards (for developers):

  • Deep-dive into specific subsystems
  • Correlated metrics (e.g., webhook delivery + retry queue)
  • A/B test results, feature flag rollout metrics

Cost Optimization

Datadog Cost Management

Datadog pricing is based on:

  • Hosts (per instance)
  • Custom metrics (number of unique metric names)
  • Log ingestion (GB/month)

Optimization strategies:

  1. Reduce log volume:

    • Sample debug logs (e.g., 10% sampling)
    • Exclude health check logs
    • Use log patterns instead of storing every log
  2. Consolidate metrics:

    • Use tags instead of separate metrics
    • Example: http.request.duration_ms{endpoint:/appointments} instead of http.request.appointments.duration_ms
  3. Use metric rollups:

    • Keep high-resolution data for 7 days
    • Aggregate to 1-minute resolution after 7 days
    • Aggregate to 1-hour resolution after 30 days

Estimated Datadog cost by phase:

| Phase | Hosts | Custom Metrics | Logs (GB/mo) | Monthly Cost |
|---|---|---|---|---|
| Phase 1 | 2 (Core API + Telemetry) | 50 | 10 GB | $31/mo |
| Phase 2 | 2 + RDS monitoring | 100 | 50 GB | $200/mo |
| Phase 3 | 12 (shared + 10 enterprise) | 200 | 200 GB | $800/mo |
| Phase 4 | 50+ (multi-shard) | 500 | 1 TB | $3,000/mo |

Migration Between Monitoring Stacks

CloudWatch → Datadog (optional upgrade)

Pre-migration:

  • [ ] Sign up for Datadog account
  • [ ] Create API key
  • [ ] Test Datadog integration in staging

Migration:

bash
# 1. Set Datadog API key in AWS Secrets Manager
aws secretsmanager update-secret \
  --secret-id restartix-prod/env \
  --secret-string '{"DD_API_KEY":"<your-key>","DD_SITE":"datadoghq.com",...}'

# 2. Add Datadog agent as a sidecar or use statsd client in application code

# 3. Deploy and verify
git push origin main

Post-migration:

  • [ ] Create dashboards
  • [ ] Configure alerts
  • [ ] Test alert routing (Slack, PagerDuty)
  • [ ] Document runbooks

Datadog → CloudWatch (if reverting to AWS-native)

Pre-migration:

  • [ ] Create CloudWatch Log Groups
  • [ ] Configure IAM roles for metric publishing
  • [ ] Test custom metric publishing from staging

Migration:

go
// Replace Datadog client with CloudWatch SDK
import "github.com/aws/aws-sdk-go-v2/service/cloudwatch"

// Publish custom metrics
client := cloudwatch.NewFromConfig(cfg)
client.PutMetricData(ctx, &cloudwatch.PutMetricDataInput{
    Namespace: aws.String("RestartiX/CoreAPI"),
    MetricData: []types.MetricDatum{
        {
            MetricName: aws.String("ConnectionPoolUtilization"),
            Value:      aws.Float64(utilizationPct),
            Unit:       types.StandardUnitPercent,
        },
    },
})

Testing & Validation

Load Testing

Use k6 or hey to simulate production load:

javascript
// load-test.js (k6)
import http from "k6/http";
import { check, sleep } from "k6";

export let options = {
    stages: [
        { duration: "2m", target: 100 }, // Ramp up to 100 users
        { duration: "5m", target: 100 }, // Stay at 100 users
        { duration: "2m", target: 200 }, // Ramp up to 200 users
        { duration: "5m", target: 200 }, // Stay at 200 users
        { duration: "2m", target: 0 }, // Ramp down
    ],
    thresholds: {
        http_req_duration: ["p(95)<500", "p(99)<1000"], // 95% < 500ms, 99% < 1s
        http_req_failed: ["rate<0.01"], // Error rate < 1%
    },
};

export default function () {
    const res = http.get("https://api.restartix.com/v1/appointments", {
        headers: { Authorization: "Bearer " + __ENV.API_TOKEN },
    });

    check(res, {
        "status is 200": (r) => r.status === 200,
        "response time < 500ms": (r) => r.timings.duration < 500,
    });

    sleep(1);
}

Run load test:

bash
k6 run load-test.js  # the load profile comes from the stages defined in the script; CLI --vus/--duration flags would override them

What to watch during load testing:

  • Connection pool utilization (should not exceed 80%)
  • Query latency (p95, p99)
  • Error rate
  • CPU/memory usage
  • Response time degradation

Chaos Testing

Simulate failures to validate monitoring and alerting:

1. Connection pool exhaustion:

bash
# Temporarily reduce pool size in AWS Secrets Manager (staging only!)
aws secretsmanager update-secret \
  --secret-id restartix-staging/env \
  --secret-string '{"DB_POOL_MAX_CONNS":"10",...}'
git push origin main

# Run load test
k6 run --vus 50 --duration 5m load-test.js

# Verify alerts fired
# Verify health check shows degraded status

2. Database unavailability:

bash
# Break database connection (staging only!)
aws secretsmanager update-secret \
  --secret-id restartix-staging/env \
  --secret-string '{"DATABASE_URL":"postgresql://invalid:invalid@localhost/invalid",...}'
git push origin main

# Verify critical alerts
# Verify graceful degradation (503 responses, not crashes)

3. Slow queries:

sql
-- Inject artificial delay (staging only!)
SELECT pg_sleep(10);

Next Steps

  1. Week 1-2: Implement critical monitoring

    • [ ] Connection pool metrics (observability/pool_metrics.go)
    • [ ] Query performance tracer (middleware/query_tracer.go)
    • [ ] Request timeout middleware (middleware/query_timeout.go)
    • [ ] Health checks with metrics (health/handler.go)
  2. Week 3-4: Set up dashboards and alerts

    • [ ] Create Datadog account (or CloudWatch setup)
    • [ ] Build core dashboards (database, API, application)
    • [ ] Configure critical alerts (PagerDuty routing)
    • [ ] Test alert delivery
  3. Week 5-6: Load testing and optimization

    • [ ] Run load tests to baseline performance
    • [ ] Identify and fix bottlenecks
    • [ ] Tune connection pool sizing
    • [ ] Validate SLO targets
  4. Ongoing:

    • [ ] Weekly dashboard reviews
    • [ ] Monthly load testing (regression detection)
    • [ ] Quarterly alert tuning (reduce noise)
    • [ ] Feature-specific metric additions as features launch

References