Monitoring & Observability
Production-grade observability strategy for RestartiX Platform API — from single-instance AWS App Runner deployment to multi-shard enterprise infrastructure.
Why Monitoring Matters for This Architecture
RestartiX Platform uses Row-Level Security (RLS), which means every request holds a database connection for its entire duration. Unlike traditional connection pooling where connections are briefly borrowed and returned, our architecture requires careful monitoring to prevent connection pool exhaustion.
Critical risks without monitoring:
- Connection pool exhaustion → cascading failures
- Slow queries → blocked connections → no capacity for new requests
- Replication lag (Phase 2+) → stale data reads
- Silent performance degradation → poor user experience
Critical Metrics
Database Connection Pool
The most critical metric for this architecture. Connection pool exhaustion is the primary scaling bottleneck.
| Metric | Target | Alert Threshold | Action |
|---|---|---|---|
| Connection pool wait count | 0 | > 0 for 1+ min | Immediate: Investigate long-running queries, consider scaling |
| Active DB connections | < 80% of max | > 80% of max | Warning: Monitor closely, plan capacity increase |
| Idle connections | > 20% of pool | < 10% of pool | Pool undersized or requests not releasing connections |
| Connection pool utilization | < 70% | > 80% | Plan for read replicas (Phase 2) or connection tuning |
Why this matters:
- RLS requires holding a connection for the entire request
- Pool exhaustion = all new requests fail immediately
- No connection = no query = no response = 503 errors
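The thresholds above reduce to a small pure function. A minimal sketch, assuming the raw counts come from pgxpool's `Stat()` (`AcquiredConns()`, `MaxConns()`, and the empty-acquire/wait counter); the function names are illustrative, not the actual pool_metrics.go implementation:

```go
package main

import "fmt"

// poolUtilization returns acquired connections as a percentage of the pool
// maximum. With pgxpool these counts would come from pool.Stat().
func poolUtilization(acquired, max int32) float64 {
	if max == 0 {
		return 0
	}
	return float64(acquired) / float64(max) * 100
}

// poolAlertLevel applies the thresholds from the table above:
// any waiting acquire is critical, >80% utilization is a warning.
func poolAlertLevel(utilizationPct float64, waitCount int64) string {
	switch {
	case waitCount > 0:
		return "critical" // requests are queuing for a connection
	case utilizationPct > 80:
		return "warning" // plan a capacity increase
	default:
		return "ok"
	}
}

func main() {
	u := poolUtilization(45, 90)
	fmt.Printf("utilization=%.1f%% level=%s\n", u, poolAlertLevel(u, 0)) // prints "utilization=50.0% level=ok"
}
```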
Pool size by phase:
| Phase | Infrastructure | Max Connections | Reserved for App |
|---|---|---|---|
| Phase 1 | AWS RDS PostgreSQL | 100 | 90 (10 for monitoring/admin) |
| Phase 2 | AWS RDS (primary) | 200 | 180 (20 reserved) |
| Phase 2 | AWS RDS (replicas, each) | 200 | 190 (10 reserved) |
| Phase 3+ | Enterprise shard | 500 | 480 (20 reserved) |
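One subtlety worth noting: the "Reserved for App" budget is shared by every running API instance, so each instance's pool must be sized to a fraction of it. A sketch of that arithmetic (the helper name and instance counts are illustrative, not deployment config):

```go
package main

import "fmt"

// perInstancePoolSize divides the application's connection budget
// (max connections minus those reserved for monitoring/admin) across
// the running API instances, so the fleet can never exceed the
// database's connection limit even at full scale-out.
func perInstancePoolSize(maxConns, reserved, instances int) int {
	if instances < 1 {
		instances = 1
	}
	return (maxConns - reserved) / instances
}

func main() {
	// Phase 1: 100 max, 10 reserved, 2 App Runner instances
	fmt.Println(perInstancePoolSize(100, 10, 2)) // prints 45
}
```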
Query Performance
| Metric | Target | Alert Threshold | Phase |
|---|---|---|---|
| Query latency (p50) | < 50ms | > 200ms | All |
| Query latency (p95) | < 200ms | > 500ms | All |
| Query latency (p99) | < 500ms | > 1s | All |
| Slow query count | 0 | > 10/min | All |
| Query timeout rate | 0% | > 0.1% | All |
Slow query definition: Any query taking > 500ms
What to investigate:
- Missing indexes on `organization_id` (required for RLS)
- N+1 queries (multiple queries in loops)
- Full table scans on large tables
- Complex joins without proper indexing
- Segment rule evaluation on large patient sets
Request Performance
| Metric | Target | Alert Threshold | Phase |
|---|---|---|---|
| Request latency (p50) | < 100ms | > 300ms | All |
| Request latency (p95) | < 300ms | > 500ms | All |
| Request latency (p99) | < 500ms | > 1s | All |
| Error rate (4xx) | < 5% | > 10% | All |
| Error rate (5xx) | < 0.1% | > 1% | All |
| Timeout rate | < 0.1% | > 1% | All |
Database Health
| Metric | Target | Alert Threshold | Phase |
|---|---|---|---|
| Database size | N/A | > 80% of allocated | All |
| Replication lag | < 1s | > 5s | Phase 2+ |
| Replica health | All healthy | Any replica down | Phase 2+ |
| Transaction rate | N/A | > 10k/sec | Phase 3+ |
| Dead tuples | < 5% | > 10% | All |
Application Health
| Metric | Target | Alert Threshold | Phase |
|---|---|---|---|
| CPU usage | < 60% | > 80% | All |
| Memory usage | < 70% | > 85% | All |
| Goroutines | < 1000 | > 5000 | All |
| Heap allocations | < 500MB | > 1GB | All |
| GC pause time (p99) | < 10ms | > 50ms | All |
Feature-Specific Metrics
| Feature | Metric | Target | Alert Threshold |
|---|---|---|---|
| Segments | Rule evaluation time | < 100ms | > 500ms |
| Segments | Patient count calculation | < 200ms | > 1s |
| Webhooks | Delivery success rate | > 95% | < 90% |
| Webhooks | Delivery latency (p95) | < 5s | > 30s |
| Webhooks | Retry queue depth | < 100 | > 500 |
| Forms | PDF generation time | < 2s | > 10s |
| Documents | S3 upload time (p95) | < 1s | > 5s |
| Scheduling | Availability calculation | < 200ms | > 1s |
| Scheduling | Hold expiration accuracy | 100% | < 99.9% |
| Audit (local) | Write throughput (inserts to audit_log) | > 1000/sec | < 100/sec |
| Audit (local) | Write latency (p95) | < 5ms | > 20ms |
| Audit (forwarding) | Telemetry forwarding lag (received_at - created_at) | < 30s | > 5min |
| Audit (forwarding) | Queue depth (Redis audit queue) | < 100 | > 500 |
| Audit (forwarding) | Forward error rate | < 0.1% | > 1% |
Service Level Objectives (SLOs)
API Availability
| Tier | SLO | Monthly Downtime Allowance |
|---|---|---|
| Shared (Phase 1-2) | 99.5% | 3h 39min |
| Shared (Phase 3) | 99.9% | 43min 50sec |
| Enterprise | 99.95% | 21min 55sec |
API Latency
| Endpoint Category | p95 Target | p99 Target |
|---|---|---|
| Read (simple) | < 100ms | < 200ms |
| Read (complex) | < 300ms | < 500ms |
| Write (simple) | < 200ms | < 400ms |
| Write (complex) | < 500ms | < 1s |
| Export/Report | < 5s | < 10s |
Endpoint categories:
- Simple read: GET single resource by ID
- Complex read: List with filtering, joins, or segment evaluation
- Simple write: Create/update single resource
- Complex write: Multi-step operations (appointment booking, form submission with file uploads)
- Export/Report: PDF generation, CSV export, analytics queries
Data Durability
| Data Type | RPO (Recovery Point Objective) | RTO (Recovery Time Objective) |
|---|---|---|
| Database (primary) | 5 minutes | 1 hour |
| Database (replica failover) | 0 (synchronous replication) | 5 minutes |
| File storage (S3) | 0 (11 9's durability) | Immediate |
| Audit logs | 0 (dual write: local + Telemetry) | Immediate |
Observability Stack
Architecture
Application Layer
├─ Go slog (structured JSON logging)
├─ pgx connection pool stats
├─ HTTP middleware metrics
└─ Custom business metrics
↓
CloudWatch Logs
↓
Aggregation & Analysis
├─ CloudWatch (primary, AWS-native)
├─ Datadog (alternative for advanced features)
└─ Grafana Loki + Prometheus (self-hosted option)
↓
Dashboards + Alerts
├─ Real-time dashboards
├─ Alert routing (PagerDuty, Slack)
└─ Incident tracking
Implementation: Connection Pool Monitoring
File: internal/observability/pool_metrics.go
See immediate-actions.md for full implementation.
Key capabilities:
- Logs pool stats every 30 seconds
- Automatic alerts on utilization > 80%
- Error logs when wait count > 0 (immediate attention required)
- Exports metrics to Datadog/CloudWatch via StatsD
Usage:
// cmd/api/main.go
poolMetrics := observability.NewPoolMetrics(db, "primary")
poolMetrics.Start()
defer poolMetrics.Stop()
// For read replicas (Phase 2+)
for i, replica := range readReplicas {
metrics := observability.NewPoolMetrics(replica, fmt.Sprintf("replica-%d", i+1))
metrics.Start()
defer metrics.Stop()
}
Implementation: Query Performance Tracing
File: internal/middleware/query_tracer.go
See immediate-actions.md for full implementation.
Key capabilities:
- Logs all queries > 500ms with SQL preview
- Critical alerts for queries > 5s
- Tracks request context (request ID, path, user)
- Integrates with pgx tracer hooks
Implementation: Request Timeout Middleware
File: internal/middleware/query_timeout.go
See immediate-actions.md for full implementation.
Timeout values:
- Default: 30 seconds (all requests)
- Long operations: 2 minutes (exports, PDF generation, complex reports)
- Query-level: 5 seconds (individual queries)
- Long queries: 30 seconds (analytics, aggregations)
Why this matters:
- Prevents runaway queries from exhausting connection pool
- Ensures requests fail fast instead of hanging indefinitely
- Provides clear error messages to clients
Implementation: Health Checks
File: internal/health/handler.go
See immediate-actions.md for full implementation.
Health check endpoint: GET /health
Response format:
{
"status": "healthy", // or "degraded", "unhealthy"
"checks": {
"postgresql": {
"status": "healthy",
"metrics": {
"total_conns": 45,
"acquired_conns": 30,
"idle_conns": 15,
"max_conns": 90,
"utilization_pct": 50.0
}
},
"redis": {
"status": "healthy"
}
},
"uptime_seconds": 86400
}
App Runner health check configuration:
App Runner health checks are configured in the service definition:
{
"healthCheckConfiguration": {
"protocol": "HTTP",
"path": "/health",
"interval": 10,
"timeout": 5,
"healthyThreshold": 1,
"unhealthyThreshold": 3
}
}
Datadog Dashboards
Dashboard 1: Database Performance
Widgets:
1. Connection Pool Utilization (time series)
   - Metric: `postgres.pool.utilization_pct`
   - Alert line at 80%
   - By pool name (primary, replica-1, replica-2)
2. Connection Pool Wait Count (time series)
   - Metric: `postgres.pool.wait_count`
   - Alert line at 0 (any waits = problem)
3. Active vs Idle Connections (stacked area)
   - Metrics: `postgres.pool.acquired_conns`, `postgres.pool.idle_conns`
   - Shows pool usage distribution
4. Query Latency Distribution (heatmap)
   - Metric: `query.duration_ms`
   - Buckets: 0-50ms, 50-100ms, 100-200ms, 200-500ms, 500ms-1s, 1s+
5. Slow Query Count (time series)
   - Metric: `query.slow_count` (queries > 500ms)
   - Alert line at 10/min
6. Replication Lag (time series, Phase 2+)
   - Metric: `postgres.replication_lag_seconds`
   - Alert line at 5s
7. Database Size Growth (line)
   - Metric: `postgres.database_size_bytes`
   - Projected to 80% threshold
Dashboard 2: API Performance
Widgets:
1. Request Latency (p50, p95, p99) (multi-line time series)
   - Metric: `http.request.duration_ms`
   - Percentiles: 50th, 95th, 99th
   - Alert lines at targets
2. Request Rate (time series)
   - Metric: `http.request.count`
   - By method (GET, POST, PATCH, DELETE)
3. Error Rates (stacked area)
   - Metrics: `http.response.4xx`, `http.response.5xx`
   - Alert line at 1% for 5xx
4. Endpoint Performance (top list)
   - Metric: `http.request.duration_ms` (p95)
   - Grouped by endpoint
   - Sorted by slowest
5. Timeout Rate (time series)
   - Metric: `http.request.timeout_count`
   - Alert line at 0.1%
6. Active Requests (gauge)
   - Metric: `http.request.active`
   - Shows current load
Dashboard 3: Application Health
Widgets:
1. CPU Usage (time series)
   - Metric: `system.cpu.usage_pct`
   - Alert line at 80%
2. Memory Usage (time series)
   - Metric: `system.memory.usage_pct`
   - Alert line at 85%
3. Goroutine Count (time series)
   - Metric: `go.goroutines`
   - Alert line at 5000
4. Heap Allocations (time series)
   - Metric: `go.heap.alloc_bytes`
   - Alert line at 1GB
5. GC Pause Time (p99) (time series)
   - Metric: `go.gc.pause_ns`
   - Alert line at 50ms
6. Error Log Rate (time series)
   - Metric: `log.error.count`
   - Shows application errors
Dashboard 4: Feature-Specific Metrics
Widgets:
1. Segment Evaluation Performance (histogram)
   - Metric: `segment.evaluation.duration_ms`
   - Target: < 100ms
2. Webhook Delivery Success Rate (gauge)
   - Metric: `webhook.delivery.success_rate`
   - Target: > 95%
3. Webhook Retry Queue Depth (time series)
   - Metric: `webhook.retry.queue_depth`
   - Alert line at 500
4. Form PDF Generation Time (histogram)
   - Metric: `form.pdf.generation_ms`
   - Target: < 2s
5. S3 Upload Performance (time series, p95)
   - Metric: `s3.upload.duration_ms`
   - Target: < 1s
6. Audit Log Write Throughput (time series)
   - Metric: `audit_log.write.count`
   - Shows writes/sec
7. Scheduling Availability Calculation (histogram)
   - Metric: `scheduling.availability.duration_ms`
   - Target: < 200ms
CloudWatch Dashboards
CloudWatch is the primary monitoring solution for our AWS infrastructure. See AWS Infrastructure for the full setup.
Custom Metrics Namespace
RestartiX/CoreAPI
Metrics to push:
- Connection pool stats (via CloudWatch PutMetricData API)
- Query performance (from query tracer)
- Request metrics (from HTTP middleware)
- Application metrics (goroutines, memory, GC)
CloudWatch Logs Insights Queries
Find slow queries:
fields @timestamp, path, method, duration_ms, sql
| filter duration_ms > 500
| sort duration_ms desc
| limit 50
Connection pool exhaustion events:
fields @timestamp, pool, utilization_pct, wait_count
| filter wait_count > 0
| sort @timestamp desc
Error rate by endpoint:
fields @timestamp, path, status_code
| filter status_code >= 500
| stats count() by path
| sort count desc
Alert Configuration
Critical Alerts (PagerDuty)
These require immediate action, typically within 15 minutes.
1. Connection Pool Exhaustion
name: "PostgreSQL Connection Pool Exhaustion"
type: metric alert
query: "avg(last_5m):avg:postgres.pool.wait_count{env:prod} > 0"
severity: critical
notification:
- "@pagerduty-engineering"
- "@slack-alerts"
message: |
CRITICAL: Connection pool experiencing waits. Requests are being delayed.
Immediate actions:
1. Check active connections: SELECT count(*) FROM pg_stat_activity
2. Identify long-running queries: SELECT * FROM pg_stat_activity WHERE state != 'idle' AND query_start < now() - interval '30 seconds'
3. Consider emergency connection pool increase or read replica routing
Runbook: https://docs.restartix.internal/runbooks/connection-pool-exhaustion
2. Database Unreachable
name: "Database Connection Failed"
type: service check
query: "postgres.can_connect"
severity: critical
notification:
- "@pagerduty-on-call"
- "@slack-critical"
3. High Error Rate
name: "API Error Rate > 5%"
type: metric alert
query: "sum(last_5m):sum:http.response.5xx{env:prod}.as_count() / sum:http.response.total{env:prod}.as_count() > 0.05"
severity: critical
notification:
- "@pagerduty-engineering"
Warning Alerts (Slack)
These require attention but not immediate action.
1. Connection Pool High Utilization
name: "Connection Pool > 80%"
type: metric alert
query: "avg(last_10m):avg:postgres.pool.utilization_pct{env:prod} > 80"
severity: warning
notification:
- "@slack-alerts"
message: |
WARNING: Connection pool utilization high. Monitor for potential exhaustion.
Next steps:
1. Review query performance dashboard for slow queries
2. Check request latency trends
3. Plan for scaling if trend continues (read replicas or pool size increase)
2. Slow Query Volume
name: "High Volume of Slow Queries"
type: log alert
query: 'logs("slow_query").rollup("count").last("10m") > 50'
severity: warning
notification:
- "@slack-engineering"
message: |
High volume of slow queries (> 500ms) detected.
Actions:
1. Review query tracer logs for patterns
2. Check for missing indexes
3. Consider query optimization or caching
3. Replication Lag (Phase 2+)
name: "Replication Lag > 5 seconds"
type: metric alert
query: "max(last_5m):max:postgres.replication_lag_seconds{env:prod} > 5"
severity: warning
notification:
- "@slack-alerts"
message: |
Replication lag exceeding 5 seconds. Read replicas may serve stale data.
Actions:
1. Check primary database load
2. Verify network connectivity between primary and replicas
3. Consider temporary routing of reads to primary if lag persists
Info Alerts (Slack)
1. Database Size Growth
name: "Database Approaching 80% Capacity"
type: metric alert
query: "avg(last_1h):avg:postgres.database_size_bytes{env:prod} > 800000000000" # 800GB
severity: info
notification:
- "@slack-engineering"
message: |
Database size approaching 80% of allocated capacity (1TB).
Planning required:
1. Review data retention policies
2. Plan migration to larger instance
3. Consider audit log partitioning/archival
4. Evaluate if Phase 3 (sharding) is needed
AWS App Runner & CloudWatch Monitoring
AWS App Runner and CloudWatch provide built-in monitoring for production deployments.
Metrics Available
Via AWS CloudWatch Console:
- CPU usage (%)
- Memory usage (MB)
- Request count and latency
- Active instances
- HTTP 2xx/4xx/5xx response counts
Via AWS CLI:
# View logs in real-time
aws logs tail /ecs/restartix-core-api --follow
# View App Runner service metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/AppRunner \
--metric-name CPUUtilization \
--dimensions Name=ServiceName,Value=restartix-platform \
--start-time $(date -d '1 hour ago' -u +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average
# Check service status
aws apprunner describe-service --service-arn <service-arn>
CloudWatch Alerts
CloudWatch provides native alerting via CloudWatch Alarms.
Option 1: CloudWatch Alarms (recommended)
# Create alarm for high error rate
aws cloudwatch put-metric-alarm \
--alarm-name restartix-platform-5xx-errors \
--metric-name 5xxCount \
--namespace AWS/AppRunner \
--statistic Sum \
--period 300 \
--threshold 10 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 1 \
--alarm-actions <sns-topic-arn>
Option 2: Health check monitoring (external)
- Use UptimeRobot or Pingdom to monitor the `/health` endpoint
- Configure alerts for downtime or degraded status
Option 3: Custom monitoring script
#!/bin/bash
# cloudwatch-monitor.sh - Run every 5 minutes via cron
HEALTH_URL="https://api.restartix.com/health"
SLACK_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
response=$(curl -s -w "%{http_code}" -o /tmp/health.json $HEALTH_URL)
if [ "$response" != "200" ]; then
curl -X POST $SLACK_WEBHOOK -H 'Content-Type: application/json' -d '{
"text": "Health check failed: HTTP '$response'",
"attachments": [{
"color": "danger",
"text": "'"$(cat /tmp/health.json)"'"
}]
}'
fi
Incident Response Procedures
Runbook: Connection Pool Exhaustion
Symptoms:
- `postgres.pool.wait_count` > 0
- Requests timing out (504 Gateway Timeout)
- Health check returning "degraded" or "unhealthy"
Investigation:
- Check active connections:
SELECT count(*), state
FROM pg_stat_activity
WHERE datname = 'restartix_platform'
GROUP BY state;
- Find long-running queries:
SELECT pid, usename, state, query_start, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle'
AND query_start < now() - interval '30 seconds'
ORDER BY duration DESC;
- Check query wait events:
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event IS NOT NULL;
Resolution:
Immediate (< 5 minutes):
- Kill long-running queries if identified as non-critical:
  SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE pid = <PID>;
- Temporarily increase connection pool size (if headroom available):
  # Update DB_POOL_MAX_CONNS in AWS Secrets Manager, then redeploy
  aws secretsmanager update-secret --secret-id restartix-prod/env ...
  git push origin main  # GitHub Actions handles the rest
Short-term (< 1 hour):
- Add query timeouts if not already configured (see immediate-actions.md)
- Optimize identified slow queries (add indexes, rewrite queries)
- Enable read replica routing for read-heavy endpoints (Phase 2+)
Long-term (< 1 week):
- Migrate to Phase 2 infrastructure (AWS RDS with read replicas)
- Implement aggressive query result caching (Redis)
- Review endpoint patterns for N+1 queries and optimize
Runbook: High Error Rate (5xx)
Symptoms:
- `http.response.5xx` spike
- User complaints about "Something went wrong" errors
- Datadog/CloudWatch alert
Investigation:
- Check error logs:
aws logs filter-log-events \
--log-group-name /ecs/restartix-core-api \
--filter-pattern "?error ?fatal ?ERROR ?FATAL" \
--start-time $(date -d '1 hour ago' +%s000)
- Group errors by type:
fields @timestamp, error, path, user_id
| filter level = "ERROR"
| stats count() by error
| sort count desc
- Check database connectivity:
curl https://api.restartix.com/health
Resolution:
Database connection errors:
- Check RDS database status
- Verify connection string in environment variables
- Check for network issues (security groups, VPC config)
Application errors:
- Review recent deployments (rollback if needed)
- Check for panic/crash logs
- Verify external service availability (Clerk, Daily.co, S3)
Capacity issues:
- Check CPU/memory usage (may need vertical scaling)
- Review goroutine count (potential goroutine leak)
- Check connection pool status
Runbook: Slow Response Times
Symptoms:
- `http.request.duration_ms` (p95/p99) elevated
- User complaints about "slow" pages
- Timeout alerts
Investigation:
- Identify slow endpoints:
fields @timestamp, path, method, duration_ms
| filter duration_ms > 1000
| stats avg(duration_ms), count() by path
| sort avg desc
- Check for slow queries:
fields @timestamp, sql, duration_ms
| filter duration_ms > 500
| sort duration_ms desc
| limit 20
- Check database performance:
- Connection pool utilization
- Replication lag (Phase 2+)
- Database CPU/memory usage
Resolution:
Slow queries:
- Add missing indexes (see database-schema.md)
- Rewrite inefficient queries
- Add query result caching (Redis)
High load:
- Scale horizontally (add Core API instances)
- Scale database vertically (larger RDS instance)
- Enable read replica routing (Phase 2+)
External service latency:
- Check Daily.co API response times
- Check S3 upload/download performance
- Implement circuit breakers for external calls
Monitoring Best Practices
1. Structured Logging
Use Go's slog package for all logging:
import "log/slog"
// Good: Structured with context
slog.Info("appointment created",
"appointment_id", appt.ID,
"organization_id", appt.OrganizationID,
"specialist_id", appt.SpecialistID,
"duration_ms", time.Since(start).Milliseconds(),
)
// Bad: Unstructured string interpolation
log.Printf("Created appointment %d for org %d", appt.ID, appt.OrganizationID)
2. Request Context Propagation
Always pass request context through the call stack:
// Good: Context propagation
func (s *Service) Create(ctx context.Context, req *CreateRequest) (*Appointment, error) {
conn := database.ConnFromContext(ctx)
// ... use conn with context ...
}
// Bad: No context
func (s *Service) Create(req *CreateRequest) (*Appointment, error) {
// How do you timeout? How do you trace?
}
3. Metric Naming Conventions
Use consistent metric naming:
<namespace>.<entity>.<metric>.<unit>
Examples:
- postgres.pool.utilization_pct
- http.request.duration_ms
- segment.evaluation.duration_ms
- webhook.delivery.success_rate
4. Alert Fatigue Prevention
Good alert characteristics:
- Actionable (clear next step)
- Specific (not "something is wrong")
- Contextual (includes relevant data)
- Rare (< 1/week for warnings, < 1/month for info)
Bad alerts:
- "CPU usage > 50%" (too frequent, not actionable)
- "Error occurred" (too vague)
- "Database size growing" (without threshold context)
5. Dashboard Organization
Operational dashboards (for on-call):
- Real-time metrics (5-second refresh)
- Focus on SLOs and critical alerts
- Clear visual indicators (red/yellow/green)
Strategic dashboards (for planning):
- Longer time ranges (7-day, 30-day trends)
- Capacity planning metrics
- Cost analysis
Feature-specific dashboards (for developers):
- Deep-dive into specific subsystems
- Correlated metrics (e.g., webhook delivery + retry queue)
- A/B test results, feature flag rollout metrics
Cost Optimization
Datadog Cost Management
Datadog pricing is based on:
- Hosts (per instance)
- Custom metrics (number of unique metric names)
- Log ingestion (GB/month)
Optimization strategies:
Reduce log volume:
- Sample debug logs (e.g., 10% sampling)
- Exclude health check logs
- Use log patterns instead of storing every log
Consolidate metrics:
- Use tags instead of separate metrics
- Example: `http.request.duration_ms{endpoint:/appointments}` instead of `http.request.appointments.duration_ms`
Use metric rollups:
- Keep high-resolution data for 7 days
- Aggregate to 1-minute resolution after 7 days
- Aggregate to 1-hour resolution after 30 days
Estimated Datadog cost by phase:
| Phase | Hosts | Custom Metrics | Logs (GB/mo) | Monthly Cost |
|---|---|---|---|---|
| Phase 1 | 2 (Core API + Telemetry) | 50 | 10 GB | $31/mo |
| Phase 2 | 2 + RDS monitoring | 100 | 50 GB | $200/mo |
| Phase 3 | 12 (shared + 10 enterprise) | 200 | 200 GB | $800/mo |
| Phase 4 | 50+ (multi-shard) | 500 | 1 TB | $3,000/mo |
Migration Between Monitoring Stacks
CloudWatch → Datadog (optional upgrade)
Pre-migration:
- [ ] Sign up for Datadog account
- [ ] Create API key
- [ ] Test Datadog integration in staging
Migration:
# 1. Set Datadog API key in AWS Secrets Manager
aws secretsmanager update-secret \
--secret-id restartix-prod/env \
--secret-string '{"DD_API_KEY":"<your-key>","DD_SITE":"datadoghq.com",...}'
# 2. Add Datadog agent as a sidecar or use statsd client in application code
# 3. Deploy and verify
git push origin main
Post-migration:
- [ ] Create dashboards
- [ ] Configure alerts
- [ ] Test alert routing (Slack, PagerDuty)
- [ ] Document runbooks
Datadog → CloudWatch (If reverting to AWS-native)
Pre-migration:
- [ ] Create CloudWatch Log Groups
- [ ] Configure IAM roles for metric publishing
- [ ] Test custom metric publishing from staging
Migration:
// Replace Datadog client with CloudWatch SDK
import "github.com/aws/aws-sdk-go-v2/service/cloudwatch"
// Publish custom metrics
client := cloudwatch.NewFromConfig(cfg)
client.PutMetricData(ctx, &cloudwatch.PutMetricDataInput{
Namespace: aws.String("RestartiX/CoreAPI"),
MetricData: []types.MetricDatum{
{
MetricName: aws.String("ConnectionPoolUtilization"),
Value: aws.Float64(utilizationPct),
Unit: types.StandardUnitPercent,
},
},
})
Testing & Validation
Load Testing
Use k6 or hey to simulate production load:
// load-test.js (k6)
import http from "k6/http";
import { check, sleep } from "k6";
export let options = {
stages: [
{ duration: "2m", target: 100 }, // Ramp up to 100 users
{ duration: "5m", target: 100 }, // Stay at 100 users
{ duration: "2m", target: 200 }, // Ramp up to 200 users
{ duration: "5m", target: 200 }, // Stay at 200 users
{ duration: "2m", target: 0 }, // Ramp down
],
thresholds: {
http_req_duration: ["p(95)<500", "p(99)<1000"], // 95% < 500ms, 99% < 1s
http_req_failed: ["rate<0.01"], // Error rate < 1%
},
};
export default function () {
const res = http.get("https://api.restartix.com/v1/appointments", {
headers: { Authorization: "Bearer " + __ENV.API_TOKEN },
});
check(res, {
"status is 200": (r) => r.status === 200,
"response time < 500ms": (r) => r.timings.duration < 500,
});
sleep(1);
}
Run load test:
k6 run --vus 200 --duration 10m load-test.js
What to watch during load testing:
- Connection pool utilization (should not exceed 80%)
- Query latency (p95, p99)
- Error rate
- CPU/memory usage
- Response time degradation
Chaos Testing
Simulate failures to validate monitoring and alerting:
1. Connection pool exhaustion:
# Temporarily reduce pool size in AWS Secrets Manager (staging only!)
aws secretsmanager update-secret \
--secret-id restartix-staging/env \
--secret-string '{"DB_POOL_MAX_CONNS":"10",...}'
git push origin main
# Run load test
k6 run --vus 50 --duration 5m load-test.js
# Verify alerts fired
# Verify health check shows degraded status
2. Database unavailability:
# Break database connection (staging only!)
aws secretsmanager update-secret \
--secret-id restartix-staging/env \
--secret-string '{"DATABASE_URL":"postgresql://invalid:invalid@localhost/invalid",...}'
git push origin main
# Verify critical alerts
# Verify graceful degradation (503 responses, not crashes)
3. Slow queries:
-- Inject artificial delay (staging only!)
SELECT pg_sleep(10);
Next Steps
Week 1-2: Implement critical monitoring
- [ ] Connection pool metrics (observability/pool_metrics.go)
- [ ] Query performance tracer (middleware/query_tracer.go)
- [ ] Request timeout middleware (middleware/query_timeout.go)
- [ ] Health checks with metrics (health/handler.go)
Week 3-4: Set up dashboards and alerts
- [ ] Create Datadog account (or CloudWatch setup)
- [ ] Build core dashboards (database, API, application)
- [ ] Configure critical alerts (PagerDuty routing)
- [ ] Test alert delivery
Week 5-6: Load testing and optimization
- [ ] Run load tests to baseline performance
- [ ] Identify and fix bottlenecks
- [ ] Tune connection pool sizing
- [ ] Validate SLO targets
Ongoing:
- [ ] Weekly dashboard reviews
- [ ] Monthly load testing (regression detection)
- [ ] Quarterly alert tuning (reduce noise)
- [ ] Feature-specific metric additions as features launch
References
- immediate-actions.md - Critical monitoring implementation
- scaling-architecture.md - Infrastructure scaling and capacity planning
- 12-webhook-system.md - Webhook delivery monitoring
- 06-forms-fields-segments.md - Segment evaluation performance
- 04-auth-and-security.md - Security monitoring requirements