Deployment & CI/CD
How code reaches production. Covers the pipeline (branch → PR → merge → deploy), database migrations, ECS rolling deploys, the manual approval gate for production, rollback, and operational runbooks.
This document assumes the topology in aws-infrastructure.md and the Terraform module layout in iac-layout.md. If you're looking for "how do I add a new feature to the deploy pipeline," start here. If you're looking for "what does the infrastructure actually look like," start with aws-infrastructure.md.
The contract
The pipeline contract — what's guaranteed, what's not:
- git push to master triggers staging deploy (after CI passes)
- Production deploy requires manual approval in GitHub Actions (no auto-promote in the first 6+ months post-launch)
- Migrations run before service deploys — the new image isn't taking traffic until migrations are applied successfully
- Rolling deploys are zero-downtime — ECS spins up new tasks, waits for ALB health checks, drains old tasks
- Rollback is one-click in GitHub Actions UI; the previous image SHA stays in ECR for at least 20 deploys (lifecycle policy)
- Hooks and signing are not bypassed — pre-commit hooks run, signed commits stay signed
What's not in this contract:
- Auto-promote staging → production (deferred; revisit after 3–6 months of stable production operation)
- Canary or blue/green for the application tier (rolling deploy is enough at this scale)
- Database schema rollback (forward-only — see "Migration handling" below)
- Cross-region or multi-region deploy
End-to-end flow
Developer GitHub AWS
───────── ────── ───
git checkout -b feat/X
... code changes ...
git commit + git push
─────► PR opened
CI: make check + tests
CI: green ✓
◄───── Reviewer approves
git merge to master
─────► master push triggers
"deploy-staging" workflow
├── Build images (4 in parallel)
├── Push to ECR with sha + latest tags
│ ─────────────────────► ECR (4 repos)
├── If migrations changed:
│ Run migration ECS task ─────────► RDS (DIRECT URL)
│ Wait for completion
├── Update ECS task definitions
│ ─────────────────────► ECS Fargate
└── Trigger rolling deploy per service
─────────────────────► New tasks → ALB health → drain old tasks
Done in 5–10 min
─── Production gate ───
Reviewer clicks "Approve" on
"deploy-production" workflow
─────► Same pipeline, against
infra/envs/production
                                        ─────────────────────► Production ECS Fargate

Branch + PR flow
Branch protection (master)
Configured on the GitHub repository:
- Require a pull request before merging
- Require approval from at least one reviewer
- Require status checks to pass: make check (lint + typecheck + build), make test, make test-integration
- Require branches to be up to date before merging
- Require linear history (no merge commits — use squash or rebase)
- Require signed commits
- Restrict who can push to master (only via PR merge)
- Disallow force pushes
Working on a change
git checkout master && git pull
git checkout -b feat/your-change
# ... edit code ...
make check # local pre-commit
git commit -s # signed
git push -u origin feat/your-change
gh pr create --base master --title "..." --body "..."

CI runs on every push to the PR branch. The PR can't merge until CI is green and a reviewer approves. Squash-and-merge produces a single commit on master per PR.
What CI runs on every PR
Defined in .github/workflows/ci.yml:
| Job | What | Required for merge |
|---|---|---|
| lint + format | pnpm format:check, pnpm lint, make lint (golangci-lint) | Yes |
| typecheck | pnpm typecheck (Next.js apps) | Yes |
| build | pnpm build + make build | Yes |
| unit tests | make test (Go race detector enabled) | Yes |
| integration tests | make test-integration (testcontainers Postgres + LocalStack S3 + ratelimit) | Yes |
| schema/classification checks | make check (includes cmd/check-classification, cmd/check-soup, cmd/check-events-registry, cmd/check-inbound-webhooks, cmd/check-cata-resolution, cmd/check-capabilities, cmd/check-softdelete, cmd/check-migrations) | Yes |
Total CI time target: under 10 minutes for the typical PR. If it grows past 15 minutes, parallelize before adding more steps.
Deploy pipeline
Defined in .github/workflows/deploy.yml. Triggered by:
- Push to master → deploys to staging automatically
- Manual trigger via workflow_dispatch → deploys to production (requires approval on the GitHub environment)
Pipeline steps
# Pseudocode of the deploy workflow
steps:
- checkout
- configure-aws-credentials # OIDC federation, no long-lived access keys
role-to-assume: arn:aws:iam::ACCOUNT:role/restartix-deploy-{env}
- login-ecr
- build-images-in-parallel:
- docker build -f services/api/deploy/Dockerfile.api → restartix-core-api
- docker build -f services/api/deploy/Dockerfile.telemetry → restartix-telemetry-api
- docker build apps/clinic → restartix-clinic
- docker build apps/portal → restartix-portal
- docker build apps/console → restartix-console
- push-to-ecr:
tag each as $sha and latest
- detect-migrations:
diff services/api/migrations/ between previous deploy SHA and current
if changed → set need_migrations=true
- run-migrations (if need_migrations):
aws ecs run-task \
--cluster restartix-{env} \
--task-definition restartix-migrations:latest \
--overrides '{"containerOverrides":[{"name":"migrate","command":["migrate-up"]}]}' \
--launch-type FARGATE \
--network-configuration ...
wait for completion (poll task status)
fail the pipeline if migrations exit non-zero
- update-ecs-services-in-parallel:
for each of (core-api, telemetry-api, clinic, portal, console):
register new task definition revision pointing at $sha image
aws ecs update-service ... --task-definition NEW_REVISION
- wait-for-rolling-deploys:
poll service deployment status until all show PRIMARY task set is steady
- smoke-test:
curl https://{env}.restartix.pro/health → expect 200
run a small synthetic acceptance script (sign in, list orgs, etc.)
- report:
    slack notification with deploy SHA + duration

Production approval gate
The deploy-production workflow uses a GitHub Environment named production configured to require approval from a designated reviewer set before any job can run. The same SHA that just deployed to staging is the deploy candidate — the reviewer approves the promotion, not a fresh build.
# .github/workflows/deploy-production.yml
jobs:
deploy:
environment:
name: production
url: https://app.restartix.pro
# GitHub blocks here until a reviewer clicks "Approve"
runs-on: ubuntu-latest
    steps: ...

Database migrations
Migrations live in services/api/migrations/core/ and are applied with golang-migrate.
How migrations run during deploy
- The deploy workflow diffs services/api/migrations/core/ between the last-deployed SHA and the current SHA
- If anything changed, it runs the migration as a one-shot ECS task
- The migration task uses DATABASE_DIRECT_URL (RDS / Aurora cluster endpoint, bypasses pgbouncer)
- The new application image isn't deployed until migrations succeed
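The diff check above can be sketched as a small git helper. `migrations_changed` is a name invented here for illustration — the real workflow inlines the logic — and the previous-deploy SHA is assumed to come from the last successful run's metadata.

```shell
# Hypothetical helper sketching the detect-migrations step: did anything under
# the migrations directory change between the last-deployed SHA and the candidate?
migrations_changed() {
  # $1 = previously deployed SHA, $2 = candidate SHA
  # `git diff --quiet` exits 0 when nothing changed; invert for a yes/no answer
  ! git diff --quiet "$1" "$2" -- services/api/migrations/core/
}
```

In the workflow this would gate the run-migrations step, e.g. `if migrations_changed "$PREV_SHA" "$GITHUB_SHA"; then ... fi`.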
Why DIRECT_URL: golang-migrate takes pg_advisory_lock to serialize migration runs across deploying instances. Advisory locks are session-scoped; pgbouncer in transaction-pooling mode hands the server connection to another client between transactions, which would effectively release the lock mid-migration.
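A minimal illustration of the session scoping (run via psql against a scratch database; the lock key 4242 is arbitrary):

```sql
-- Advisory locks belong to the connection (session), not to a transaction:
BEGIN;
SELECT pg_advisory_lock(4242);   -- acquired
COMMIT;                          -- transaction ends; the lock is STILL held
-- Under transaction-pooling pgbouncer, the server connection could now be
-- handed to a different client, stranding the lock.
SELECT pg_advisory_unlock(4242); -- only this same session can release it
```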
Migration discipline (forward-only)
- Migrations are forward-only. Down migrations exist for local dev (make migrate-down) but are not run in staging or production.
- A bad migration is fixed by writing a new migration, not by reverting the bad one. Once data has been written under a schema change, reverting the schema is data loss.
- Pre-production phase exception: see CLAUDE.md → "Migrations are editable pre-production." Until first production deploy, edit the original CREATE TABLE migration rather than stacking ALTERs.
Migration safety checklist (manual review on every PR with schema changes)
- [ ] Adding a column with a non-null default on a large table → use a separate ALTER ADD COLUMN (nullable) → backfill in a separate migration → ALTER SET NOT NULL. Locking pattern matters.
- [ ] Adding indexes on large tables → use CREATE INDEX CONCURRENTLY (golang-migrate supports this with a directive)
- [ ] Renaming columns → use the expand-contract pattern (add new column, dual-write, backfill, switch reads, drop old)
- [ ] Dropping columns → ensure no application code reads them; land the drop in a separate PR after the read paths are removed
- [ ] Foreign-key changes → review for downtime risk on large referenced tables
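The first checklist item's three-step locking pattern, sketched as hypothetical migration files (table and column names are illustrative, not from the actual schema):

```sql
-- 1) 000123_add_risk_score.up.sql — add the column nullable:
--    metadata-only on modern Postgres, no table rewrite, brief lock
ALTER TABLE treatment_plans ADD COLUMN risk_score integer;

-- 2) 000124_backfill_risk_score.up.sql — backfill; for very large tables,
--    batch this outside one long transaction to keep lock times short
UPDATE treatment_plans SET risk_score = 0 WHERE risk_score IS NULL;

-- 3) 000125_set_risk_score_not_null.up.sql — enforce once backfilled
ALTER TABLE treatment_plans ALTER COLUMN risk_score SET NOT NULL;
```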
Rolling deploys
ECS rolling deploys are configured per service in Terraform:
resource "aws_ecs_service" "core_api" {
...
deployment_minimum_healthy_percent = 100 # never go below current task count
deployment_maximum_percent = 200 # spin up to 2× during deploy
deployment_circuit_breaker {
enable = true
rollback = true # auto-rollback if new tasks fail health checks
}
}

The mechanic:
- New task definition revision registered (image tag updated)
- ECS spins up new tasks (up to maximum_percent of desired count)
- New tasks register with the ALB target group
- ALB runs health checks against /health on each new task
- Once new tasks are healthy, old tasks are deregistered from the target group (drained)
- Old tasks receive SIGTERM; the Go app's graceful shutdown handler (30s timeout) runs
- Old tasks exit; deploy is complete
For a 2-task service: 5–8 minutes end-to-end. For a 6-task service: 8–12 minutes.
Circuit breaker
If new tasks repeatedly fail health checks, ECS's deployment circuit breaker auto-rolls back to the previous task definition revision. The deploy workflow surfaces this as a failed job; alarms fire.
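The wait-for-rolling-deploys step and the circuit-breaker outcome together amount to a polling loop. `poll_rollout` is an invented name for this sketch; the real pipeline inlines a per-service query that prints the PRIMARY deployment's rolloutState.

```shell
# Hypothetical polling helper: "$@" is any command printing the PRIMARY
# deployment's rolloutState (IN_PROGRESS / COMPLETED / FAILED). In the real
# workflow this is an `aws ecs describe-services ... --query` call per service.
poll_rollout() {
  while true; do
    case "$("$@")" in
      COMPLETED) return 0 ;;  # steady state reached
      FAILED)    return 1 ;;  # circuit breaker tripped and rolled back
      *)         sleep 1 ;;   # IN_PROGRESS (or transient blank): keep waiting
    esac
  done
}
```

In the workflow it would be invoked once per service, with the describe-services query as the command and a longer sleep interval.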
Rollback
Three flavors, in order of preference:
Rollback A: redeploy a previous image SHA (fastest, no code change)
The "Deploy" workflow accepts an optional image_sha input. Pass the SHA of the last known-good deploy (visible in the workflow run history). The workflow updates ECS task definitions to point at that older image and triggers a rolling deploy. The same migration-safety story applies: if migrations have advanced past the older image, they stay applied; the rolled-back image runs against the newer schema, which is exactly why migrations must be forward-compatible per the discipline above.
gh workflow run deploy-staging.yml -f image_sha=abc123def
# or for production (still requires approval gate):
gh workflow run deploy-production.yml -f image_sha=abc123def

Rollback B: revert the offending PR
Use when the issue is a code regression and rolling back via image SHA isn't enough (the same SHA that broke staging would also break a re-deploy). Open a revert PR, merge, normal pipeline applies.
Rollback C: emergency manual ECS update
Used only in active incidents when GitHub Actions itself is unavailable or when the workflow can't execute fast enough:
# Find the previous task definition revision
aws ecs list-task-definitions \
--family-prefix restartix-core-api \
--status ACTIVE \
--sort DESC | head -5
# Update the service to point at it
aws ecs update-service \
--cluster restartix-production \
--service restartix-core-api \
--task-definition restartix-core-api:PREVIOUS_REVISION \
  --force-new-deployment

This bypasses the approval gate and is a break-glass action. Audit it explicitly afterwards (log the incident in the operations channel, capture the SHAs involved).
First-time setup
These steps are run once per AWS account when bootstrapping the deploy pipeline. They live in Terraform in infra/modules/deploy/ and infra/envs/{env}/deploy.tf.
GitHub Actions OIDC federation (no long-lived AWS keys)
resource "aws_iam_openid_connect_provider" "github" {
url = "https://token.actions.githubusercontent.com"
client_id_list = ["sts.amazonaws.com"]
thumbprint_list = ["..."] # GitHub's OIDC thumbprint
}
resource "aws_iam_role" "deploy" {
name = "restartix-deploy-${var.env}"
assume_role_policy = data.aws_iam_policy_document.deploy_trust.json
}
data "aws_iam_policy_document" "deploy_trust" {
statement {
principals {
type = "Federated"
identifiers = [aws_iam_openid_connect_provider.github.arn]
}
actions = ["sts:AssumeRoleWithWebIdentity"]
condition {
test = "StringEquals"
variable = "token.actions.githubusercontent.com:sub"
values = ["repo:restartix/restartix-platform:ref:refs/heads/master"]
}
}
}

The deploy role's permissions are scoped to ECR push, ECS update-service, ECS RunTask (for migrations), CloudWatch Logs read (for diagnostics), and Secrets Manager read on restartix/{env}/*. Nothing else.
No AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY in GitHub Secrets — the OIDC token is exchanged for short-lived STS credentials at the start of each workflow run.
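A sketch of what that scoped policy could look like in the deploy module; the action lists and ARN patterns here are illustrative assumptions, not the actual module contents.

```hcl
# Illustrative only — the real policy lives in infra/modules/deploy/.
data "aws_iam_policy_document" "deploy_permissions" {
  statement { # ECR login is account-wide by design
    actions   = ["ecr:GetAuthorizationToken"]
    resources = ["*"]
  }
  statement { # push images to this project's repositories only
    actions = [
      "ecr:BatchCheckLayerAvailability", "ecr:InitiateLayerUpload",
      "ecr:UploadLayerPart", "ecr:CompleteLayerUpload",
      "ecr:PutImage", "ecr:BatchGetImage",
    ]
    resources = ["arn:aws:ecr:*:*:repository/restartix-*"]
  }
  statement { # deploy, one-shot migration task, diagnostics
    actions = [
      "ecs:RegisterTaskDefinition", "ecs:UpdateService", "ecs:RunTask",
      "ecs:DescribeServices", "ecs:DescribeTasks",
      "logs:GetLogEvents", "logs:FilterLogEvents",
    ]
    resources = ["*"] # tightened with ARNs/conditions in the real module
  }
  statement { # read deploy-time secrets for this environment only
    actions   = ["secretsmanager:GetSecretValue"]
    resources = ["arn:aws:secretsmanager:*:*:secret:restartix/${var.env}/*"]
  }
}
```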
GitHub Environment for production approval gate
Configured in the GitHub repo settings → Environments → production:
- Required reviewers: list of designated reviewer GitHub usernames
- Wait timer: 0 (no forced delay; reviewer approves when ready)
- Deployment branches: master only (production deploys cannot run from feature branches)
Common operational tasks
psql access via SSM Session Manager
No bastion EC2, no SSH keys. Connect via AWS Systems Manager Session Manager port forwarding through a long-running task:
# Find a Core API task ID (any one is fine — all live in the same private subnet)
aws ecs list-tasks \
--cluster restartix-production \
--service-name restartix-core-api \
--query 'taskArns[0]' --output text
# Start port forwarding from RDS port 5432 to your local 15432
aws ssm start-session \
--target ecs:restartix-production_TASK_ID_RUNTIME_ID \
--document-name AWS-StartPortForwardingSessionToRemoteHost \
--parameters '{"host":["RDS_WRITER_ENDPOINT"],"portNumber":["5432"],"localPortNumber":["15432"]}'
# In another terminal: connect with psql
psql "postgresql://OPS_USER@localhost:15432/restartix?sslmode=require"
# Password retrieved from Secrets Manager separately

Audit-logged via CloudTrail. Read-only ops should use the read-only role; never use the migration role for ad-hoc queries.
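Assembling the --target string by hand is error-prone. A small helper (invented name, sketch only) shows the three pieces: cluster name, task ID (the tail of the task ARN), and the container's runtimeId from describe-tasks.

```shell
# Hypothetical helper: compose the SSM target "ecs:CLUSTER_TASKID_RUNTIMEID".
# $1 = cluster name, $2 = task ARN (or bare task ID), $3 = container runtimeId
ssm_target() {
  echo "ecs:$1_${2##*/}_$3"   # ${2##*/} strips everything up to the last "/"
}
# The runtimeId comes from:
#   aws ecs describe-tasks --cluster CLUSTER --tasks TASK_ARN \
#     --query 'tasks[0].containers[0].runtimeId' --output text
```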
Tail production logs
aws logs tail /ecs/restartix-production/core-api --follow
aws logs tail /ecs/restartix-production/clinic --follow --filter-pattern "ERROR"

Take a manual RDS snapshot before a risky migration
aws rds create-db-snapshot \
--db-instance-identifier restartix-production \
  --db-snapshot-identifier "pre-migration-$(date +%Y%m%d-%H%M%S)"

Manual snapshots persist until explicitly deleted. Always take one before a schema change you're nervous about.
Force a service to redeploy with the same image
aws ecs update-service \
--cluster restartix-production \
--service restartix-core-api \
  --force-new-deployment

Useful for: picking up rotated Secrets Manager values, recovering from a stuck task, validating health-check changes.
Runbooks
Runbook: deploy stuck in "in progress"
Symptom: GitHub Actions workflow has been waiting on the rolling deploy step for >15 minutes.
Diagnose:
# Check the service deployment status
aws ecs describe-services \
--cluster restartix-{env} \
--services restartix-core-api \
--query 'services[].deployments'
# Look at the events stream — failures usually show here
aws ecs describe-services \
--cluster restartix-{env} \
--services restartix-core-api \
--query 'services[].events[:10]'
# Check recent task failures
aws ecs list-tasks \
--cluster restartix-{env} \
--service-name restartix-core-api \
--desired-status STOPPED
# For a stopped task, get the stop reason
aws ecs describe-tasks \
--cluster restartix-{env} \
--tasks TASK_ID \
  --query 'tasks[].[stoppedReason,containers[].reason]'

Common causes:
- Health check failing (new image broken, dependency down, env var misconfigured)
- Insufficient capacity in the AZ (Fargate Spot evictions in staging)
- IAM role missing a required permission
- Secret missing or unreadable
Resolve:
- If deploy circuit breaker has rolled back, the deploy workflow shows failed; investigate the regression
- If still in progress and tasks are healthy, wait — drain takes time
- If genuinely stuck, manually update-service with --force-new-deployment to retry
Runbook: migration failed mid-deploy
Symptom: The deploy workflow's "run-migrations" step exited non-zero. The new application image was not deployed.
Diagnose:
# Find the migration task
aws ecs list-tasks \
--cluster restartix-{env} \
--family restartix-migrations \
--desired-status STOPPED
# Get logs
aws logs tail /ecs/restartix-{env}/migrations --since 30m

Resolve:
- Read the migration error. Common: SQL syntax in the new migration, foreign-key violation, lock timeout on a large table.
- The application is still on the previous image — service is unaffected
- Fix the migration in a new commit (forward-only — never edit the failed migration's SQL after it has partially applied)
- Re-run the deploy workflow
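One failure mode worth knowing: golang-migrate records a dirty flag when a migration exits partway through, and refuses further runs until the version is forced. A hedged sketch using the golang-migrate CLI (the version number 42 is illustrative):

```shell
# Inspect the recorded schema version; a partial failure shows as dirty
migrate -path services/api/migrations/core \
        -database "$DATABASE_DIRECT_URL" version

# After manually verifying/cleaning up the partially applied change,
# clear the dirty flag by forcing the version, then re-run the deploy
migrate -path services/api/migrations/core \
        -database "$DATABASE_DIRECT_URL" force 42
```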
Runbook: rolling deploy fails health checks
Symptom: New tasks started but never went healthy; deployment circuit breaker rolled back.
Diagnose:
# Get the new task ARN that failed
aws ecs describe-services ... --query 'services[].deployments[].failedTasks'
# Check container exit code and logs
aws ecs describe-tasks --tasks TASK_ARN
aws logs tail /ecs/restartix-{env}/core-api --since 10m

Common causes:
- App can't reach RDS / Redis (security group misconfig, secret rotation went wrong)
- Migration was applied but app still expects old schema (rare — usually the deploy ordering prevents this)
- Image is missing a runtime dependency (binary not built correctly)
Resolve:
- Circuit breaker already rolled back; service is on the previous task definition
- Fix the underlying issue
- Re-deploy
Runbook: emergency hotfix to production
When: A production-only bug needs immediate fixing and the standard PR-and-approval flow is too slow.
Steps:
- Branch off master (not from staging — staging may have unreviewed work)
- Make the minimal fix
- Open a PR, mark it urgent, request review from on-call reviewer
- Once reviewed + CI green, merge
- Trigger the production deploy workflow manually (the approval gate still applies — but the same reviewer who approved the PR can approve the deploy)
- Watch the deploy
- Post-incident, write a brief postmortem in the operations log: what broke, what was changed, what to monitor going forward
Runbook: launch-day legacy-data migration
Context: RestartiX replaces a legacy product with 20k+ users / 11k+ treatment plans / 5k+ active subscriptions. On launch day, this data migrates over.
Pre-migration:
- Production AWS environment is built and validated against the staging acceptance test list
- Application is in maintenance mode (return 503 for all requests, branded maintenance page served by Cloudflare)
- A dry run of the migration has been completed against staging using a recent snapshot of legacy data
Migration steps:
# 1. Take a manual RDS snapshot of the (empty) production DB as a known-clean starting point
aws rds create-db-snapshot \
--db-instance-identifier restartix-production \
--db-snapshot-identifier "pre-launch-$(date +%Y%m%d-%H%M%S)"
# 2. Pull the legacy database dump (source: existing legacy product host)
pg_dump $LEGACY_DATABASE_URL --format=custom --no-owner > legacy-data.pgdump
# 3. Run the data-transform pipeline (legacy schema → restartix schema). This is a one-time
# Go binary not in the standard pipeline; lives at services/migration-tools/legacy-import/.
# It reads the legacy dump, transforms records, writes via the standard repo layer (so RLS
# and audit log work normally).
go run ./services/migration-tools/legacy-import \
--source legacy-data.pgdump \
--target $DATABASE_DIRECT_URL
# 4. Validate row counts against expected targets
psql "$DATABASE_DIRECT_URL" -c "SELECT count(*) FROM organizations"
psql "$DATABASE_DIRECT_URL" -c "SELECT count(*) FROM humans"
psql "$DATABASE_DIRECT_URL" -c "SELECT count(*) FROM organization_subscriptions WHERE status='active'"
# 5. Verify a few legacy users can sign in via the production app
# 6. Take another manual snapshot — the loaded but pre-traffic state
aws rds create-db-snapshot \
--db-instance-identifier restartix-production \
--db-snapshot-identifier "post-import-$(date +%Y%m%d-%H%M%S)"
# 7. Remove maintenance mode (Cloudflare rule update)
# 8. Watch monitoring for 24h

Rollback plan (if data import was wrong):
- Restore RDS from the pre-launch-... snapshot
- Re-enable maintenance mode
- Fix the import
- Re-run
Keep the legacy database operational for 7 days post-launch as a safety net. Then decommission.
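Step 4's spot checks can be tightened into floor assertions against the known legacy volumes (20k+ users, 11k+ plans, 5k+ active subscriptions). A hedged sketch; `check_min` is an invented helper, and the thresholds should be replaced with exact counts from the dry run.

```shell
# Hypothetical validation helper: fail loudly if an imported count falls
# below the expected floor. Uses the same psql access as the steps above.
check_min() { # $1 = label, $2 = expected minimum, $3 = count query
  n=$(psql "$DATABASE_DIRECT_URL" -Atc "$3")
  [ "$n" -ge "$2" ] || { echo "FAIL: $1 has $n rows, expected >= $2"; exit 1; }
}
check_min humans      20000 "SELECT count(*) FROM humans"
check_min active_subs 5000  "SELECT count(*) FROM organization_subscriptions WHERE status='active'"
```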
Related documentation
- AWS infrastructure — full topology and cost
- IaC layout — Terraform modules and where the deploy IAM role lives
- Scaling architecture — task sizing and auto-scaling parameters
- Backup & DR — manual snapshots, PITR, restore runbooks
- Monitoring — alarms that fire during a bad deploy
- Decisions — why ECS Fargate, why Terraform, why manual approval gate