Deployment & CI/CD
How code reaches production. Covers the pipeline (branch → PR → merge → deploy), database migrations, ECS rolling deploys, the manual approval gate for production, rollback, and operational runbooks.
This document assumes the topology in aws-infrastructure.md and the Terraform module layout in iac-layout.md. If you're looking for "how do I add a new feature to the deploy pipeline," start here. If you're looking for "what does the infrastructure actually look like," start with aws-infrastructure.md.
The contract
The pipeline contract — what's guaranteed, what's not:
- git push to master triggers staging deploy (after CI passes)
- Production deploy requires manual approval in GitHub Actions (no auto-promote in the first 6+ months post-launch)
- Migrations run before service deploys — the new image isn't taking traffic until migrations are applied successfully
- Rolling deploys are zero-downtime — ECS spins up new tasks, waits for ALB health checks, drains old tasks
- Rollback is one-click in GitHub Actions UI; the previous image SHA stays in ECR for at least 20 deploys (lifecycle policy)
- Hooks and signing are not bypassed — pre-commit hooks run, signed commits stay signed
What's not in this contract:
- Auto-promote staging → production (deferred; revisit after 3–6 months of stable production operation)
- Canary or blue/green for the application tier (rolling deploy is enough at this scale)
- Database schema rollback (forward-only — see "Migration handling" below)
- Cross-region or multi-region deploy
End-to-end flow
Developer GitHub AWS
───────── ────── ───
git checkout -b feat/X
... code changes ...
git commit + git push
─────► PR opened
CI: make check + tests
CI: green ✓
◄───── Reviewer approves
git merge to master
─────► master push triggers
"deploy-staging" workflow
├── Build images (4 in parallel)
├── Push to ECR with sha + latest tags
│ ─────────────────────► ECR (4 repos)
├── If migrations changed:
│ Run migration ECS task ─────────► RDS (DIRECT URL)
│ Wait for completion
├── Update ECS task definitions
│ ─────────────────────► ECS Fargate
└── Trigger rolling deploy per service
─────────────────────► New tasks → ALB health → drain old tasks
Done in 5–10 min
─── Production gate ───
Reviewer clicks "Approve" on
"deploy-production" workflow
─────► Same pipeline, against
infra/envs/production
                                        ─────────────────────► Production ECS Fargate

Branch + PR flow
Branch protection (master)
Configured on the GitHub repository:
- Require a pull request before merging
- Require approval from at least one reviewer
- Require status checks to pass: make check (lint + typecheck + build), make test, make test-integration
- Require branches to be up to date before merging
- Require linear history (no merge commits — use squash or rebase)
- Require signed commits
- Restrict who can push to master (only via PR merge)
- Disallow force pushes
Working on a change
git checkout master && git pull
git checkout -b feat/your-change
# ... edit code ...
make check # local pre-commit
git commit -s # signed
git push -u origin feat/your-change
gh pr create --base master --title "..." --body "..."

CI runs on every push to the PR branch. The PR can't merge until CI is green and a reviewer approves. Squash-and-merge produces a single commit on master per PR.
What CI runs on every PR
Defined in .github/workflows/ci.yml:
| Job | What | Required for merge |
|---|---|---|
| lint + format | pnpm format:check, pnpm lint, make lint (golangci-lint) | Yes |
| typecheck | pnpm typecheck (Next.js apps) | Yes |
| build | pnpm build + make build | Yes |
| unit tests | make test (Go race detector enabled) | Yes |
| integration tests | make test-integration (testcontainers Postgres + LocalStack S3 + ratelimit) | Yes |
| schema/classification checks | make check (includes cmd/check-classification, cmd/check-soup, cmd/check-events-registry, cmd/check-inbound-webhooks, cmd/check-cata-resolution, cmd/check-capabilities, cmd/check-softdelete, cmd/check-migrations) | Yes |
Total CI time target: under 10 minutes for the typical PR. If it grows past 15 minutes, parallelize before adding more steps.
Deploy pipeline
Defined in .github/workflows/deploy.yml. Triggered by:
- Push to master → deploys to staging automatically
- Manual trigger via workflow_dispatch → deploys to production (requires approval on the GitHub environment)
Pipeline steps
# Pseudocode of the deploy workflow
steps:
- checkout
- configure-aws-credentials # OIDC federation, no long-lived access keys
role-to-assume: arn:aws:iam::ACCOUNT:role/restartix-deploy-{env}
- login-ecr
- build-images-in-parallel:
- docker build -f services/api/deploy/Dockerfile.api → restartix-core-api
- docker build -f services/api/deploy/Dockerfile.telemetry → restartix-telemetry-api
- docker build apps/clinic → restartix-clinic
- docker build apps/portal → restartix-portal
- docker build apps/console → restartix-console
- push-to-ecr:
tag each as $sha and latest
- detect-migrations:
diff services/api/migrations/ between previous deploy SHA and current
if changed → set need_migrations=true
- run-migrations (if need_migrations):
aws ecs run-task \
--cluster restartix-{env} \
--task-definition restartix-migrations:latest \
--overrides '{"containerOverrides":[{"name":"migrate","command":["migrate-up"]}]}' \
--launch-type FARGATE \
--network-configuration ...
wait for completion (poll task status)
fail the pipeline if migrations exit non-zero
- update-ecs-services-in-parallel:
for each of (core-api, telemetry-api, clinic, portal, console):
register new task definition revision pointing at $sha image
aws ecs update-service ... --task-definition NEW_REVISION
- wait-for-rolling-deploys:
poll service deployment status until all show PRIMARY task set is steady
- smoke-test:
curl https://{env}.restartix.pro/health → expect 200
run a small synthetic acceptance script (sign in, list orgs, etc.)
- report:
    slack notification with deploy SHA + duration

Production approval gate
The deploy-production workflow uses a GitHub Environment named production configured to require approval from a designated reviewer set before any job can run. The same SHA that just deployed to staging is the deploy candidate — the reviewer approves the promotion, not a fresh build.
# .github/workflows/deploy-production.yml
jobs:
deploy:
environment:
name: production
url: https://app.restartix.pro
# GitHub blocks here until a reviewer clicks "Approve"
runs-on: ubuntu-latest
    steps: ...

Database migrations
Migrations live in services/api/migrations/core/ and are applied with golang-migrate.
How migrations run during deploy
- The deploy workflow diffs services/api/migrations/core/ between the last-deployed SHA and the current SHA
- If anything changed, it runs the migration as a one-shot ECS task
- The migration task uses DATABASE_DIRECT_URL (RDS / Aurora cluster endpoint, bypasses pgbouncer)
- The new application image isn't deployed until migrations succeed
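The diff check above can be sketched as a small git helper. `migrations_changed` is a name invented here for illustration — the real workflow inlines the logic — and the previous-deploy SHA is assumed to come from the last successful run's metadata.

```shell
# Hypothetical helper sketching the detect-migrations step: did anything under
# the migrations directory change between the last-deployed SHA and the candidate?
migrations_changed() {
  # $1 = previously deployed SHA, $2 = candidate SHA
  # `git diff --quiet` exits 0 when nothing changed; invert for a yes/no answer
  ! git diff --quiet "$1" "$2" -- services/api/migrations/core/
}
```

In the workflow this would gate the run-migrations step, e.g. `if migrations_changed "$PREV_SHA" "$GITHUB_SHA"; then ... fi`.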
Why DIRECT_URL: golang-migrate takes pg_advisory_lock to serialize migration runs across deploying instances. Advisory locks are session-scoped; pgbouncer in transaction-pooling mode hands the server connection to another client between transactions, which would effectively release the lock mid-migration.
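A minimal illustration of the session scoping (run via psql against a scratch database; the lock key 4242 is arbitrary):

```sql
-- Advisory locks belong to the connection (session), not to a transaction:
BEGIN;
SELECT pg_advisory_lock(4242);   -- acquired
COMMIT;                          -- transaction ends; the lock is STILL held
-- Under transaction-pooling pgbouncer, the server connection could now be
-- handed to a different client, stranding the lock.
SELECT pg_advisory_unlock(4242); -- only this same session can release it
```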
Migration discipline (forward-only)
- Migrations are forward-only. Down migrations exist for local dev (make migrate-down) but are not run in staging or production.
- A bad migration is fixed by writing a new migration, not by reverting the bad one. Once data has been written under a schema change, reverting the schema is data loss.
- Pre-production phase exception: see CLAUDE.md → "Migrations are editable pre-production." Until first production deploy, edit the original CREATE TABLE migration rather than stacking ALTERs.
Migration safety checklist (manual review on every PR with schema changes)
- [ ] Adding a column with a non-null default on a large table → use a separate ALTER ADD COLUMN (nullable) → backfill in a separate migration → ALTER SET NOT NULL. Locking pattern matters.
- [ ] Adding indexes on large tables → use CREATE INDEX CONCURRENTLY (golang-migrate supports this with a directive)
- [ ] Renaming columns → use the expand-contract pattern (add new column, dual-write, backfill, switch reads, drop old)
- [ ] Dropping columns → ensure no application code reads them; land the drop in a separate PR after the read paths are removed
- [ ] Foreign-key changes → review for downtime risk on large referenced tables
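The first checklist item's three-step locking pattern, sketched as hypothetical migration files (table and column names are illustrative, not from the actual schema):

```sql
-- 1) 000123_add_risk_score.up.sql — add the column nullable:
--    metadata-only on modern Postgres, no table rewrite, brief lock
ALTER TABLE treatment_plans ADD COLUMN risk_score integer;

-- 2) 000124_backfill_risk_score.up.sql — backfill; for very large tables,
--    batch this outside one long transaction to keep lock times short
UPDATE treatment_plans SET risk_score = 0 WHERE risk_score IS NULL;

-- 3) 000125_set_risk_score_not_null.up.sql — enforce once backfilled
ALTER TABLE treatment_plans ALTER COLUMN risk_score SET NOT NULL;
```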
Rolling deploys
ECS rolling deploys are configured per service in Terraform:
resource "aws_ecs_service" "core_api" {
...
deployment_minimum_healthy_percent = 100 # never go below current task count
deployment_maximum_percent = 200 # spin up to 2× during deploy
deployment_circuit_breaker {
enable = true
rollback = true # auto-rollback if new tasks fail health checks
}
}

The mechanic:
- New task definition revision registered (image tag updated)
- ECS spins up new tasks (up to maximum_percent of desired count)
- New tasks register with the ALB target group
- ALB runs health checks against /health on each new task
- Once new tasks are healthy, old tasks are deregistered from the target group (drained)
- Old tasks receive SIGTERM; the Go app's graceful shutdown handler (30s timeout) runs
- Old tasks exit; deploy is complete
For a 2-task service: 5–8 minutes end-to-end. For a 6-task service: 8–12 minutes.
Circuit breaker
If new tasks repeatedly fail health checks, ECS's deployment circuit breaker auto-rolls back to the previous task definition revision. The deploy workflow surfaces this as a failed job; alarms fire.
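The wait-for-rolling-deploys step and the circuit-breaker outcome together amount to a polling loop. `poll_rollout` is an invented name for this sketch; the real pipeline inlines a per-service query that prints the PRIMARY deployment's rolloutState.

```shell
# Hypothetical polling helper: "$@" is any command printing the PRIMARY
# deployment's rolloutState (IN_PROGRESS / COMPLETED / FAILED). In the real
# workflow this is an `aws ecs describe-services ... --query` call per service.
poll_rollout() {
  while true; do
    case "$("$@")" in
      COMPLETED) return 0 ;;  # steady state reached
      FAILED)    return 1 ;;  # circuit breaker tripped and rolled back
      *)         sleep 1 ;;   # IN_PROGRESS (or transient blank): keep waiting
    esac
  done
}
```

In the workflow it would be invoked once per service, with the describe-services query as the command and a longer sleep interval.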
Rollback
Three flavors, in order of preference:
Rollback A: redeploy a previous image SHA (fastest, no code change)
The "Deploy" workflow accepts an optional image_sha input. Pass the SHA of the last known-good deploy (visible in the workflow run history). The workflow updates ECS task definitions to point at that older image and triggers a rolling deploy. The same migration-safety story applies: if migrations have advanced past the older image, they stay applied; the rolled-back image runs against the newer schema, which is exactly why migrations must be forward-compatible per the discipline above.
gh workflow run deploy-staging.yml -f image_sha=abc123def
# or for production (still requires approval gate):
gh workflow run deploy-production.yml -f image_sha=abc123def

Rollback B: revert the offending PR
Use when the issue is a code regression and rolling back via image SHA isn't enough (the same SHA that broke staging would also break a re-deploy). Open a revert PR, merge, normal pipeline applies.
Rollback C: emergency manual ECS update
Used only in active incidents when GitHub Actions itself is unavailable or when the workflow can't execute fast enough:
# Find the previous task definition revision
aws ecs list-task-definitions \
--family-prefix restartix-core-api \
--status ACTIVE \
--sort DESC | head -5
# Update the service to point at it
aws ecs update-service \
--cluster restartix-production \
--service restartix-core-api \
--task-definition restartix-core-api:PREVIOUS_REVISION \
  --force-new-deployment

This bypasses the approval gate and is a break-glass action. Audit it explicitly afterwards (log the incident in the operations channel, capture the SHAs involved).
First-time setup
These steps are run once per AWS account when bootstrapping the deploy pipeline. They live in Terraform in infra/modules/deploy/ and infra/envs/{env}/deploy.tf.
GitHub Actions OIDC federation (no long-lived AWS keys)
resource "aws_iam_openid_connect_provider" "github" {
url = "https://token.actions.githubusercontent.com"
client_id_list = ["sts.amazonaws.com"]
thumbprint_list = ["..."] # GitHub's OIDC thumbprint
}
resource "aws_iam_role" "deploy" {
name = "restartix-deploy-${var.env}"
assume_role_policy = data.aws_iam_policy_document.deploy_trust.json
}
data "aws_iam_policy_document" "deploy_trust" {
statement {
principals {
type = "Federated"
identifiers = [aws_iam_openid_connect_provider.github.arn]
}
actions = ["sts:AssumeRoleWithWebIdentity"]
condition {
test = "StringEquals"
variable = "token.actions.githubusercontent.com:sub"
values = ["repo:restartix/restartix-platform:ref:refs/heads/master"]
}
}
}

The deploy role's permissions are scoped to ECR push, ECS update-service, ECS RunTask (for migrations), CloudWatch Logs read (for diagnostics), and Secrets Manager read on restartix/{env}/*. Nothing else.
No AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY in GitHub Secrets — the OIDC token is exchanged for short-lived STS credentials at the start of each workflow run.
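A sketch of what that scoped policy could look like in the deploy module; the action lists and ARN patterns here are illustrative assumptions, not the actual module contents.

```hcl
# Illustrative only — the real policy lives in infra/modules/deploy/.
data "aws_iam_policy_document" "deploy_permissions" {
  statement { # ECR login is account-wide by design
    actions   = ["ecr:GetAuthorizationToken"]
    resources = ["*"]
  }
  statement { # push images to this project's repositories only
    actions = [
      "ecr:BatchCheckLayerAvailability", "ecr:InitiateLayerUpload",
      "ecr:UploadLayerPart", "ecr:CompleteLayerUpload",
      "ecr:PutImage", "ecr:BatchGetImage",
    ]
    resources = ["arn:aws:ecr:*:*:repository/restartix-*"]
  }
  statement { # deploy, one-shot migration task, diagnostics
    actions = [
      "ecs:RegisterTaskDefinition", "ecs:UpdateService", "ecs:RunTask",
      "ecs:DescribeServices", "ecs:DescribeTasks",
      "logs:GetLogEvents", "logs:FilterLogEvents",
    ]
    resources = ["*"] # tightened with ARNs/conditions in the real module
  }
  statement { # read deploy-time secrets for this environment only
    actions   = ["secretsmanager:GetSecretValue"]
    resources = ["arn:aws:secretsmanager:*:*:secret:restartix/${var.env}/*"]
  }
}
```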
GitHub Environment for production approval gate
Configured in the GitHub repo settings → Environments → production:
- Required reviewers: list of designated reviewer GitHub usernames
- Wait timer: 0 (no forced delay; reviewer approves when ready)
- Deployment branches: master only (production deploys cannot run from feature branches)
Common operational tasks
psql access via SSM Session Manager
No bastion EC2, no SSH keys. Connect via AWS Systems Manager Session Manager port forwarding through a long-running task:
# Find a Core API task ID (any one is fine — all live in the same private subnet)
aws ecs list-tasks \
--cluster restartix-production \
--service-name restartix-core-api \
--query 'taskArns[0]' --output text
# Start port forwarding from RDS port 5432 to your local 15432
aws ssm start-session \
--target ecs:restartix-production_TASK_ID_RUNTIME_ID \
--document-name AWS-StartPortForwardingSessionToRemoteHost \
--parameters '{"host":["RDS_WRITER_ENDPOINT"],"portNumber":["5432"],"localPortNumber":["15432"]}'
# In another terminal: connect with psql
psql "postgresql://OPS_USER@localhost:15432/restartix?sslmode=require"
# Password retrieved from Secrets Manager separately

Audit-logged via CloudTrail. Read-only ops should use the read-only role; never use the migration role for ad-hoc queries.
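Assembling the --target string by hand is error-prone. A small helper (invented name, sketch only) shows the three pieces: cluster name, task ID (the tail of the task ARN), and the container's runtimeId from describe-tasks.

```shell
# Hypothetical helper: compose the SSM target "ecs:CLUSTER_TASKID_RUNTIMEID".
# $1 = cluster name, $2 = task ARN (or bare task ID), $3 = container runtimeId
ssm_target() {
  echo "ecs:$1_${2##*/}_$3"   # ${2##*/} strips everything up to the last "/"
}
# The runtimeId comes from:
#   aws ecs describe-tasks --cluster CLUSTER --tasks TASK_ARN \
#     --query 'tasks[0].containers[0].runtimeId' --output text
```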
Tail production logs
aws logs tail /ecs/restartix-production/core-api --follow
aws logs tail /ecs/restartix-production/clinic --follow --filter-pattern "ERROR"

Take a manual RDS snapshot before a risky migration
aws rds create-db-snapshot \
--db-instance-identifier restartix-production \
  --db-snapshot-identifier "pre-migration-$(date +%Y%m%d-%H%M%S)"

Manual snapshots persist until explicitly deleted. Always take one before a schema change you're nervous about.
Force a service to redeploy with the same image
aws ecs update-service \
--cluster restartix-production \
--service restartix-core-api \
  --force-new-deployment

Useful for: picking up rotated Secrets Manager values, recovering from a stuck task, validating health-check changes.
Runbooks
Runbook: deploy stuck in "in progress"
Symptom: GitHub Actions workflow has been waiting on the rolling deploy step for >15 minutes.
Diagnose:
# Check the service deployment status
aws ecs describe-services \
--cluster restartix-{env} \
--services restartix-core-api \
--query 'services[].deployments'
# Look at the events stream — failures usually show here
aws ecs describe-services \
--cluster restartix-{env} \
--services restartix-core-api \
--query 'services[].events[:10]'
# Check recent task failures
aws ecs list-tasks \
--cluster restartix-{env} \
--service-name restartix-core-api \
--desired-status STOPPED
# For a stopped task, get the stop reason
aws ecs describe-tasks \
--cluster restartix-{env} \
--tasks TASK_ID \
  --query 'tasks[].[stoppedReason,containers[].reason]'

Common causes:
- Health check failing (new image broken, dependency down, env var misconfigured)
- Insufficient capacity in the AZ (Fargate Spot evictions in staging)
- IAM role missing a required permission
- Secret missing or unreadable
Resolve:
- If deploy circuit breaker has rolled back, the deploy workflow shows failed; investigate the regression
- If still in progress and tasks are healthy, wait — drain takes time
- If genuinely stuck, manually update-service with --force-new-deployment to retry
Runbook: migration failed mid-deploy
Symptom: The deploy workflow's "run-migrations" step exited non-zero. The new application image was not deployed.
Diagnose:
# Find the migration task
aws ecs list-tasks \
--cluster restartix-{env} \
--family restartix-migrations \
--desired-status STOPPED
# Get logs
aws logs tail /ecs/restartix-{env}/migrations --since 30m

Resolve:
- Read the migration error. Common: SQL syntax in the new migration, foreign-key violation, lock timeout on a large table.
- The application is still on the previous image — service is unaffected
- Fix the migration in a new commit (forward-only — never edit the failed migration's SQL after it has partially applied)
- Re-run the deploy workflow
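One failure mode worth knowing: golang-migrate records a dirty flag when a migration exits partway through, and refuses further runs until the version is forced. A hedged sketch using the golang-migrate CLI (the version number 42 is illustrative):

```shell
# Inspect the recorded schema version; a partial failure shows as dirty
migrate -path services/api/migrations/core \
        -database "$DATABASE_DIRECT_URL" version

# After manually verifying/cleaning up the partially applied change,
# clear the dirty flag by forcing the version, then re-run the deploy
migrate -path services/api/migrations/core \
        -database "$DATABASE_DIRECT_URL" force 42
```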
Runbook: rolling deploy fails health checks
Symptom: New tasks started but never went healthy; deployment circuit breaker rolled back.
Diagnose:
# Get the new task ARN that failed
aws ecs describe-services ... --query 'services[].deployments[].failedTasks'
# Check container exit code and logs
aws ecs describe-tasks --tasks TASK_ARN
aws logs tail /ecs/restartix-{env}/core-api --since 10m

Common causes:
- App can't reach RDS / Redis (security group misconfig, secret rotation went wrong)
- Migration was applied but app still expects old schema (rare — usually the deploy ordering prevents this)
- Image is missing a runtime dependency (binary not built correctly)
Resolve:
- Circuit breaker already rolled back; service is on the previous task definition
- Fix the underlying issue
- Re-deploy
Runbook: emergency hotfix to production
When: A production-only bug needs immediate fixing and the standard PR-and-approval flow is too slow.
Steps:
- Branch off master (not from staging — staging may have unreviewed work)
- Make the minimal fix
- Open a PR, mark it urgent, request review from on-call reviewer
- Once reviewed + CI green, merge
- Trigger the production deploy workflow manually (the approval gate still applies — but the same reviewer who approved the PR can approve the deploy)
- Watch the deploy
- Post-incident, write a brief postmortem in the operations log: what broke, what was changed, what to monitor going forward
Runbook: launch-day legacy-data migration
Context: RestartiX replaces a legacy product with 20k+ users / 11k+ treatment plans / 5k+ active subscriptions. On launch day, this data migrates over.
Pre-migration:
- Production AWS environment is built and validated against the staging acceptance test list
- Application is in maintenance mode (return 503 for all requests, branded maintenance page served by Cloudflare)
- A dry run of the migration has been completed against staging using a recent snapshot of legacy data
Migration steps:
# 1. Take a manual RDS snapshot of the (empty) production DB as a known-clean starting point
aws rds create-db-snapshot \
--db-instance-identifier restartix-production \
--db-snapshot-identifier "pre-launch-$(date +%Y%m%d-%H%M%S)"
# 2. Pull the legacy database dump (source: existing legacy product host)
pg_dump $LEGACY_DATABASE_URL --format=custom --no-owner > legacy-data.pgdump
# 3. Run the data-transform pipeline (legacy schema → restartix schema). This is a one-time
# Go binary not in the standard pipeline; lives at services/migration-tools/legacy-import/.
# It reads the legacy dump, transforms records, writes via the standard repo layer (so RLS
# and audit log work normally).
go run ./services/migration-tools/legacy-import \
--source legacy-data.pgdump \
--target $DATABASE_DIRECT_URL
# 4. Validate row counts against expected targets
psql "$DATABASE_DIRECT_URL" -c "SELECT count(*) FROM organizations"
psql "$DATABASE_DIRECT_URL" -c "SELECT count(*) FROM humans"
psql "$DATABASE_DIRECT_URL" -c "SELECT count(*) FROM organization_subscriptions WHERE status='active'"
# 5. Verify a few legacy users can sign in via the production app
# 6. Take another manual snapshot — the loaded but pre-traffic state
aws rds create-db-snapshot \
--db-instance-identifier restartix-production \
--db-snapshot-identifier "post-import-$(date +%Y%m%d-%H%M%S)"
# 7. Remove maintenance mode (Cloudflare rule update)
# 8. Watch monitoring for 24h

Rollback plan (if data import was wrong):
- Restore RDS from the pre-launch-... snapshot
- Re-enable maintenance mode
- Fix the import
- Re-run
Keep the legacy database operational for 7 days post-launch as a safety net. Then decommission.
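Step 4's spot checks can be tightened into floor assertions against the known legacy volumes (20k+ users, 11k+ plans, 5k+ active subscriptions). A hedged sketch; `check_min` is an invented helper, and the thresholds should be replaced with exact counts from the dry run.

```shell
# Hypothetical validation helper: fail loudly if an imported count falls
# below the expected floor. Uses the same psql access as the steps above.
check_min() { # $1 = label, $2 = expected minimum, $3 = count query
  n=$(psql "$DATABASE_DIRECT_URL" -Atc "$3")
  [ "$n" -ge "$2" ] || { echo "FAIL: $1 has $n rows, expected >= $2"; exit 1; }
}
check_min humans      20000 "SELECT count(*) FROM humans"
check_min active_subs 5000  "SELECT count(*) FROM organization_subscriptions WHERE status='active'"
```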
Related documentation
- AWS infrastructure — full topology and cost
- IaC layout — Terraform modules and where the deploy IAM role lives
- Scaling architecture — task sizing and auto-scaling parameters
- Backup & DR — manual snapshots, PITR, restore runbooks
- Monitoring — alarms that fire during a bad deploy
- Decisions — why ECS Fargate, why Terraform, why manual approval gate