IaC Layout (Terraform)

The Terraform module structure that backs the AWS infrastructure documented in aws-infrastructure.md. This is the source of truth for "where does the configuration for X actually live?"

Terraform was the chosen IaC tool for the reasons in decisions.md → Why Terraform for infrastructure as code. The state backend is S3 with native conditional-write locking (use_lockfile = true, which requires Terraform >= 1.10), inside the same AWS account as the resources. No separate DynamoDB table — S3 itself coordinates concurrent runs via a conditional If-None-Match write on a lock-file object.


Repository layout

The Terraform code lives in the same monorepo as the application code:

restartix-platform/
├── apps/                       # Next.js apps (clinic, portal, console, docs)
├── services/                   # Go services (api/, future Layer 2: telemetry/)
├── packages/                   # Shared TS packages
├── infra/                      # ← all Terraform here
│   ├── modules/                # reusable modules (no provider config)
│   │   ├── network/
│   │   ├── database-rds/
│   │   ├── database-aurora-serverless/
│   │   ├── cache-redis/
│   │   ├── ecs-cluster/
│   │   ├── ecs-service/
│   │   ├── scheduled-tasks/
│   │   ├── storage-s3/
│   │   ├── storage-backups/
│   │   ├── email-ses/
│   │   ├── observability/
│   │   ├── deploy-iam/
│   │   ├── edge-cloudflare/
│   │   └── tfstate-backend/
│   ├── envs/
│   │   ├── bootstrap/          # ← one-time apply with local state
│   │   ├── staging/            # ← staging composition
│   │   │   ├── main.tf
│   │   │   ├── variables.tf
│   │   │   ├── terraform.tfvars
│   │   │   ├── backend.tf
│   │   │   └── outputs.tf
│   │   └── production/         # ← production composition (built later)
│   └── README.md               # how to run Terraform locally + in CI
└── ...

Why in the monorepo: application code and infrastructure code change together — adding a new service touches both services/api/cmd/ and infra/envs/staging/. Splitting the repos creates a coordination tax for no upside.

Why modules/ separate from envs/: modules are reusable building blocks with no provider config and no hardcoded values. Environments compose modules with environment-specific variables. Same modules → same shape across staging and production.


State backend

S3 bucket:           restartix-tfstate
Region:              eu-central-1
Encryption:          SSE-S3 (Amazon S3-managed keys)
Versioning:          Enabled (allows recovery from accidental terraform state corruption)
Public access:       Blocked
Locking:             S3 native conditional writes (use_lockfile = true per env backend)

State is partitioned per environment via S3 key prefix:

s3://restartix-tfstate/staging/terraform.tfstate
s3://restartix-tfstate/production/terraform.tfstate

The infra/modules/tfstate-backend/ module bootstraps the bucket — chicken-and-egg since Terraform needs state to create state-backing resources. Bootstrap is a one-time terraform apply from a workstation with admin AWS credentials. The bootstrap composition uses local state (gitignored); per-env compositions (staging, production) use the S3 bucket as their remote backend with native locking. See infra/README.md for the bootstrap runbook.
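
For concreteness, a minimal sketch of a per-env backend config (what infra/envs/staging/backend.tf plausibly contains; requires Terraform >= 1.10 for use_lockfile):

# infra/envs/staging/backend.tf (sketch)
terraform {
  backend "s3" {
    bucket       = "restartix-tfstate"
    key          = "staging/terraform.tfstate"
    region       = "eu-central-1"
    encrypt      = true
    use_lockfile = true   # S3 native conditional-write locking; no DynamoDB table
  }
}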


Module: network

VPC, subnets, NAT, security groups, VPC endpoints. The networking foundation that every other module depends on.

Inputs: env, vpc_cidr, availability_zones, nat_strategy (one of nat-instance or nat-gateway)

Outputs: vpc_id, private_subnet_ids, public_subnet_ids, security group IDs (alb_sg, fargate_app_sg, fargate_pgbouncer_sg, rds_sg, redis_sg, migrations_runner_sg)

Key resources:

  • aws_vpc
  • aws_subnet (2 public + 2 private, one pair per AZ)
  • aws_internet_gateway
  • aws_instance configured as a NAT instance (staging) OR aws_nat_gateway (production) — there is no aws_nat_instance resource type
  • aws_route_table for public + private route tables
  • aws_security_group × 6 (per the table in aws-infrastructure.md → Networking)
  • aws_vpc_endpoint for S3 (gateway endpoint) + ECR / Secrets Manager / KMS / CloudWatch Logs (interface endpoints)

Per-env differences:

  • Staging uses nat_strategy = "nat-instance" (t4g.nano, ~$3/mo, single point of failure)
  • Production uses nat_strategy = "nat-gateway" (~$38/mo, per-AZ HA optional via nat_per_az variable for F11)
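
A sketch of how the module can switch on nat_strategy (resource internals elided; the count logic is illustrative, not the module's actual code):

# modules/network: NAT strategy switch (sketch)
locals {
  nat_gateway_count = var.nat_strategy == "nat-gateway" ? (var.nat_per_az ? length(var.availability_zones) : 1) : 0
}

resource "aws_instance" "nat" {
  count = var.nat_strategy == "nat-instance" ? 1 : 0

  instance_type     = "t4g.nano"
  source_dest_check = false   # must be off for the instance to forward traffic
  subnet_id         = aws_subnet.public[0].id
  # ... AMI lookup, security group, user_data elided ...
}

resource "aws_eip" "nat" {
  count  = local.nat_gateway_count
  domain = "vpc"
}

resource "aws_nat_gateway" "this" {
  count = local.nat_gateway_count

  subnet_id     = aws_subnet.public[count.index].id
  allocation_id = aws_eip.nat[count.index].id
}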

Module: database-rds

RDS Postgres instance with parameter group, subnet group, monitoring, and snapshot policy. Used by production.

Inputs: env, instance_class, allocated_storage, max_allocated_storage, multi_az, backup_retention_period, vpc_id, subnet_ids, security_group_ids, kms_key_id

Outputs: instance_id, writer_endpoint, reader_endpoint (when read replicas exist), port, db_name

Key resources:

  • aws_db_subnet_group
  • aws_db_parameter_group with the platform's required parameters (rds.force_ssl=1, shared_preload_libraries=pg_stat_statements, max_connections=200, etc.)
  • aws_db_instance (primary)
  • aws_db_instance × N for read replicas (Phase 2+; gated on a count variable)
  • aws_db_snapshot for intentional point-in-time snapshots, applied manually via Terraform rather than on a schedule

Notes:

  • Performance Insights enabled (free tier, 7-day retention)
  • Enhanced Monitoring enabled at 1-min granularity
  • The master_user_password is generated via random_password resource and stored in Secrets Manager — never set as a literal in the .tfvars file
  • apply_immediately = false for parameter changes — they apply during the next maintenance window unless explicitly forced
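
The master-password note above as a sketch (the secret name is hypothetical, following the restartix/{env}/* scheme; other aws_db_instance arguments elided):

resource "random_password" "master" {
  length  = 32
  special = false
}

resource "aws_secretsmanager_secret" "db_master" {
  name = "restartix/${var.env}/db-master"   # hypothetical secret name
}

resource "aws_secretsmanager_secret_version" "db_master" {
  secret_id     = aws_secretsmanager_secret.db_master.id
  secret_string = random_password.master.result
}

resource "aws_db_instance" "primary" {
  # ... engine, instance_class, storage, networking elided ...
  password = random_password.master.result
}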

Module: database-aurora-serverless

Aurora Serverless v2 cluster. Used by staging.

Inputs: env, min_capacity (ACU), max_capacity (ACU), enable_scale_to_zero, vpc_id, subnet_ids, security_group_ids, kms_key_id, backup_retention_period

Outputs: cluster_endpoint, reader_endpoint, port

Key resources:

  • aws_db_subnet_group
  • aws_rds_cluster_parameter_group (Aurora-flavored equivalent of RDS parameter group, same effective parameters)
  • aws_rds_cluster with engine_mode = "provisioned" and serverlessv2_scaling_configuration
  • aws_rds_cluster_instance × 1 (writer; no readers in staging)

Notes:

  • Same wire protocol as database-rds — application sees no difference
  • Scale-to-zero requires min_capacity = 0 and is a late-2024 feature — confirm the Terraform AWS provider version supports it (>= v5.70)
  • Backup retention 1 day is acceptable for staging
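
The scaling configuration from the resource list, sketched (credentials, subnet group, and parameter group elided):

resource "aws_rds_cluster" "this" {
  engine      = "aurora-postgresql"
  engine_mode = "provisioned"        # Serverless v2 runs under provisioned mode

  serverlessv2_scaling_configuration {
    min_capacity = var.min_capacity  # 0 ACU when enable_scale_to_zero
    max_capacity = var.max_capacity
  }
  # ... subnet group, parameter group, credentials, encryption elided ...
}

resource "aws_rds_cluster_instance" "writer" {
  cluster_identifier = aws_rds_cluster.this.id
  instance_class     = "db.serverless"
  engine             = aws_rds_cluster.this.engine
}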

Module: cache-redis

ElastiCache Redis cluster.

Inputs: env, node_type, num_cache_nodes, replicas (0 for staging, 1 for production), vpc_id, subnet_ids, security_group_ids

Outputs: primary_endpoint, port

Key resources:

  • aws_elasticache_subnet_group
  • aws_elasticache_replication_group with at_rest_encryption_enabled = true, transit_encryption_enabled = true

Notes:

  • AUTH token generated via random_password, stored in Secrets Manager
  • For staging: single-AZ single-node; for production: Multi-AZ primary + 1 replica
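
A sketch tying both notes together (replication group ID and description are illustrative):

resource "random_password" "auth" {
  length  = 32
  special = false
}

resource "aws_elasticache_replication_group" "this" {
  replication_group_id       = "restartix-${var.env}-redis"
  description                = "Platform Redis"
  node_type                  = var.node_type
  num_cache_clusters         = 1 + var.replicas   # 1 in staging, 2 in production
  automatic_failover_enabled = var.replicas > 0
  multi_az_enabled           = var.replicas > 0
  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
  auth_token                 = random_password.auth.result
  subnet_group_name          = aws_elasticache_subnet_group.this.name
  security_group_ids         = var.security_group_ids
}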

Module: ecs-cluster

The ECS cluster + capacity provider settings. Lightweight — most of the interesting configuration lives in ecs-service.

Inputs: env, enable_fargate_spot (true for staging app services, false for production)

Outputs: cluster_id, cluster_arn, cluster_name

Key resources:

  • aws_ecs_cluster
  • aws_ecs_cluster_capacity_providers (FARGATE + FARGATE_SPOT for staging)

Module: ecs-service (the workhorse)

A single Fargate service: task definition, service, target group, auto-scaling, IAM roles, log group. Every Fargate service in the architecture is one instantiation of this module.

Inputs: env, service_name, image_uri, port, cpu, memory, desired_count, min_capacity, max_capacity, target_cpu_pct, health_check_path, secrets_arns (list of Secrets Manager ARNs to inject), additional_iam_policy_arns, cluster_id, vpc_id, subnet_ids, security_group_ids, alb_listener_arn, host_header_values (for ALB host-based routing rules), use_spot (bool, staging pattern)

Outputs: service_arn, task_definition_arn, target_group_arn, task_role_arn

Key resources:

  • aws_ecs_task_definition
  • aws_ecs_service
  • aws_lb_target_group with health check on health_check_path
  • aws_lb_listener_rule matching host_header_values → forward to target group
  • aws_appautoscaling_target + aws_appautoscaling_policy (target tracking on CPU)
  • aws_iam_role for the task (task_role) + execution (execution_role)
  • aws_cloudwatch_log_group with environment-appropriate retention

Why this is the workhorse: every long-running process — Core API, Telemetry API, Clinic, Portal, Console, pgbouncer — is a module "X" { source = "../../modules/ecs-service" } block. Adding a new long-running service is a 30-line addition to infra/envs/{env}/main.tf.
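
The auto-scaling pair from the resource list, sketched. Note the resource_id needs the cluster name, assumed here as a cluster_name value available alongside cluster_id:

resource "aws_appautoscaling_target" "this" {
  service_namespace  = "ecs"
  scalable_dimension = "ecs:service:DesiredCount"
  resource_id        = "service/${var.cluster_name}/${aws_ecs_service.this.name}"
  min_capacity       = var.min_capacity
  max_capacity       = var.max_capacity
}

resource "aws_appautoscaling_policy" "cpu" {
  name               = "${var.service_name}-cpu-target"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.this.service_namespace
  scalable_dimension = aws_appautoscaling_target.this.scalable_dimension
  resource_id        = aws_appautoscaling_target.this.resource_id

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value = var.target_cpu_pct   # 70 in the staging composition below
  }
}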


Module: scheduled-tasks

EventBridge Scheduler rules + the IAM plumbing to invoke ECS RunTask on a schedule. Provisions the four foundation cron jobs called out in aws-infrastructure.md → Scheduled tasks and is reused by Layer 2 backups (storage-backups registers its own task definition; this module schedules it).

Inputs: env, cluster_arn, tasks (map of name → { task_definition_arn, container_name, command, schedule_expression, subnet_ids, security_group_ids, dead_letter_arn }), optional default_subnet_ids / default_security_group_ids to keep per-task config minimal

Outputs: schedule_arns (map of name → schedule ARN), scheduler_role_arn

Key resources:

  • aws_iam_role — single role assumed by EventBridge Scheduler; trusts scheduler.amazonaws.com
  • aws_iam_role_policy granting ecs:RunTask (scoped by task-definition family) + iam:PassRole (scoped to each task's task_role and execution_role)
  • aws_scheduler_schedule — one per entry in the tasks map, with target type ECS and the supplied container command override
  • aws_cloudwatch_log_group for scheduler-side error logs (separate from the task's own log group)

Foundation cron registry (consumed by infra/envs/{env}/main.tf):

Task                   Schedule (UTC)                               Image                Notes
audit-partition-roll   Day 1 of each month, 02:00                   restartix-core-api   Provisions next 3 monthly partitions
usage-quota-reset      Day 1 of each month, 00:05                   restartix-core-api   Resets quota counters and advances period
usage-summary-rollup   Day 1 of each month, 03:00                   restartix-core-api   Closes prior month's usage summaries
check-providers        Every 5 min (staging) / every 1 min (prod)   restartix-core-api   Healthchecks platform_service_providers
backup-runner          Daily 02:00 (production); off in staging     restartix-core-api   Wired by storage-backups (output → input here)

Notes:

  • The four Core-API-image jobs share Core API's task role + execution role — different command, same task definition family. The tasks input takes the task_definition_arn output by the Core API ecs-service module.
  • backup-runner registers its own task definition inside storage-backups (separate IAM scope: S3 write-only on the backup bucket, RDS read on the database via DATABASE_DIRECT_URL). storage-backups outputs task_definition_arn; the env composition feeds it into tasks["backup-runner"] here.
  • The schedule expression for backup-runner is a per-env Terraform variable. Staging defaults to null (rule disabled); production defaults to cron(0 2 * * ? *). Per backup-disaster-recovery.md → Staging knobs the staging cron is off by default.
  • Failure handling: every schedule sets flexible_time_window mode = "OFF" (deterministic timing) and passes an SQS dead-letter queue via dead_letter_arn for jobs that fail to launch (vs. failing mid-run, which CloudWatch alarms catch on the task side). See the sketch below.
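
A sketch of the per-task schedule resource, including the null-schedule filter behind the backup-runner note (assumes the tasks variable declares subnet_ids / security_group_ids / dead_letter_arn as optional()):

resource "aws_scheduler_schedule" "this" {
  # Entries with a null schedule_expression are simply not created (staging backup-runner)
  for_each = { for name, t in var.tasks : name => t if t.schedule_expression != null }

  name                = "restartix-${var.env}-${each.key}"
  schedule_expression = each.value.schedule_expression

  flexible_time_window {
    mode = "OFF"   # deterministic firing time
  }

  target {
    arn      = var.cluster_arn   # the ECS cluster is the RunTask target
    role_arn = aws_iam_role.scheduler.arn

    ecs_parameters {
      task_definition_arn = each.value.task_definition_arn
      launch_type         = "FARGATE"

      network_configuration {
        subnets         = each.value.subnet_ids != null ? each.value.subnet_ids : var.default_subnet_ids
        security_groups = each.value.security_group_ids != null ? each.value.security_group_ids : var.default_security_group_ids
      }
    }

    # Container command override, passed through to RunTask
    input = jsonencode({
      containerOverrides = [{
        name    = each.value.container_name
        command = each.value.command
      }]
    })

    dynamic "dead_letter_config" {
      for_each = each.value.dead_letter_arn != null ? [1] : []
      content {
        arn = each.value.dead_letter_arn
      }
    }
  }
}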

Module: storage-s3

S3 buckets with the platform's standard policy: versioning enabled, public access blocked, server-side encryption, lifecycle policy.

Inputs: env, bucket_name, lifecycle_strategy (one of none, audit-archive), enable_object_lock (bool, true for audit-archive), kms_key_id

Outputs: bucket_name, bucket_arn

Key resources:

  • aws_s3_bucket
  • aws_s3_bucket_versioning
  • aws_s3_bucket_server_side_encryption_configuration
  • aws_s3_bucket_public_access_block (all four blocks: true)
  • aws_s3_bucket_lifecycle_configuration (when lifecycle_strategy = "audit-archive": Standard → Glacier IR at 90d → Deep Archive at 365d)
  • aws_s3_bucket_object_lock_configuration (when enable_object_lock = true, COMPLIANCE mode, 7-year retention)
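
The audit-archive lifecycle rule, sketched:

resource "aws_s3_bucket_lifecycle_configuration" "this" {
  count  = var.lifecycle_strategy == "audit-archive" ? 1 : 0
  bucket = aws_s3_bucket.this.id

  rule {
    id     = "audit-archive"
    status = "Enabled"
    filter {}   # applies to every object in the bucket

    transition {
      days          = 90
      storage_class = "GLACIER_IR"    # Glacier Instant Retrieval
    }
    transition {
      days          = 365
      storage_class = "DEEP_ARCHIVE"
    }
  }
}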

Used for: the restartix-uploads-{env} bucket and the restartix-audit-archive-{env} bucket per aws-infrastructure.md → Object storage. The Layer 2 backup bucket is not provisioned here — it has different IAM, KMS, and immutability requirements and lives in storage-backups.


Module: storage-backups

The Layer 2 backup substrate per backup-disaster-recovery.md → Layer 2: immutable S3 bucket with a separate KMS context, the daily pg_dump ECS task definition, and the IAM role that bounds blast radius if the application credentials are ever compromised. Schedule wiring is done by scheduled-tasks — this module outputs the task-definition ARN.

Inputs: env, bucket_name (e.g. restartix-backups-primary-{env}), runner_image_uri, database_secret_arn (Secrets Manager ARN holding DATABASE_DIRECT_URL for pg_dump), backup_encryption_secret_arn (Secrets Manager ARN holding BACKUP_ENCRYPTION_KEY), cluster_arn, subnet_ids, security_group_ids, migrations_runner_sg_id

Outputs: bucket_name, bucket_arn, kms_key_arn, task_definition_arn, task_role_arn

Key resources:

  • aws_kms_key — separate from the platform's primary CMK. Different key policy: only the backup task role + the operations role can Decrypt; the application Fargate task role explicitly cannot. Different failure domain at the credentials layer.
  • aws_s3_bucket with aws_s3_bucket_versioning (enabled), aws_s3_bucket_object_lock_configuration (COMPLIANCE mode, 7-year retention), aws_s3_bucket_server_side_encryption_configuration (KMS, customer-managed key from this module), aws_s3_bucket_public_access_block (all four blocks: true)
  • aws_s3_bucket_lifecycle_configuration (Standard → Glacier IR at 90d → Deep Archive at 730d, per the Layer 2 spec)
  • aws_ecs_task_definition — the backup runner. Reuses Core API's image but with an entrypoint that runs the cmd/backup-runner binary (or a shell pipeline for pg_dump | gzip | openssl enc | aws s3 cp, decided in implementation). Sized small (0.25 vCPU / 1 GB).
  • aws_iam_role (task_role) with: s3:PutObject + s3:GetObject on the backup bucket only, kms:Decrypt/Encrypt/GenerateDataKey on this module's CMK only, secretsmanager:GetSecretValue on database_secret_arn and backup_encryption_secret_arn only. Cannot read the application's Fargate task role permissions.
  • aws_iam_role (execution_role) — minimal, ECR pull + CloudWatch Logs write
  • aws_security_group_rule adding the backup runner's SG to the migrations-runner family so it can reach RDS on 5432 directly (bypasses pgbouncer; pg_dump relies on session-level state that pgbouncer's transaction mode breaks)
  • aws_cloudwatch_log_group /ecs/restartix-{env}/backup-runner
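
The task role's permission boundary from the list above, as a policy-document sketch (local resource names illustrative):

data "aws_iam_policy_document" "backup_task" {
  statement {
    sid       = "BackupBucketOnly"
    actions   = ["s3:PutObject", "s3:GetObject"]
    resources = ["${aws_s3_bucket.backups.arn}/*"]
  }

  statement {
    sid       = "BackupCMKOnly"
    actions   = ["kms:Encrypt", "kms:Decrypt", "kms:GenerateDataKey"]
    resources = [aws_kms_key.backups.arn]
  }

  statement {
    sid       = "BackupSecretsOnly"
    actions   = ["secretsmanager:GetSecretValue"]
    resources = [var.database_secret_arn, var.backup_encryption_secret_arn]
  }
}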

Per-env differences:

  • Staging: bucket exists, IAM and task definition exist, KMS key exists. Cron is off by default (see scheduled-tasks notes). One end-to-end manual test against staging is the 1E.3 gate.
  • Production: same shape, cron on at launch via scheduled-tasks.

Why a dedicated module instead of a lifecycle_strategy flag on storage-s3: the spec calls for a "different failure domain at the credentials layer" — separate IAM, separate KMS context, separate ECS task with restricted permissions. Folding all that into storage-s3 would dilute the module's responsibility and put backup-specific resources behind generic flags. Keeping it dedicated also makes "what does the backup runner have access to?" answerable by reading one module file.


Module: email-ses

SES domain identity, DKIM tokens, configuration set, and the SNS topic that receives bounce/complaint events for the foundation suppression list. The DKIM/SPF/DMARC DNS records live in edge-cloudflare (this module outputs the tokens; that module places them in Cloudflare DNS) — email-ses is the AWS-side half of the pair.

Inputs: env, sender_domain (e.g. notifications-staging.restartix.pro), bounce_webhook_url (Core API endpoint URL or null for staging), kms_key_arn (envelope key for the configuration set's IP-pool config, optional)

Outputs: dkim_records (list of { name, type = "CNAME", value } to feed into edge-cloudflare), spf_record, dmarc_record, configuration_set_name, bounce_topic_arn, complaint_topic_arn

Key resources:

  • aws_sesv2_email_identity for the sender domain
  • Easy DKIM via the dkim_signing_attributes that aws_sesv2_email_identity exports (there is no separate standalone DKIM resource in the provider) — yields the three CNAME tokens
  • aws_sesv2_configuration_set (restartix-{env}) with delivery_options.tls_policy = "REQUIRE", reputation_options.enabled = true, sending_options.enabled = true
  • aws_sesv2_configuration_set_event_destination ×2 — one routing BOUNCE events to the bounce SNS topic, one routing COMPLAINT events to the complaint topic. The Core API endpoint subscribes via HTTPS subscription; SNS handles retry + DLQ.
  • aws_sns_topic × 2 (restartix-{env}-ses-bounces, restartix-{env}-ses-complaints)
  • aws_sns_topic_subscription × 2 (HTTPS, pointing at bounce_webhook_url if set)
  • aws_iam_policy_document granting ses:SendEmail / ses:SendRawEmail to the Core API task role, scoped to the configuration set ARN
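
How the DKIM handoff to edge-cloudflare can look: an output sketch built from the tokens the identity resource exports:

output "dkim_records" {
  value = [
    for token in aws_sesv2_email_identity.this.dkim_signing_attributes[0].tokens : {
      name  = "${token}._domainkey.${var.sender_domain}"
      type  = "CNAME"
      value = "${token}.dkim.amazonses.com"
    }
  ]
}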

Out of scope (manual or post-apply steps):

  • Sandbox exit — AWS Support ticket, ~24-48h human turnaround. Not Terraform-able. Tracked in 1E.3 checklist.
  • Sending-limit increases — also a support ticket.
  • Cat A provider row in platform_service_providers — populated by bootstrapProviderDefaults on first boot, not by Terraform. The restartix/{env}/email-bootstrap Secrets Manager secret (created by the env composition, not this module) holds the seed values that bootstrapping reads once.

Why split SES from observability: the SES bounce/complaint topics are application data plane (a missed bounce notification means a recipient stays in the platform's send list), not operational alerting. They have different subscribers, different retention, different IAM. Different concerns, different module.


Module: observability

CloudWatch alarms, SNS topic for operational alert fan-out, AWS Chatbot integration with Slack.

Inputs: env, alarms_email_subscribers, slack_webhook_url, references to the resources to monitor (ALB ARN, RDS instance ID, ECS cluster name, etc.)

Outputs: sns_topic_arn, dashboard_url

Key resources:

  • aws_sns_topic (restartix-alerts-{env})
  • aws_sns_topic_subscription for email + Chatbot
  • aws_cloudwatch_metric_alarm × N — see monitoring.md → CloudWatch alarms for the full list
  • aws_cloudwatch_dashboard, one per area (database, API, app health, business metrics)
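
One alarm as a sketch (metric choice and threshold illustrative; monitoring.md holds the authoritative list, and alb_arn_suffix is an assumed input):

resource "aws_cloudwatch_metric_alarm" "alb_5xx" {
  alarm_name          = "restartix-${var.env}-alb-5xx"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_ELB_5XX_Count"
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 10   # illustrative
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  dimensions          = { LoadBalancer = var.alb_arn_suffix }
  alarm_actions       = [aws_sns_topic.alerts.arn]
  ok_actions          = [aws_sns_topic.alerts.arn]
}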

Module: deploy-iam

The GitHub Actions OIDC provider + the deploy IAM role. Run once per AWS account.

Inputs: github_org, github_repo, allowed_branches (default: ["master"])

Outputs: deploy_role_arn, oidc_provider_arn

Key resources:

  • aws_iam_openid_connect_provider
  • aws_iam_role (the deploy role assumed by GitHub Actions OIDC)
  • aws_iam_role_policy with the least-privilege deploy policy (ECR push, ECS update-service, ECS RunTask for migrations, Secrets Manager read on restartix/{env}/*, CloudWatch Logs read)
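
The trust policy that scopes role assumption to this repo and the allowed branches, sketched:

data "aws_iam_policy_document" "deploy_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = [for b in var.allowed_branches : "repo:${var.github_org}/${var.github_repo}:ref:refs/heads/${b}"]
    }
  }
}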

Module: edge-cloudflare

Cloudflare resources via the cloudflare/cloudflare provider. Manages the configuration that lives at the edge — DNS records pointing at the AWS ALB, page rules, WAF custom rules, Cloudflare for SaaS hostnames API token.

Inputs: env, cloudflare_zone_id, alb_dns_name, cloudflare_account_id

Outputs: cf_api_token_secret_arn (token for the application to call Cloudflare for SaaS API)

Key resources:

  • cloudflare_record for each subdomain (*.clinic.restartix.pro, *.portal.restartix.pro, console.restartix.pro, the Cloudflare-origin hostname for SaaS)
  • cloudflare_page_rule for static-asset cache rules
  • cloudflare_ruleset for WAF custom rules (basic Day-1 ruleset, expanded post-launch)
  • cloudflare_api_token scoped to Custom Hostnames management on the SaaS subdomain — token written to AWS Secrets Manager for the application to consume

Notes:

  • The Cloudflare provider needs its own credentials — provided as the CLOUDFLARE_API_TOKEN env var to Terraform (read from Secrets Manager via the deploy workflow, never committed)
  • Production and staging share the Cloudflare zone but have separate page rules + record sets

Module: tfstate-backend

Bootstraps the S3 bucket that holds Terraform state itself. Run once per AWS account during initial setup. State locking is handled by S3 native conditional writes — no DynamoDB table needed.

Inputs: bucket_name

Outputs: state_bucket

This is the only module designed to be terraform apply'd with local state first, then migrated to remote. See infra/README.md for the bootstrap procedure.


Per-tenant infrastructure (deferred — per-entitlement, not bundled)

Every module above is per-environment (staging, production). When entitlements that require per-tenant AWS resources ship, each ships independently as its own module + catalog entry. There is no single "dedicated tenant" bundle — the schema is intentionally a single-axis tenancy_mode enum + independent paid entitlements so each operational piece can ship on its own funded timeline.

Three deferred entitlements have per-tenant infrastructure shapes today:

  • own_s3_bucket — a per-tenant S3 module (bucket + bucket policy + CORS + IAM bindings), plumbing in the storage capability to route uploads to the per-org bucket, and the exit/portability tool that uses the per-org bucket to produce GDPR export packages. Available on either tenancy mode — a shared-mode clinic can buy this addon without becoming tenancy_mode='dedicated'.
  • own_cmk — a per-tenant CMK module (KMS key + alias + key policy), plumbing in the encryption capability to wrap column-level encryption with the per-org CMK, and the crypto-shred runbook documenting how the key is retired on contract termination. Also available on either tenancy mode.
  • tenancy_mode='dedicated' provisioner — the per-tenant Clerk org provisioning code in the auth-provider abstraction + the per-org platform_service_providers override row that binds the org to its dedicated identity namespace. Flips tenancy_mode='dedicated' from a schema reservation into a sellable product.

Each ships as one PR when the first paying customer funds it. The pattern when one lands is for_each over tenant rows that hold the relevant entitlement, sourced from a config file at infra/envs/dedicated/tenants.hcl (or a per-entitlement subdir if the shapes diverge). None of these are part of foundation 1E.3.

Full deferred scope: tenant-isolation.md → Deferred design surface. Related ADRs: Why tenancy_mode is a single enum, not multi-axis and Why Terraform PR + Console finalize for dedicated-mode provisioning.


Environment composition

infra/envs/staging/main.tf is the top-level composition. It instantiates every module above with staging-specific values.

# infra/envs/staging/main.tf — abbreviated

module "network" {
  source             = "../../modules/network"
  env                = "staging"
  vpc_cidr           = "10.10.0.0/16"
  availability_zones = ["eu-central-1a", "eu-central-1b"]
  nat_strategy       = "nat-instance"
}

module "database" {
  source                  = "../../modules/database-aurora-serverless"
  env                     = "staging"
  min_capacity            = 0    # scale-to-zero
  max_capacity            = 2
  enable_scale_to_zero    = true
  vpc_id                  = module.network.vpc_id
  subnet_ids              = module.network.private_subnet_ids
  security_group_ids      = [module.network.rds_sg]
  backup_retention_period = 1
}

module "cache" {
  source             = "../../modules/cache-redis"
  env                = "staging"
  node_type          = "cache.t4g.micro"
  num_cache_nodes    = 1
  replicas           = 0
  vpc_id             = module.network.vpc_id
  subnet_ids         = module.network.private_subnet_ids
  security_group_ids = [module.network.redis_sg]
}

module "ecs_cluster" {
  source              = "../../modules/ecs-cluster"
  env                 = "staging"
  enable_fargate_spot = true
}

module "core_api" {
  source             = "../../modules/ecs-service"
  env                = "staging"
  service_name       = "core-api"
  image_uri          = "${aws_ecr_repository.core_api.repository_url}:latest"
  port               = 9000
  cpu                = 512                                 # 0.5 vCPU
  memory             = 1024                                # 1 GB
  desired_count      = 1
  min_capacity       = 1
  max_capacity       = 3
  target_cpu_pct     = 70
  health_check_path  = "/health"
  use_spot           = true
  secrets_arns       = [
    aws_secretsmanager_secret.database.arn,
    aws_secretsmanager_secret.redis.arn,
    aws_secretsmanager_secret.encryption.arn,
    aws_secretsmanager_secret.clerk.arn,
    aws_secretsmanager_secret.external.arn,
  ]
  cluster_id         = module.ecs_cluster.cluster_id
  vpc_id             = module.network.vpc_id
  subnet_ids         = module.network.private_subnet_ids
  security_group_ids = [module.network.fargate_app_sg]
  alb_listener_arn   = aws_lb_listener.https.arn
  host_header_values = ["api-staging.restartix.pro"]
}

# ... clinic, portal, console, telemetry-api, pgbouncer, all similar shape ...

module "uploads_bucket" {
  source             = "../../modules/storage-s3"
  env                = "staging"
  bucket_name        = "restartix-uploads-staging"
  lifecycle_strategy = "none"
}

module "audit_archive_bucket" {
  source              = "../../modules/storage-s3"
  env                 = "staging"
  bucket_name         = "restartix-audit-archive-staging"
  lifecycle_strategy  = "audit-archive"
  enable_object_lock  = true
}

module "backups" {
  source                        = "../../modules/storage-backups"
  env                           = "staging"
  bucket_name                   = "restartix-backups-primary-staging"
  runner_image_uri              = "${aws_ecr_repository.core_api.repository_url}:latest"
  database_secret_arn           = aws_secretsmanager_secret.database.arn
  backup_encryption_secret_arn  = aws_secretsmanager_secret.encryption.arn
  cluster_arn                   = module.ecs_cluster.cluster_arn
  subnet_ids                    = module.network.private_subnet_ids
  security_group_ids            = [module.network.migrations_runner_sg]
  migrations_runner_sg_id       = module.network.migrations_runner_sg
}

module "email" {
  source             = "../../modules/email-ses"
  env                = "staging"
  sender_domain      = "notifications-staging.restartix.pro"
  bounce_webhook_url = null  # Core API endpoint not wired in staging until 1E.3
}

module "scheduled_tasks" {
  source      = "../../modules/scheduled-tasks"
  env         = "staging"
  cluster_arn = module.ecs_cluster.cluster_arn

  default_subnet_ids         = module.network.private_subnet_ids
  default_security_group_ids = [module.network.fargate_app_sg]

  tasks = {
    audit-partition-roll = {
      task_definition_arn = module.core_api.task_definition_arn
      container_name      = "core-api"
      command             = ["audit-partition-roll"]
      schedule_expression = "cron(0 2 1 * ? *)"
    }
    usage-quota-reset = {
      task_definition_arn = module.core_api.task_definition_arn
      container_name      = "core-api"
      command             = ["usage-quota-reset"]
      schedule_expression = "cron(5 0 1 * ? *)"
    }
    usage-summary-rollup = {
      task_definition_arn = module.core_api.task_definition_arn
      container_name      = "core-api"
      command             = ["usage-summary-rollup"]
      schedule_expression = "cron(0 3 1 * ? *)"
    }
    check-providers = {
      task_definition_arn = module.core_api.task_definition_arn
      container_name      = "core-api"
      command             = ["check-providers"]
      schedule_expression = "rate(5 minutes)"
    }
    backup-runner = {
      task_definition_arn = module.backups.task_definition_arn
      container_name      = "backup-runner"
      command             = []
      schedule_expression = null  # off by default in staging
      subnet_ids          = module.network.private_subnet_ids
      security_group_ids  = [module.network.migrations_runner_sg]
    }
  }
}

# ... observability, edge-cloudflare, deploy-iam ...

infra/envs/production/main.tf is the same shape with production-grade values (RDS Multi-AZ, on-demand Fargate, NAT Gateway, replicas, longer log retention). Same modules → guaranteed shape parity.


Secrets handling

Terraform never holds long-lived secret values. The pattern:

  1. Terraform creates the Secrets Manager secret (the container)
  2. Terraform may set an initial value if it can be generated (e.g., random_password for the RDS master password)
  3. The production values (Clerk secret key, Daily.co API key, third-party API keys) are set out-of-band via aws secretsmanager update-secret from a workstation with the operations role, never committed to Terraform .tfvars
  4. The application reads secrets at task startup via the ECS task definition secrets: block — values are resolved from Secrets Manager and injected as environment variables
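
Step 4 in container-definition terms, sketched (the env var name is illustrative; the whole secret string is injected as one variable):

# Inside the ecs-service task definition (sketch)
container_definitions = jsonencode([{
  name  = var.service_name
  image = var.image_uri
  secrets = [
    {
      name      = "DATABASE_URL"                          # illustrative env var name
      valueFrom = aws_secretsmanager_secret.database.arn  # resolved by ECS at task start
    },
  ]
  # ... portMappings, logConfiguration elided ...
}])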

terraform.tfvars files contain only non-secret environment-specific configuration (instance classes, region, AZ list, scaling bounds). Secrets stay in Secrets Manager.

Cloudflare API token for the Terraform cloudflare provider: stored in AWS Secrets Manager, fetched by the deploy workflow, set as CLOUDFLARE_API_TOKEN env var for the terraform apply step. Never committed.


Naming conventions

  • Resource names: restartix-{env}-{logical-name}. Example: restartix-staging-core-api (ECS service). S3 buckets are the exception and suffix the environment last, matching the names used above: restartix-uploads-staging, restartix-audit-archive-staging.
  • Tags on every resource:
    • Project = "restartix"
    • Environment = var.env (staging or production)
    • ManagedBy = "terraform"
    • Owner = "platform-team"
    • CostCenter = "platform" (used for AWS Cost Explorer breakdowns)
  • Terraform module variables: snake_case. Output names: snake_case.
  • AWS resource physical names: lowercase with hyphens.
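
One way to enforce the tag set without repeating it on every resource: the AWS provider's default_tags block (a sketch; setting tags per module works too):

provider "aws" {
  region = "eu-central-1"

  default_tags {
    tags = {
      Project     = "restartix"
      Environment = var.env
      ManagedBy   = "terraform"
      Owner       = "platform-team"
      CostCenter  = "platform"
    }
  }
}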

How a typical change works

  1. Edit Terraform in your branch — for example, raise max_capacity for the Core API service from 5 to 8
  2. Run terraform plan locally against staging:
    cd infra/envs/staging
    terraform plan
  3. Review the diff — should show only the auto-scaling-target update, no other drift
  4. Open a PR with the Terraform change. CI runs terraform plan again and posts the diff as a PR comment for reviewer visibility
  5. Merge the PR. GitHub Actions runs terraform apply against staging
  6. After staging is verified, manually trigger the production apply workflow (which uses the same approval gate as application deploys)

Gotchas

  • Drift between Terraform and Console clicks. If anything is changed via the AWS Console or CLI directly, terraform plan will show it as drift on the next run. Don't change resources via the Console for anything Terraform owns; emergency exceptions are logged in the operations log and reconciled into Terraform within a week.
  • Terraform state file is sensitive. It contains output values from sensitive resources (DB endpoints, KMS key ARNs, secrets ARNs). The S3 backend is encrypted and access-controlled, but never cat the state file in a shared session.
  • terraform destroy of an environment is destructive. Aurora Serverless v2 retains a final snapshot only if skip_final_snapshot = false (we set it to false); RDS likewise. Verify before running destroy.
  • Module changes affect every environment. A change to infra/modules/ecs-service/ ripples through staging and production both. Test against staging first.
  • Deploy IAM role permissions evolve. When adding a new resource type, the deploy role may need an additional IAM permission. The deploy-iam module's policy needs updating in lockstep.