IaC Layout (Terraform)
The Terraform module structure that backs the AWS infrastructure documented in aws-infrastructure.md. This is the source of truth for "where does the configuration for X actually live?"
Terraform was the chosen IaC tool for the reasons in decisions.md → Why Terraform for infrastructure as code. State backend is S3 with native conditional-write locking (use_lockfile = true), inside the same AWS account as the resources. No separate DynamoDB table — S3 itself coordinates concurrent runs via If-None-Match on a lock-file object.
Repository layout
The Terraform code lives in the same monorepo as the application code:
restartix-platform/
├── apps/ # Next.js apps (clinic, portal, console, docs)
├── services/ # Go services (api/, future Layer 2: telemetry/)
├── packages/ # Shared TS packages
├── infra/ # ← all Terraform here
│ ├── modules/ # reusable modules (no provider config)
│ │ ├── network/
│ │ ├── database-rds/
│ │ ├── database-aurora-serverless/
│ │ ├── cache-redis/
│ │ ├── ecs-cluster/
│ │ ├── ecs-service/
│ │ ├── scheduled-tasks/
│ │ ├── storage-s3/
│ │ ├── storage-backups/
│ │ ├── email-ses/
│ │ ├── observability/
│ │ ├── deploy-iam/
│ │ ├── edge-cloudflare/
│ │ └── tfstate-backend/
│ ├── envs/
│ │ ├── bootstrap/ # ← one-time apply with local state
│ │ ├── staging/ # ← staging composition
│ │ │ ├── main.tf
│ │ │ ├── variables.tf
│ │ │ ├── terraform.tfvars
│ │ │ ├── backend.tf
│ │ │ └── outputs.tf
│ │ └── production/ # ← production composition (built later)
│ └── README.md # how to run Terraform locally + in CI
└── ...

Why in the monorepo: application code and infrastructure code change together — adding a new service touches both services/api/cmd/ and infra/envs/staging/. Splitting the repos creates a coordination tax for no upside.
Why modules/ separate from envs/: modules are reusable building blocks with no provider config and no hardcoded values. Environments compose modules with environment-specific variables. Same modules → same shape across staging and production.
State backend
S3 bucket: restartix-tfstate
Region: eu-central-1
Encryption: server-side encryption with Amazon S3-managed keys (SSE-S3)
Versioning: Enabled (allows recovery from accidental terraform state corruption)
Public access: Blocked
Locking: S3 native conditional writes (use_lockfile = true per env backend)

State is partitioned per environment via S3 key prefix:
s3://restartix-tfstate/staging/terraform.tfstate
s3://restartix-tfstate/production/terraform.tfstate

The infra/modules/tfstate-backend/ module bootstraps the bucket — a chicken-and-egg problem, since Terraform needs state before it can create the state-backing resources. Bootstrap is a one-time terraform apply from a workstation with admin AWS credentials. The bootstrap composition uses local state (gitignored); per-env compositions (staging, production) use the S3 bucket as their remote backend with native locking. See infra/README.md for the bootstrap runbook.
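Under this scheme, each env's backend.tf is a short sketch like the following (bucket, key, and region from above; use_lockfile requires Terraform >= 1.10, which introduced S3 native locking):

```hcl
# infra/envs/staging/backend.tf — sketch
terraform {
  backend "s3" {
    bucket       = "restartix-tfstate"
    key          = "staging/terraform.tfstate"
    region       = "eu-central-1"
    encrypt      = true
    use_lockfile = true # conditional write on a lock-file object; no DynamoDB table
  }
}
```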
Module: network
VPC, subnets, NAT, security groups, VPC endpoints. The networking foundation that every other module depends on.
Inputs: env, vpc_cidr, availability_zones, nat_strategy (one of nat-instance or nat-gateway)
Outputs: vpc_id, private_subnet_ids, public_subnet_ids, security group IDs (alb_sg, fargate_app_sg, fargate_pgbouncer_sg, rds_sg, redis_sg, migrations_runner_sg)
Key resources:
- aws_vpc
- aws_subnet (2 public + 2 private, one pair per AZ)
- aws_internet_gateway
- aws_instance as NAT instance (staging) OR aws_nat_gateway (production)
- aws_route_table for public + private route tables
- aws_security_group × 6 (per the table in aws-infrastructure.md → Networking)
- aws_vpc_endpoint for S3 (gateway endpoint) + ECR / Secrets Manager / KMS / CloudWatch Logs (interface endpoints)
Per-env differences:
- Staging uses nat_strategy = "nat-instance" (t4g.nano, ~$3/mo, single point of failure)
- Production uses nat_strategy = "nat-gateway" (~$38/mo, per-AZ HA optional via nat_per_az variable for F11)
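Inside the module, the nat_strategy switch can be implemented with conditional count — a sketch only; the aws_eip.nat and data.aws_ami.nat lookups are assumed to exist elsewhere in the module:

```hcl
resource "aws_nat_gateway" "this" {
  count         = var.nat_strategy == "nat-gateway" ? 1 : 0
  allocation_id = aws_eip.nat[0].id
  subnet_id     = aws_subnet.public[0].id
}

resource "aws_instance" "nat" {
  count             = var.nat_strategy == "nat-instance" ? 1 : 0
  ami               = data.aws_ami.nat.id
  instance_type     = "t4g.nano"
  subnet_id         = aws_subnet.public[0].id
  source_dest_check = false # required so the instance can forward traffic
}
```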
Module: database-rds
RDS Postgres instance with parameter group, subnet group, monitoring, and snapshot policy. Used by production.
Inputs: env, instance_class, allocated_storage, max_allocated_storage, multi_az, backup_retention_period, vpc_id, subnet_ids, security_group_ids, kms_key_id
Outputs: instance_id, writer_endpoint, reader_endpoint (when read replicas exist), port, db_name
Key resources:
- aws_db_subnet_group
- aws_db_parameter_group with the platform's required parameters (rds.force_ssl = 1, shared_preload_libraries = pg_stat_statements, max_connections = 200, etc.)
- aws_db_instance (primary)
- aws_db_instance × N for read replicas (Phase 2+; gated on a count variable)
- aws_db_snapshot triggered manually via Terraform at intentional snapshot moments
Notes:
- Performance Insights enabled (free tier, 7-day retention)
- Enhanced Monitoring enabled at 1-min granularity
- The master_user_password is generated via a random_password resource and stored in Secrets Manager — never set as a literal in the .tfvars file
- apply_immediately = false for parameter changes — they apply during the next maintenance window unless explicitly forced
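The master-password pattern above can be sketched as follows (the secret name and resource labels are illustrative; most aws_db_instance arguments are elided):

```hcl
resource "random_password" "master" {
  length  = 32
  special = false # avoid characters RDS rejects in passwords
}

resource "aws_secretsmanager_secret" "db_master" {
  name = "restartix/${var.env}/db-master" # illustrative name
}

resource "aws_secretsmanager_secret_version" "db_master" {
  secret_id     = aws_secretsmanager_secret.db_master.id
  secret_string = random_password.master.result
}

resource "aws_db_instance" "primary" {
  # ... engine, instance_class, networking elided ...
  master_user_password = random_password.master.result
  apply_immediately    = false
}
```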
Module: database-aurora-serverless
Aurora Serverless v2 cluster. Used by staging.
Inputs: env, min_capacity (ACU), max_capacity (ACU), enable_scale_to_zero, vpc_id, subnet_ids, security_group_ids, kms_key_id, backup_retention_period
Outputs: cluster_endpoint, reader_endpoint, port
Key resources:
- aws_db_subnet_group
- aws_rds_cluster_parameter_group (Aurora-flavored equivalent of the RDS parameter group, same effective parameters)
- aws_rds_cluster with engine_mode = "provisioned" and serverlessv2_scaling_configuration
- aws_rds_cluster_instance × 1 (writer; no readers in staging)
Notes:
- Same wire protocol as database-rds — the application sees no difference
- Scale-to-zero requires min_capacity = 0 and is a late-2024 feature — confirm the Terraform AWS provider version supports it (>= v5.70)
- Backup retention of 1 day is acceptable for staging
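The cluster shape described above, as a sketch (identifiers illustrative; credentials and networking elided):

```hcl
resource "aws_rds_cluster" "this" {
  cluster_identifier = "restartix-${var.env}"
  engine             = "aurora-postgresql"
  engine_mode        = "provisioned" # Serverless v2 uses provisioned mode + a scaling block

  serverlessv2_scaling_configuration {
    min_capacity = var.min_capacity # 0 enables scale-to-zero (provider >= v5.70)
    max_capacity = var.max_capacity
  }
  # ... master credentials, subnet group, security groups, backup_retention_period elided ...
}

resource "aws_rds_cluster_instance" "writer" {
  cluster_identifier = aws_rds_cluster.this.id
  instance_class     = "db.serverless" # required class for Serverless v2 instances
  engine             = aws_rds_cluster.this.engine
}
```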
Module: cache-redis
ElastiCache Redis cluster.
Inputs: env, node_type, num_cache_nodes, replicas (0 for staging, 1 for production), vpc_id, subnet_ids, security_group_ids
Outputs: primary_endpoint, port
Key resources:
- aws_elasticache_subnet_group
- aws_elasticache_replication_group with at_rest_encryption_enabled = true, transit_encryption_enabled = true
Notes:
- AUTH token generated via random_password, stored in Secrets Manager
- For staging: single-AZ, single node; for production: Multi-AZ primary + 1 replica
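A sketch of the replication group wiring — identifiers are illustrative, and the AUTH token is assumed to come from a random_password resource as noted above:

```hcl
resource "aws_elasticache_replication_group" "this" {
  replication_group_id = "restartix-${var.env}-redis"
  description          = "restartix ${var.env} cache"
  engine               = "redis"
  node_type            = var.node_type
  num_cache_clusters   = 1 + var.replicas # 1 in staging, 2 in production

  automatic_failover_enabled = var.replicas > 0
  multi_az_enabled           = var.replicas > 0

  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
  auth_token                 = random_password.auth.result

  subnet_group_name  = aws_elasticache_subnet_group.this.name
  security_group_ids = var.security_group_ids
}
```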
Module: ecs-cluster
The ECS cluster + capacity provider settings. Lightweight — most of the interesting configuration lives in ecs-service.
Inputs: env, enable_fargate_spot (true for staging app services, false for production)
Outputs: cluster_id, cluster_name
Key resources:
- aws_ecs_cluster
- aws_ecs_cluster_capacity_providers (FARGATE + FARGATE_SPOT for staging)
Module: ecs-service (the workhorse)
A single Fargate service: task definition, service, target group, auto-scaling, IAM roles, log group. Every Fargate service in the architecture is one instantiation of this module.
Inputs: env, service_name, image_uri, port, cpu, memory, desired_count, min_capacity, max_capacity, target_cpu_pct, health_check_path, secrets_arns (list of Secrets Manager ARNs to inject), additional_iam_policy_arns, cluster_id, vpc_id, subnet_ids, security_group_ids, alb_listener_arn, host_header_values (for ALB host-based routing rules), use_spot (bool, staging pattern)
Outputs: service_arn, task_definition_arn, target_group_arn, task_role_arn
Key resources:
- aws_ecs_task_definition
- aws_ecs_service
- aws_lb_target_group with health check on health_check_path
- aws_lb_listener_rule matching host_header_values → forward to target group
- aws_appautoscaling_target + aws_appautoscaling_policy (target tracking on CPU)
- aws_iam_role for the task (task_role) + execution (execution_role)
- aws_cloudwatch_log_group with environment-appropriate retention
Why this is the workhorse: every long-running process — Core API, Telemetry API, Clinic, Portal, Console, pgbouncer — is a module "X" { source = "../../modules/ecs-service" } block. Adding a new long-running service is a 30-line addition to infra/envs/{env}/main.tf.
Module: scheduled-tasks
EventBridge Scheduler rules + the IAM plumbing to invoke ECS RunTask on a schedule. Provisions the four foundation cron jobs called out in aws-infrastructure.md → Scheduled tasks and is reused by Layer 2 backups (storage-backups registers its own task definition; this module schedules it).
Inputs: env, cluster_arn, tasks (map of name → { task_definition_arn, container_name, command, schedule_expression, subnet_ids, security_group_ids, dead_letter_arn }), optional default_subnet_ids / default_security_group_ids to keep per-task config minimal
Outputs: schedule_arns (map of name → schedule ARN), scheduler_role_arn
Key resources:
- aws_iam_role — single role assumed by EventBridge Scheduler; trusts scheduler.amazonaws.com
- aws_iam_role_policy granting ecs:RunTask (scoped by task-definition family) + iam:PassRole (scoped to each task's task_role and execution_role)
- aws_scheduler_schedule — one per entry in the tasks map, with target type ECS and the supplied container command override
- aws_cloudwatch_log_group for scheduler-side error logs (separate from the task's own log group)
Foundation cron registry (consumed by infra/envs/{env}/main.tf):
| Task | Schedule (UTC) | Image | Notes |
|---|---|---|---|
| audit-partition-roll | Day 1 of each month, 02:00 | restartix-core-api | Provisions next 3 monthly partitions |
| usage-quota-reset | Day 1 of each month, 00:05 | restartix-core-api | Resets quota counters and advances period |
| usage-summary-rollup | Day 1 of each month, 03:00 | restartix-core-api | Closes prior month's usage summaries |
| check-providers | Every 5 min (staging) / every 1 min (prod) | restartix-core-api | Healthchecks platform_service_providers |
| backup-runner | Daily 02:00 (production); off in staging | restartix-core-api | Wired by storage-backups (output → input here) |
Notes:
- The four Core-API-image jobs share Core API's task role + execution role — different command, same task definition family. The tasks input takes the task_definition_arn output by the Core API ecs-service module.
- backup-runner registers its own task definition inside storage-backups (separate IAM scope: S3 write-only on the backup bucket, RDS read on the database via DATABASE_DIRECT_URL). storage-backups outputs task_definition_arn; the env composition feeds it into tasks["backup-runner"] here.
- The schedule expression for backup-runner is a per-env Terraform variable. Staging defaults to null (rule disabled); production defaults to cron(0 2 * * ? *). Per backup-disaster-recovery.md → Staging knobs, the staging cron is off by default.
- Failure handling uses aws_scheduler_schedule.flexible_time_window set to OFF (deterministic timing) and an SQS dead-letter queue wired via dead_letter_arn for jobs that fail to launch (vs. fail mid-run, which CloudWatch alarms catch on the task side).
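The per-task schedule resource described above can be sketched like this (the tasks map shape follows this module's inputs; aws_iam_role.scheduler is assumed to be defined in the same module):

```hcl
resource "aws_scheduler_schedule" "task" {
  # null schedule_expression disables the rule (the backup-runner staging default)
  for_each = { for k, v in var.tasks : k => v if v.schedule_expression != null }

  name                = "restartix-${var.env}-${each.key}"
  schedule_expression = each.value.schedule_expression

  flexible_time_window {
    mode = "OFF" # deterministic launch times
  }

  target {
    arn      = var.cluster_arn
    role_arn = aws_iam_role.scheduler.arn

    ecs_parameters {
      task_definition_arn = each.value.task_definition_arn
      launch_type         = "FARGATE"
      network_configuration {
        subnets         = coalesce(each.value.subnet_ids, var.default_subnet_ids)
        security_groups = coalesce(each.value.security_group_ids, var.default_security_group_ids)
      }
    }

    # Container command override, per the notes above
    input = jsonencode({
      containerOverrides = [{
        name    = each.value.container_name
        command = each.value.command
      }]
    })

    dead_letter_config {
      arn = each.value.dead_letter_arn
    }
  }
}
```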
Module: storage-s3
S3 buckets with the platform's standard policy: versioning enabled, public access blocked, server-side encryption, lifecycle policy.
Inputs: env, bucket_name, lifecycle_strategy (one of none, audit-archive), enable_object_lock (bool, true for audit-archive), kms_key_id
Outputs: bucket_name, bucket_arn
Key resources:
- aws_s3_bucket
- aws_s3_bucket_versioning
- aws_s3_bucket_server_side_encryption_configuration
- aws_s3_bucket_public_access_block (all four blocks: true)
- aws_s3_bucket_lifecycle_configuration (when lifecycle_strategy = "audit-archive": Standard → Glacier IR at 90d → Deep Archive at 365d)
- aws_s3_bucket_object_lock_configuration (when enable_object_lock = true, COMPLIANCE mode, 7-year retention)
Used for: the restartix-uploads-{env} bucket and the restartix-audit-archive-{env} bucket per aws-infrastructure.md → Object storage. The Layer 2 backup bucket is not provisioned here — it has different IAM, KMS, and immutability requirements and lives in storage-backups.
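The conditional lifecycle rule for the audit-archive strategy could look like this sketch (assuming Glacier Instant Retrieval as the intermediate tier):

```hcl
resource "aws_s3_bucket_lifecycle_configuration" "audit_archive" {
  count  = var.lifecycle_strategy == "audit-archive" ? 1 : 0
  bucket = aws_s3_bucket.this.id

  rule {
    id     = "audit-archive-tiering"
    status = "Enabled"

    transition {
      days          = 90
      storage_class = "GLACIER_IR"
    }
    transition {
      days          = 365
      storage_class = "DEEP_ARCHIVE"
    }
  }
}
```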
Module: storage-backups
The Layer 2 backup substrate per backup-disaster-recovery.md → Layer 2: immutable S3 bucket with a separate KMS context, the daily pg_dump ECS task definition, and the IAM role that bounds blast radius if the application credentials are ever compromised. Schedule wiring is done by scheduled-tasks — this module outputs the task-definition ARN.
Inputs: env, bucket_name (e.g. restartix-backups-primary-{env}), runner_image_uri, database_secret_arn (Secrets Manager ARN holding DATABASE_DIRECT_URL for pg_dump), backup_encryption_secret_arn (Secrets Manager ARN holding BACKUP_ENCRYPTION_KEY), cluster_arn, subnet_ids, security_group_ids, migrations_runner_sg_id
Outputs: bucket_name, bucket_arn, kms_key_arn, task_definition_arn, task_role_arn
Key resources:
- aws_kms_key — separate from the platform's primary CMK. Different key policy: only the backup task role + the operations role can Decrypt; the application Fargate task role explicitly cannot. Different failure domain at the credentials layer.
- aws_s3_bucket with aws_s3_bucket_versioning (enabled), aws_s3_bucket_object_lock_configuration (COMPLIANCE mode, 7-year retention), aws_s3_bucket_server_side_encryption_configuration (KMS, customer-managed key from this module), aws_s3_bucket_public_access_block (all four blocks: true)
- aws_s3_bucket_lifecycle_configuration (Standard → Glacier IR at 90d → Deep Archive at 730d, per the Layer 2 spec)
- aws_ecs_task_definition — the backup runner. Reuses Core API's image but with an entrypoint that runs the cmd/backup-runner binary (or a shell pipeline for pg_dump | gzip | openssl enc | aws s3 cp, decided in implementation). Sized small (0.25 vCPU / 1 GB).
- aws_iam_role (task_role) with: s3:PutObject + s3:GetObject on the backup bucket only, kms:Decrypt/Encrypt/GenerateDataKey on this module's CMK only, secretsmanager:GetSecretValue on database_secret_arn and backup_encryption_secret_arn only. Cannot read the application's Fargate task role permissions.
- aws_iam_role (execution_role) — minimal, ECR pull + CloudWatch Logs write
- aws_security_group_rule adding the backup runner's SG to the migrations-runner family so it can reach RDS on 5432 directly (bypasses pgbouncer; pg_dump relies on session-level state, which pgbouncer in transaction mode breaks)
- aws_cloudwatch_log_group /ecs/restartix-{env}/backup-runner
Per-env differences:
- Staging: bucket exists, IAM and task definition exist, KMS key exists. Cron is off by default (see scheduled-tasks notes). One end-to-end manual test against staging is the 1E.3 gate.
- Production: same shape, cron on at launch via scheduled-tasks.
Why a dedicated module instead of a lifecycle_strategy flag on storage-s3: the spec calls for a "different failure domain at the credentials layer" — separate IAM, separate KMS context, separate ECS task with restricted permissions. Folding all that into storage-s3 would dilute the module's responsibility and put backup-specific resources behind generic flags. Keeping it dedicated also makes "what does the backup runner have access to?" answerable by reading one module file.
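The scoped task-role policy described above can be sketched as one policy document — resource labels (aws_s3_bucket.backups, aws_kms_key.backups) are assumptions about this module's internals:

```hcl
data "aws_iam_policy_document" "backup_task" {
  statement {
    sid       = "BackupBucketOnly"
    actions   = ["s3:PutObject", "s3:GetObject"]
    resources = ["${aws_s3_bucket.backups.arn}/*"]
  }

  statement {
    sid       = "BackupCmkOnly"
    actions   = ["kms:Encrypt", "kms:Decrypt", "kms:GenerateDataKey"]
    resources = [aws_kms_key.backups.arn]
  }

  statement {
    sid     = "BackupSecretsOnly"
    actions = ["secretsmanager:GetSecretValue"]
    resources = [
      var.database_secret_arn,
      var.backup_encryption_secret_arn,
    ]
  }
}
```

Everything the runner can touch is enumerable from this one document — which is the point of the dedicated module.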
Module: email-ses
SES domain identity, DKIM tokens, configuration set, and the SNS topic that receives bounce/complaint events for the foundation suppression list. The DKIM/SPF/DMARC DNS records live in edge-cloudflare (this module outputs the tokens; that module places them in Cloudflare DNS) — email-ses is the AWS-side half of the pair.
Inputs: env, sender_domain (e.g. notifications-staging.restartix.pro), bounce_webhook_url (Core API endpoint URL or null for staging), kms_key_arn (envelope key for the configuration set's IP-pool config, optional)
Outputs: dkim_records (list of { name, type = "CNAME", value } to feed into edge-cloudflare), spf_record, dmarc_record, configuration_set_name, bounce_topic_arn, complaint_topic_arn
Key resources:
- aws_sesv2_email_identity for the sender domain
- aws_sesv2_email_identity_dkim_signing_attributes to enable Easy DKIM and produce the three CNAME tokens
- aws_sesv2_configuration_set (restartix-{env}) with delivery_options.tls_policy = "REQUIRE", reputation_options.enabled = true, sending_options.enabled = true
- aws_sesv2_configuration_set_event_destination × 2 — one routing BOUNCE events to the bounce SNS topic, one routing COMPLAINT events to the complaint topic. The Core API endpoint subscribes via HTTPS subscription; SNS handles retry + DLQ.
- aws_sns_topic × 2 (restartix-{env}-ses-bounces, restartix-{env}-ses-complaints)
- aws_sns_topic_subscription × 2 (HTTPS, pointing at bounce_webhook_url if set)
- aws_iam_policy_document granting ses:SendEmail / ses:SendRawEmail to the Core API task role, scoped to the configuration set ARN
Out of scope (manual or post-apply steps):
- Sandbox exit — AWS Support ticket, ~24-48h human turnaround. Not Terraform-able. Tracked in 1E.3 checklist.
- Sending-limit increases — also a support ticket.
- Cat A provider row in platform_service_providers — populated by bootstrapProviderDefaults on first boot, not by Terraform. The restartix/{env}/email-bootstrap Secrets Manager secret (created by the env composition, not this module) holds the seed values that bootstrapping reads once.
Why split SES from observability: the SES bounce/complaint topics are application data plane (a missed bounce notification means a recipient stays in the platform's send list), not operational alerting. They have different subscribers, different retention, different IAM. Different concerns, different module.
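One of the two event destinations, as a sketch (resource labels illustrative; the COMPLAINT twin is identical apart from the event type and topic):

```hcl
resource "aws_sesv2_configuration_set_event_destination" "bounces" {
  configuration_set_name = aws_sesv2_configuration_set.this.configuration_set_name
  event_destination_name = "bounces"

  event_destination {
    enabled              = true
    matching_event_types = ["BOUNCE"]

    sns_destination {
      topic_arn = aws_sns_topic.bounces.arn
    }
  }
}
```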
Module: observability
CloudWatch alarms, SNS topic for operational alert fan-out, AWS Chatbot integration with Slack.
Inputs: env, alarms_email_subscribers, slack_webhook_url, references to the resources to monitor (ALB ARN, RDS instance ID, ECS cluster name, etc.)
Outputs: sns_topic_arn, dashboard_url
Key resources:
- aws_sns_topic (restartix-alerts-{env})
- aws_sns_topic_subscription for email + Chatbot
- aws_cloudwatch_metric_alarm × N — see monitoring.md → CloudWatch alarms for the full list
- aws_cloudwatch_dashboard per area (database, API, app health, business metrics)
Module: deploy-iam
The GitHub Actions OIDC provider + the deploy IAM role. Run once per AWS account.
Inputs: github_org, github_repo, allowed_branches (default: ["master"])
Outputs: deploy_role_arn, oidc_provider_arn
Key resources:
- aws_iam_openid_connect_provider
- aws_iam_role (the deploy role assumed by GitHub Actions via OIDC)
- aws_iam_role_policy with the least-privilege deploy policy (ECR push, ECS update-service, ECS RunTask for migrations, Secrets Manager read on restartix/{env}/*, CloudWatch Logs read)
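The deploy role's trust policy, constrained to the allowed branches, can be sketched from this module's inputs:

```hcl
data "aws_iam_policy_document" "github_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Only workflow runs on the allowed branches may assume the role
    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values = [
        for b in var.allowed_branches :
        "repo:${var.github_org}/${var.github_repo}:ref:refs/heads/${b}"
      ]
    }
  }
}
```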
Module: edge-cloudflare
Cloudflare resources via the cloudflare/cloudflare provider. Manages the configuration that lives at the edge — DNS records pointing at the AWS ALB, page rules, WAF custom rules, Cloudflare for SaaS hostnames API token.
Inputs: env, cloudflare_zone_id, alb_dns_name, cloudflare_account_id
Outputs: cf_api_token_secret_arn (token for the application to call Cloudflare for SaaS API)
Key resources:
- cloudflare_record for each subdomain (*.clinic.restartix.pro, *.portal.restartix.pro, console.restartix.pro, the Cloudflare-origin hostname for SaaS)
- cloudflare_page_rule for static-asset cache rules
- cloudflare_ruleset for WAF custom rules (basic Day-1 ruleset, expanded post-launch)
- cloudflare_api_token scoped to Custom Hostnames management on the SaaS subdomain — token written to AWS Secrets Manager for the application to consume
Notes:
- The Cloudflare provider needs its own credentials — provided as the CLOUDFLARE_API_TOKEN env var to Terraform (read from Secrets Manager via the deploy workflow, never committed)
- Production and staging share the Cloudflare zone but have separate page rules + record sets
Module: tfstate-backend
Bootstraps the S3 bucket that holds Terraform state itself. Run once per AWS account during initial setup. State locking is handled by S3 native conditional writes — no DynamoDB table needed.
Inputs: bucket_name
Outputs: state_bucket
This is the only module designed to be terraform apply'd with local state first, then migrated to remote. See infra/README.md for the bootstrap procedure.
Per-tenant infrastructure (deferred — per-entitlement, not bundled)
Every module above is per-environment (staging, production). When entitlements that require per-tenant AWS resources ship, each ships independently as its own module + catalog entry. There is no single "dedicated tenant" bundle — the schema is intentionally a single-axis tenancy_mode enum + independent paid entitlements so each operational piece can ship on its own funded timeline.
Three deferred entitlements have per-tenant infrastructure shapes today:
- own_s3_bucket — a per-tenant S3 module (bucket + bucket policy + CORS + IAM bindings), plumbing in the storage capability to route uploads to the per-org bucket, and the exit/portability tool that uses the per-org bucket to produce GDPR export packages. Available on either tenancy mode — a shared-mode clinic can buy this addon without becoming tenancy_mode='dedicated'.
- own_cmk — a per-tenant CMK module (KMS key + alias + key policy), plumbing in the encryption capability to wrap column-level encryption with the per-org CMK, and the crypto-shred runbook documenting how the key is retired on contract termination. Also available on either tenancy mode.
- tenancy_mode='dedicated' provisioner — the per-tenant Clerk org provisioning code in the auth-provider abstraction + the per-org platform_service_providers override row that binds the org to its dedicated identity namespace. Flips tenancy_mode='dedicated' from a schema reservation into a sellable product.
Each ships as one PR when the first paying customer funds it. The pattern when one lands is for_each over tenant rows that hold the relevant entitlement, sourced from a config file at infra/envs/dedicated/tenants.hcl (or a per-entitlement subdir if the shapes diverge). None of these are part of foundation 1E.3.
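When the first of these lands, the for_each composition could take a shape like this — the module path, variable schema, and field names are hypothetical, purely to illustrate the pattern:

```hcl
# Hypothetical sketch — tenant schema and module path are illustrative only
variable "tenants" {
  type = map(object({
    entitlements = set(string)
  }))
}

module "tenant_bucket" {
  source   = "../../modules/tenant-s3" # hypothetical per-tenant module
  for_each = {
    for org_id, t in var.tenants : org_id => t
    if contains(t.entitlements, "own_s3_bucket")
  }

  org_id = each.key
  env    = var.env
}
```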
Full deferred scope: tenant-isolation.md → Deferred design surface. Related ADRs: Why tenancy_mode is a single enum, not multi-axis and Why Terraform PR + Console finalize for dedicated-mode provisioning.
Environment composition
infra/envs/staging/main.tf is the top-level composition. It instantiates every module above with staging-specific values.
# infra/envs/staging/main.tf — abbreviated
module "network" {
source = "../../modules/network"
env = "staging"
vpc_cidr = "10.10.0.0/16"
availability_zones = ["eu-central-1a", "eu-central-1b"]
nat_strategy = "nat-instance"
}
module "database" {
source = "../../modules/database-aurora-serverless"
env = "staging"
min_capacity = 0 # scale-to-zero
max_capacity = 2
enable_scale_to_zero = true
vpc_id = module.network.vpc_id
subnet_ids = module.network.private_subnet_ids
security_group_ids = [module.network.rds_sg]
backup_retention_period = 1
}
module "cache" {
source = "../../modules/cache-redis"
env = "staging"
node_type = "cache.t4g.micro"
num_cache_nodes = 1
replicas = 0
vpc_id = module.network.vpc_id
subnet_ids = module.network.private_subnet_ids
security_group_ids = [module.network.redis_sg]
}
module "ecs_cluster" {
source = "../../modules/ecs-cluster"
env = "staging"
enable_fargate_spot = true
}
module "core_api" {
source = "../../modules/ecs-service"
env = "staging"
service_name = "core-api"
image_uri = "${aws_ecr_repository.core_api.repository_url}:latest"
port = 9000
cpu = 512 # 0.5 vCPU
memory = 1024 # 1 GB
desired_count = 1
min_capacity = 1
max_capacity = 3
target_cpu_pct = 70
health_check_path = "/health"
use_spot = true
secrets_arns = [
aws_secretsmanager_secret.database.arn,
aws_secretsmanager_secret.redis.arn,
aws_secretsmanager_secret.encryption.arn,
aws_secretsmanager_secret.clerk.arn,
aws_secretsmanager_secret.external.arn,
]
cluster_id = module.ecs_cluster.cluster_id
vpc_id = module.network.vpc_id
subnet_ids = module.network.private_subnet_ids
security_group_ids = [module.network.fargate_app_sg]
alb_listener_arn = aws_lb_listener.https.arn
host_header_values = ["api-staging.restartix.pro"]
}
# ... clinic, portal, console, telemetry-api, pgbouncer, all similar shape ...
module "uploads_bucket" {
source = "../../modules/storage-s3"
env = "staging"
bucket_name = "restartix-uploads-staging"
lifecycle_strategy = "none"
}
module "audit_archive_bucket" {
source = "../../modules/storage-s3"
env = "staging"
bucket_name = "restartix-audit-archive-staging"
lifecycle_strategy = "audit-archive"
enable_object_lock = true
}
module "backups" {
source = "../../modules/storage-backups"
env = "staging"
bucket_name = "restartix-backups-primary-staging"
runner_image_uri = "${aws_ecr_repository.core_api.repository_url}:latest"
database_secret_arn = aws_secretsmanager_secret.database.arn
backup_encryption_secret_arn = aws_secretsmanager_secret.encryption.arn
cluster_arn = module.ecs_cluster.cluster_arn
subnet_ids = module.network.private_subnet_ids
security_group_ids = [module.network.migrations_runner_sg]
migrations_runner_sg_id = module.network.migrations_runner_sg
}
module "email" {
source = "../../modules/email-ses"
env = "staging"
sender_domain = "notifications-staging.restartix.pro"
bounce_webhook_url = null # Core API endpoint not wired in staging until 1E.3
}
module "scheduled_tasks" {
source = "../../modules/scheduled-tasks"
env = "staging"
cluster_arn = module.ecs_cluster.cluster_arn
default_subnet_ids = module.network.private_subnet_ids
default_security_group_ids = [module.network.fargate_app_sg]
tasks = {
audit-partition-roll = {
task_definition_arn = module.core_api.task_definition_arn
container_name = "core-api"
command = ["audit-partition-roll"]
schedule_expression = "cron(0 2 1 * ? *)"
}
usage-quota-reset = {
task_definition_arn = module.core_api.task_definition_arn
container_name = "core-api"
command = ["usage-quota-reset"]
schedule_expression = "cron(5 0 1 * ? *)"
}
usage-summary-rollup = {
task_definition_arn = module.core_api.task_definition_arn
container_name = "core-api"
command = ["usage-summary-rollup"]
schedule_expression = "cron(0 3 1 * ? *)"
}
check-providers = {
task_definition_arn = module.core_api.task_definition_arn
container_name = "core-api"
command = ["check-providers"]
schedule_expression = "rate(5 minutes)"
}
backup-runner = {
task_definition_arn = module.backups.task_definition_arn
container_name = "backup-runner"
command = []
schedule_expression = null # off by default in staging
subnet_ids = module.network.private_subnet_ids
security_group_ids = [module.network.migrations_runner_sg]
}
}
}
# ... observability, edge-cloudflare, deploy-iam ...

infra/envs/production/main.tf is the same shape with production-grade values (RDS Multi-AZ, on-demand Fargate, NAT Gateway, replicas, longer log retention). Same modules → guaranteed shape parity.
Secrets handling
Terraform never holds long-lived secret values. The pattern:
- Terraform creates the Secrets Manager secret (the container)
- Terraform may set an initial value if it can be generated (e.g., random_password for the RDS master password)
- The production values (Clerk secret key, Daily.co API key, third-party API keys) are set out-of-band via aws secretsmanager update-secret from a workstation with the operations role, never committed to Terraform .tfvars
- The application reads secrets at task startup via the ECS task definition secrets: block — values are resolved from Secrets Manager and injected as environment variables
terraform.tfvars files contain only non-secret environment-specific configuration (instance classes, region, AZ list, scaling bounds). Secrets stay in Secrets Manager.
Cloudflare API token for the Terraform cloudflare provider: stored in AWS Secrets Manager, fetched by the deploy workflow, set as CLOUDFLARE_API_TOKEN env var for the terraform apply step. Never committed.
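The out-of-band update step looks roughly like this — the secret name is illustrative (it follows the restartix/{env}/* convention), and the command must run under the operations role:

```bash
# Illustrative secret name; run from a workstation with the operations role,
# never from CI and never with values committed to the repo
aws secretsmanager update-secret \
  --secret-id restartix/production/clerk \
  --secret-string "$(cat clerk-secret.json)"
```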
Naming conventions
- Resource names: restartix-{env}-{logical-name}. Examples: restartix-staging-core-api (ECS service), restartix-production-uploads (S3 bucket).
- Tags on every resource:
  - Project = "restartix"
  - Environment = var.env (staging or production)
  - ManagedBy = "terraform"
  - Owner = "platform-team"
  - CostCenter = "platform" (used for AWS Cost Explorer breakdowns)
- Terraform module variables: snake_case. Output names: snake_case.
- AWS resource physical names: lowercase with hyphens.
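One way to enforce the tag set without repeating it per resource is provider-level default_tags in each env composition — a sketch using the values above:

```hcl
provider "aws" {
  region = "eu-central-1"

  default_tags {
    tags = {
      Project     = "restartix"
      Environment = var.env
      ManagedBy   = "terraform"
      Owner       = "platform-team"
      CostCenter  = "platform"
    }
  }
}
```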
How a typical change works
- Edit Terraform in your branch — for example, raise max_capacity for the Core API service from 5 to 8
- Run terraform plan locally against staging:

```bash
cd infra/envs/staging
terraform plan
```

- Review the diff — it should show only the auto-scaling-target update, no other drift
- Open a PR with the Terraform change. CI runs terraform plan again and posts the diff as a PR comment for reviewer visibility
- Merge the PR. GitHub Actions runs terraform apply against staging
- After staging is verified, manually trigger the production apply workflow (which uses the same approval gate as application deploys)
Gotchas
- Drift between Terraform and Console clicks. If anything is changed via the AWS Console or CLI directly, terraform plan will show it as drift on the next run. Don't change resources via the Console for anything Terraform owns — emergency exceptions are logged in the operations log and reconciled into Terraform within a week.
- The Terraform state file is sensitive. It contains output values from sensitive resources (DB endpoints, KMS key ARNs, secrets ARNs). The S3 backend is encrypted and access-controlled, but never cat the state file in a shared session.
- terraform destroy of an environment is destructive. Aurora Serverless v2 retains a final snapshot only if skip_final_snapshot = false (we set this). RDS likewise. Verify before running destroy.
- Module changes affect every environment. A change to infra/modules/ecs-service/ ripples through both staging and production. Test against staging first.
- Deploy IAM role permissions evolve. When adding a new resource type, the deploy role may need an additional IAM permission. The deploy-iam module's policy needs updating in lockstep.
Related documentation
- AWS infrastructure — what each module deploys (the resources themselves)
- Deployment & CI/CD — how Terraform fits into the deploy pipeline
- Scaling architecture — what to change when scaling levers fire
- Backup & DR — backup-specific resources (audit-archive bucket, RDS snapshot policy)
- Decisions — why Terraform, why state in S3, why module separation