Credential Rotation

Operator runbook for rotating Cat A provider credentials in platform_service_providers. Adopt this flow whenever a credential at AWS, Clerk, or any future Curated Provider is being rotated. The failure mode it prevents: a hard cutover that orphans in-flight requests still using the old credentials while some Core API instances haven't yet picked up the new row.

SQL is illustrative

SQL fragments and JSON shapes in this document are examples meant to convey shape and intent — they're not authoritative reproductions of the production schema. The real migration lives at services/api/migrations/core/000015_platform_service_providers.up.sql.

Background

Cat A capabilities (email, storage, auth, future sms / video / ai.text / payment) resolve their provider impl per call through internal/core/providers. The resolver reads platform_service_providers on AdminPool, decrypts credentials via the platform AES key, instantiates the provider impl, and caches it per-instance with a short TTL.

Property	Value
Resolution table	`platform_service_providers`
Resolver package	`services/api/internal/core/providers`
Cache TTL	`DefaultCacheTTL = 5 * time.Minute` (`resolver.go`)
Cache invalidation	Pull-based on `updated_at`; per-instance, no fan-out
Healthcheck	`cmd/check-providers` (deploy + cron)
Write surface	`PATCH /v1/admin/platform-service-providers/{id}` (Console superadmin)

The cache TTL is the rotation propagation window across the fleet: every Core API instance re-resolves at its next TTL boundary and swaps to the new credentials without a restart.

Standard rotation flow

The flow is the same for every Cat A provider; the provider-specific details (where to generate the new credential, what JSON shape Console expects) live in the per-provider sections below.

1. Generate new credentials at the provider. See the per-provider section for exact steps. Keep the old credentials in your password manager until the cutover is fully verified; they are the rollback artifact.

2. Update the platform_service_providers row via Console. Superadmin endpoint:

curl -X PATCH \
  -H 'Authorization: Bearer <superadmin-token>' \
  -H 'Content-Type: application/json' \
  -d '{ "credentials": { ... }, "config": { ... } }' \
  https://console.restartix.pro/v1/admin/platform-service-providers/<id>

The Console encrypts the credentials payload with the platform AES key, writes it to credentials_encrypted, and bumps updated_at. The state change is audited with a full diff.

3. Wait for the cache-TTL convergence window, currently ~5 minutes (DefaultCacheTTL in internal/core/providers/resolver.go). Every Core API instance re-resolves at its next TTL boundary and swaps to the new credentials without a restart.

4. Confirm status='active' across the fleet.

```sh
make check-providers
```

Or run cmd/check-providers directly against the staging / prod DB. The healthcheck walks every row, optionally pings the provider with a no-op (e.g. SES GetSendQuota, S3 HeadBucket), and stamps status='error' + last_error on broken rows.

5. Revoke the old credentials at the provider, and only after step 4 confirms the new credentials work everywhere. Zero-downtime rotation requires the provider to support two valid credential sets simultaneously; AWS IAM, Stripe, Twilio, SES, and Clerk all do.
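For operators who script the row update instead of using curl, the PATCH can be built with the Go standard library. This is a minimal sketch: `patchBody` and `newRotationRequest` are hypothetical names, the id and token are placeholders, and the request is built but not sent.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// patchBody mirrors the illustrative curl payload; the keys inside each map
// are provider-specific (see the per-provider sections).
type patchBody struct {
	Credentials map[string]string `json:"credentials"`
	Config      map[string]any    `json:"config"`
}

// newRotationRequest builds (but does not send) the superadmin PATCH.
func newRotationRequest(baseURL, id, token string, body patchBody) (*http.Request, error) {
	raw, err := json.Marshal(body)
	if err != nil {
		return nil, err
	}
	url := fmt.Sprintf("%s/v1/admin/platform-service-providers/%s", baseURL, id)
	req, err := http.NewRequest(http.MethodPatch, url, bytes.NewReader(raw))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	body := patchBody{
		Credentials: map[string]string{"access_key_id": "AKIA...", "secret_access_key": "..."},
		Config:      map[string]any{"region": "eu-central-1"},
	}
	// "example-id" and "example-token" are placeholders, not real values.
	req, err := newRotationRequest("https://console.restartix.pro", "example-id", "example-token", body)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path)
}
```

Sending the request is a single `http.DefaultClient.Do(req)` call; it is omitted here so the sketch has no side effects.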

Why we don't hot-cutover

Cache propagation is pull-based on TTL: until every instance re-resolves, some still hold the old impl. Revoking the old credentials the moment the row is updated would break in-flight requests and every instance still serving from its cache. The ~5 min TTL window is the safety net; it is larger than the p99 request budget on any Cat A capability we ship today.

The Console PATCH path calls Resolver.Invalidate(capability, orgID) for the local instance immediately after the row write, so the writing instance picks up the change on the next call. Cross-fleet propagation still relies on TTL expiry — Invalidate is per-instance and can't be fanned out remotely.

Per-provider runbooks

Email — capability `email`, provider_name `ses`

AWS SES is the platform default email provider. Per-org overrides exist for clinics that send from their own verified sender domain on a separate AWS sub-account (available on either tenancy mode as a visual-branding customization).

Where to generate new credentials.

1. Sign in to the AWS account that owns the SES identity (platform default = main AWS account; per-tenant override = clinic's dedicated sub-account).
2. IAM Console → Users → select the SES IAM user (e.g. restartix-ses-sender).
3. Security credentials → "Create access key" → use case "Application running outside AWS".
4. Save the access key ID and secret access key. AWS shows the secret once.

The IAM user's policy must allow ses:SendEmail, ses:SendRawEmail, ses:GetSendQuota (used by the healthcheck), and ses:GetIdentityVerificationAttributes for the from_address domain.

Console PATCH body shape.

json

{
  "credentials": {
    "access_key_id": "AKIA...",
    "secret_access_key": "..."
  },
  "config": {
    "region": "eu-central-1",
    "from_address": "<verified-sender-address>",
    "configuration_set": "restartix-default",
    "endpoint_url": ""
  }
}

config carries non-secret per-row settings:

Field	Purpose
`region`	AWS region of the SES identity. Requests are signed for and sent to this region's SES endpoint; IAM access keys themselves are global, not regional.
`from_address`	Verified sender. Must already be a verified SES identity in `region` (1A.18 / 1E gate).
`configuration_set`	SES configuration set used for bounce / complaint event publishing.
`endpoint_url`	Empty in production; used by integration tests to point at a LocalStack SES endpoint.

Provider-side rotation specifics. AWS IAM access keys support two simultaneous keys per user out of the box. The standard rotation:

1. Create the new access key (the user now has two active keys).
2. Run the Standard rotation flow above (steps 2–4) using the new key.
3. After the soak period, deactivate the old key in IAM (status → "Inactive") for 24 h to confirm nothing else uses it.
4. Delete the old key.

Failure modes.

status='error' with last_error containing InvalidClientTokenId or SignatureDoesNotMatch. The new credentials are wrong or were rotated at the provider before the row was updated. Mitigation: PATCH back to the old credentials, then debug the IAM key.
status='error' with MessageRejected: Email address is not verified. The from_address in config is not a verified SES identity in the target region. SES verification is a per-identity, per-region prereq — not part of credential rotation. Verify in SES Console first.
Healthcheck passes but real sends fail with Throttling. The SES sending quota or maximum send rate, which AWS sets per account and region rather than per IAM user, is below traffic. Not a rotation problem; request a higher SES sending quota.
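The failure modes above lend themselves to a small triage helper that maps a row's last_error text to a next action. This is a sketch for illustration: `classifySESError` and its category names are hypothetical, and it matches only the error strings named in this section.

```go
package main

import (
	"fmt"
	"strings"
)

// classifySESError maps a last_error string to a coarse triage category.
// The categories are illustrative labels, not values stored in the table.
func classifySESError(lastError string) string {
	switch {
	case strings.Contains(lastError, "InvalidClientTokenId"),
		strings.Contains(lastError, "SignatureDoesNotMatch"):
		return "bad-credentials" // PATCH back to the old credentials, then debug the IAM key
	case strings.Contains(lastError, "MessageRejected"):
		return "unverified-sender" // verify from_address in the SES Console first
	case strings.Contains(lastError, "Throttling"):
		return "quota" // not a rotation problem; raise the SES sending quota
	default:
		return "unknown"
	}
}

func main() {
	for _, e := range []string{
		"InvalidClientTokenId: The security token included in the request is invalid.",
		"MessageRejected: Email address is not verified.",
		"Throttling: Maximum sending rate exceeded.",
		"dial tcp: i/o timeout",
	} {
		fmt.Printf("%-20s %s\n", classifySESError(e), e)
	}
}
```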

Storage — capability `storage`, provider_name `aws_s3`

Stub

Rotation runbook lands when this provider ships in foundation. Migration of internal/integration/s3/ from env-config to the resolver is in scope for foundation 1C.2; the per-provider section here fills in alongside that work.

The expected shape (subject to change at landing time):

Credentials. {"access_key_id": "AKIA...", "secret_access_key": "..."} for an IAM user scoped to the platform's S3 bucket(s).
Config. {"bucket": "...", "region": "...", "use_path_style": false}.
Provider-side rotation. Same AWS IAM dual-key pattern as SES.

Auth — capability `auth`, provider_name `clerk`

Stub

Rotation runbook lands when this provider ships in foundation. The Clerk verifier abstraction is already provider-agnostic; foundation 1C.2 swaps the credentials source from env to the resolver. The per-provider section here fills in alongside that work.

The expected shape (subject to change at landing time):

Credentials. {"secret_key": "sk_live_...", "publishable_key": "pk_live_...", "webhook_secret": "whsec_..."}.
Config. {"jwks_url": "https://..."} or equivalent issuer config.
Provider-side rotation. Clerk supports rolling secret keys via the Clerk dashboard; the previous secret remains valid for the configured grace period. Webhook secret rotation requires updating both Clerk's signing secret and the Console row in lockstep, so coordinate the Clerk-side change with the Console PATCH.

Future Cat A providers

The provider whitelist in migration 000015 (chk_psp_capability_provider) restricts capability/provider combinations today to email/ses, storage/aws_s3, auth/clerk. Adding a new provider is a deliberate migration that extends the whitelist and adds a per-provider section to this doc in the same PR.

Capability	Provider	Status
`sms`	`twilio` (likely)	TBD when provider ships
`video`	`daily_co` (likely)	TBD when provider ships
`ai.text`	`anthropic` (likely)	TBD when provider ships
`payment`	`stripe` / `netopia` / `euplatesc`	TBD when provider ships
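The whitelist described above lives in a database CHECK constraint, but the same rule can be mirrored client-side to fail fast before a PATCH. This is a minimal sketch, assuming only the three combinations listed for migration 000015; `pair`, `allowed`, and `isAllowed` are hypothetical names, and the constraint in the migration remains the source of truth.

```go
package main

import "fmt"

// pair is one capability/provider combination.
type pair struct{ capability, provider string }

// allowed mirrors the combinations this document lists for
// chk_psp_capability_provider; it must be extended in the same PR
// as the migration that adds a new provider.
var allowed = map[pair]bool{
	{"email", "ses"}:      true,
	{"storage", "aws_s3"}: true,
	{"auth", "clerk"}:     true,
}

// isAllowed reports whether the combination is on the whitelist.
func isAllowed(capability, provider string) bool {
	return allowed[pair{capability, provider}]
}

func main() {
	fmt.Println(isAllowed("email", "ses"))
	fmt.Println(isAllowed("sms", "twilio")) // not whitelisted until its migration ships
}
```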

What to do when something goes wrong

Symptom	Likely cause	Mitigation
Healthcheck shows `status='error'` immediately after rotation.	New credentials rejected by the provider, or the encryption blob was mangled in transit.	PATCH back to the old credentials (kept in your password manager during the cutover window). Then debug the new credentials against the provider directly (`aws ses get-send-quota`, etc.) before retrying.
Some instances pass healthcheck, others fail.	Partial rotation — one instance still has stale cache.	`Invalidate` is per-instance; can't be fanned out remotely. Wait out the TTL or restart the failing instance.
Console PATCH succeeds but resolver still returns the old impl after >5 min.	Cache TTL was overridden via `Options.CacheTTL` to a longer value, or `cmd/api/main.go` mis-wired the resolver.	Check the resolver wiring in `cmd/api/main.go` for `providers.Options{CacheTTL: ...}`. If a longer TTL is intentional, wait it out; otherwise fix the wiring.
White-label override row is broken; calls fail 502 instead of falling back to platform default.	Locked behavior — fail-loud on broken per-org override (see foundation 1C.2 → Locked decisions).	Either repair the override row (PATCH with valid credentials), or set its `status='inactive'` via Console to remove it from resolution. The resolver will then resolve to the platform default.
Healthcheck row `last_health_check_at` is hours old and `status='active'`.	Healthcheck cron isn't running, or it's running but skipping this row.	Verify the cron schedule (5 min staging / 1 min prod). Manually run `cmd/check-providers` and confirm the row's `last_health_check_at` updates.

Related

Foundation 1C.2 — Curated Providers (Cat A) + Provider Resolution — sub-phase design and locked decisions
internal/core/providers package doc — runtime resolver design
platform_service_providers migration — schema and RLS
patterns.md → P50 Capability Convention — the broader convention this rotation flow sits inside
glossary.md → Integration Categories (Cat A — Curated Provider) — canonical taxonomy
key-rotation.md — orthogonal procedure for rotating the platform AES key that protects credentials_encrypted
