Credential Rotation

Operator runbook for rotating Cat A provider credentials in platform_service_providers. Adopt this flow whenever a credential at AWS, Clerk, or any future Curated Provider is being rotated. The failure mode it prevents: a hard cutover that orphans in-flight requests still using the old credentials while some Core API instances haven't yet picked up the new row.

SQL is illustrative

SQL fragments and JSON shapes in this document are examples meant to convey shape and intent — they're not authoritative reproductions of the production schema. The real migration lives at services/api/migrations/core/000015_platform_service_providers.up.sql.

Background

Cat A capabilities (email, storage, auth, future sms / video / ai.text / payment) resolve their provider impl per call through internal/core/providers. The resolver reads platform_service_providers on AdminPool, decrypts credentials via the platform AES key, instantiates the provider impl, and caches it per-instance with a short TTL.

| Property | Value |
| --- | --- |
| Resolution table | platform_service_providers |
| Resolver package | services/api/internal/core/providers |
| Cache TTL | DefaultCacheTTL = 5 * time.Minute (resolver.go) |
| Cache invalidation | Pull-based on updated_at; per-instance, no fan-out |
| Healthcheck | cmd/check-providers (deploy + cron) |
| Write surface | PATCH /v1/admin/platform-service-providers/{id} (Console superadmin) |

The cache TTL is the rotation propagation window across the fleet. Each Core API instance re-resolves when its cached entry expires, so every instance picks up the new credentials within one TTL of the row update, without a restart.
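The per-instance caching behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in resolver.go: the ttlCache and fakeClock types, their fields, and the load callback are hypothetical names. Only DefaultCacheTTL and the existence of a per-instance Invalidate come from this runbook. The sketch models plain TTL expiry and does not model the updated_at comparison.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// DefaultCacheTTL mirrors the value quoted in this runbook.
const DefaultCacheTTL = 5 * time.Minute

// entry is a cached provider impl plus the time it was resolved.
type entry struct {
	impl       string // stand-in for the instantiated provider impl
	resolvedAt time.Time
}

// ttlCache is a hypothetical per-instance cache keyed by capability.
type ttlCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	now     func() time.Time               // injectable clock
	load    func(capability string) string // stand-in for "read row, decrypt, instantiate"
	entries map[string]entry
}

func newTTLCache(ttl time.Duration, now func() time.Time, load func(string) string) *ttlCache {
	return &ttlCache{ttl: ttl, now: now, load: load, entries: map[string]entry{}}
}

// Resolve serves the cached impl while it is fresh and reloads it once the TTL has elapsed.
func (c *ttlCache) Resolve(capability string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[capability]; ok && c.now().Sub(e.resolvedAt) < c.ttl {
		return e.impl
	}
	impl := c.load(capability)
	c.entries[capability] = entry{impl: impl, resolvedAt: c.now()}
	return impl
}

// Invalidate drops the entry on this instance only; other instances are unaffected.
func (c *ttlCache) Invalidate(capability string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.entries, capability)
}

// fakeClock lets the example advance time without sleeping.
type fakeClock struct{ t time.Time }

func (f *fakeClock) Now() time.Time          { return f.t }
func (f *fakeClock) Advance(d time.Duration) { f.t = f.t.Add(d) }

func main() {
	clock := &fakeClock{}
	current := "old-key"
	cache := newTTLCache(DefaultCacheTTL, clock.Now, func(string) string { return current })

	fmt.Println(cache.Resolve("email")) // first call loads the row
	current = "new-key"                 // row updated via Console
	fmt.Println(cache.Resolve("email")) // cached impl is still served
	clock.Advance(DefaultCacheTTL)
	fmt.Println(cache.Resolve("email")) // TTL elapsed, row is re-read
}
```

The injectable clock is a design choice for the sketch only; it makes the convergence window observable without waiting five minutes.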


Standard rotation flow

The flow is the same for every Cat A provider; the provider-specific details (where to generate the new credential, what JSON shape Console expects) live in the per-provider sections below.

  1. Generate new credentials at the provider. See the per-provider section for exact steps. Keep the old credentials around in your password manager until the cutover is fully verified — they're the rollback artifact.

  2. Update the platform_service_providers row via Console. Superadmin endpoint:

    sh
    curl -X PATCH \
      -H 'Authorization: Bearer <superadmin-token>' \
      -H 'Content-Type: application/json' \
      -d '{ "credentials": { ... }, "config": { ... } }' \
      https://console.restartix.pro/v1/admin/platform-service-providers/<id>

    The Console encrypts the credentials payload with the platform AES key, writes it to credentials_encrypted, and bumps updated_at. The state change is audited with a full diff.

  3. Wait for the cache-TTL convergence window. Currently ~5 minutes (DefaultCacheTTL in internal/core/providers/resolver.go). Every Core API instance re-resolves at TTL boundary, swapping to the new credentials without restart.

  4. Confirm status='active' across the fleet.

    sh
    make check-providers

    Or run cmd/check-providers directly against the staging / prod DB. The healthcheck walks every row, optionally pings the provider with a no-op (e.g. SES GetSendQuota, S3 HeadBucket), and stamps status='error' + last_error on broken rows.

  5. Revoke the old credentials at the provider. Do this only after step 4 confirms the new credentials work everywhere. Zero-downtime rotation requires the provider to accept two valid credential sets simultaneously; AWS IAM (which issues the SES keys), Stripe, Twilio, and Clerk all do.
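The gate between steps 3, 4, and 5 can be expressed as a small check. The sketch below is illustrative only: rowStatus, safeToRevoke, and the row ID are hypothetical, and it assumes the operator has the row's updated_at and status at hand (for example from the healthcheck output).

```go
package main

import (
	"fmt"
	"time"
)

// DefaultCacheTTL mirrors the value quoted in this runbook.
const DefaultCacheTTL = 5 * time.Minute

// rowStatus is a hypothetical projection of one platform_service_providers row.
type rowStatus struct {
	ID        string
	Status    string    // "active", "error", ...
	UpdatedAt time.Time // bumped by the Console PATCH
}

// safeToRevoke reports whether the old credentials may be revoked (step 5):
// the TTL window since the row write must have elapsed (step 3) and the
// healthcheck must report the row as active (step 4).
func safeToRevoke(row rowStatus, now time.Time, ttl time.Duration) (bool, string) {
	if wait := row.UpdatedAt.Add(ttl).Sub(now); wait > 0 {
		return false, fmt.Sprintf("wait %s for cache convergence", wait)
	}
	if row.Status != "active" {
		return false, fmt.Sprintf("row %s has status=%q; keep the old credentials for rollback", row.ID, row.Status)
	}
	return true, "old credentials can be revoked"
}

func main() {
	patchedAt := time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)
	row := rowStatus{ID: "example-row", Status: "active", UpdatedAt: patchedAt}

	// Check once inside the convergence window and once after it.
	for _, offset := range []time.Duration{2 * time.Minute, 6 * time.Minute} {
		ok, reason := safeToRevoke(row, patchedAt.Add(offset), DefaultCacheTTL)
		fmt.Println(offset, ok, reason)
	}
}
```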

Why we don't hot-cutover

Cache propagation is pull-based on TTL: until every instance re-resolves, some still hold the old impl. Revoking the old credentials the moment the row is updated would break in-flight requests and any instance still serving from its cache. The ~5 min TTL window is the safety net; it's larger than the p99 request budget on any Cat A capability we ship today.

The Console PATCH path calls Resolver.Invalidate(capability, orgID) for the local instance immediately after the row write, so the writing instance picks up the change on the next call. Cross-fleet propagation still relies on TTL expiry — Invalidate is per-instance and can't be fanned out remotely.


Per-provider runbooks

Email — capability email, provider_name ses

AWS SES is the platform default email provider. Per-org overrides exist for clinics that send from their own verified sender domain on a separate AWS sub-account (available on either tenancy mode as a visual-branding customization).

Where to generate new credentials.

  1. Sign in to the AWS account that owns the SES identity (platform default = main AWS account; per-tenant override = clinic's dedicated sub-account).
  2. IAM Console → Users → select the SES IAM user (e.g. restartix-ses-sender).
  3. Security credentials → "Create access key" → use case "Application running outside AWS".
  4. Save the access key ID and secret access key. AWS shows the secret once.

The IAM user's policy must allow ses:SendEmail, ses:SendRawEmail, ses:GetSendQuota (used by the healthcheck), and ses:GetIdentityVerificationAttributes for the from_address domain.

Console PATCH body shape.

json
{
  "credentials": {
    "access_key_id": "AKIA...",
    "secret_access_key": "..."
  },
  "config": {
    "region": "eu-central-1",
    "from_address": "<verified-sender-address>",
    "configuration_set": "restartix-default",
    "endpoint_url": ""
  }
}

config carries non-secret per-row settings:

| Field | Purpose |
| --- | --- |
| region | AWS region of the SES identity. SES identities are regional, so this must be the region in which from_address is verified; IAM access keys themselves are global. |
| from_address | Verified sender. Must already be a verified SES identity in region (1A.18 / 1E gate). |
| configuration_set | SES configuration set used for bounce / complaint event publishing. |
| endpoint_url | Empty in production; used by integration tests to point at a LocalStack SES endpoint. |
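For operators who script the rotation, the body shape above maps onto Go types roughly as follows. This is a sketch, not the Console's own request types: patchBody, sesCredentials, sesConfig, and newRotationRequest are hypothetical names, and the sender address and row ID are placeholders. The function only builds the request; sending it is left to the caller.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// sesCredentials and sesConfig mirror the JSON shape shown in this section.
type sesCredentials struct {
	AccessKeyID     string `json:"access_key_id"`
	SecretAccessKey string `json:"secret_access_key"`
}

type sesConfig struct {
	Region           string `json:"region"`
	FromAddress      string `json:"from_address"`
	ConfigurationSet string `json:"configuration_set"`
	EndpointURL      string `json:"endpoint_url"` // empty in production
}

type patchBody struct {
	Credentials sesCredentials `json:"credentials"`
	Config      sesConfig      `json:"config"`
}

// newRotationRequest builds (but does not send) the Console PATCH request.
func newRotationRequest(baseURL, rowID, token string, body patchBody) (*http.Request, error) {
	payload, err := json.Marshal(body)
	if err != nil {
		return nil, err
	}
	url := baseURL + "/v1/admin/platform-service-providers/" + rowID
	req, err := http.NewRequest(http.MethodPatch, url, bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	body := patchBody{
		Credentials: sesCredentials{AccessKeyID: "AKIA...", SecretAccessKey: "..."},
		Config: sesConfig{
			Region:           "eu-central-1",
			FromAddress:      "sender@example.com", // placeholder sender
			ConfigurationSet: "restartix-default",
		},
	}
	req, err := newRotationRequest("https://console.restartix.pro", "example-row-id", "example-token", body)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL)
}
```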

Provider-side rotation specifics. AWS IAM allows up to two access keys per user, so the old and new keys can be active at the same time. The standard rotation:

  1. Create the new access key (the user now has two active keys).
  2. Run the Standard rotation flow above (steps 2–4) using the new key.
  3. Once step 4 has confirmed the new key across the fleet, deactivate the old key in IAM (status → "Inactive") and leave it inactive for 24 h to confirm nothing else uses it.
  4. Delete the old key.

Failure modes.

  • status='error' with last_error containing InvalidClientTokenId or SignatureDoesNotMatch. The new credentials are wrong or were rotated at the provider before the row was updated. Mitigation: PATCH back to the old credentials, then debug the IAM key.
  • status='error' with MessageRejected: Email address is not verified. The from_address in config is not a verified SES identity in the target region. SES verification is a per-identity, per-region prereq — not part of credential rotation. Verify in SES Console first.
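  • Healthcheck passes but real sends fail with Throttling. The account's SES sending quota or maximum send rate in that region is below traffic (SES quotas apply per account and region, not per IAM user). Not a rotation problem; request a higher SES sending quota.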
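The failure modes above lend themselves to a simple triage helper. The sketch below is a hypothetical convenience, not part of cmd/check-providers: mitigationFor is an invented name, and it only matches the error substrings named in this section.

```go
package main

import (
	"fmt"
	"strings"
)

// mitigationFor maps an error string (last_error or a send error) to the
// mitigation listed in this section. Matching is by substring only.
func mitigationFor(lastError string) string {
	switch {
	case strings.Contains(lastError, "InvalidClientTokenId"),
		strings.Contains(lastError, "SignatureDoesNotMatch"):
		return "PATCH back to the old credentials, then debug the IAM key"
	case strings.Contains(lastError, "Email address is not verified"):
		return "verify from_address as an SES identity in the target region"
	case strings.Contains(lastError, "Throttling"):
		return "not a rotation problem; raise the SES sending quota"
	default:
		return "unrecognized error; inspect last_error manually"
	}
}

func main() {
	// Sample inputs use only the error names quoted in this runbook.
	for _, e := range []string{
		"InvalidClientTokenId",
		"MessageRejected: Email address is not verified",
		"Throttling",
	} {
		fmt.Println(e, "=>", mitigationFor(e))
	}
}
```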

Storage — capability storage, provider_name aws_s3

Stub

Rotation runbook lands when this provider ships in foundation. Migration of internal/integration/s3/ from env-config to the resolver is in scope for foundation 1C.2; the per-provider section here fills in alongside that work.

The expected shape (subject to change at landing time):

  • Credentials. {"access_key_id": "AKIA...", "secret_access_key": "..."} for an IAM user scoped to the platform's S3 bucket(s).
  • Config. {"bucket": "...", "region": "...", "use_path_style": false}.
  • Provider-side rotation. Same AWS IAM dual-key pattern as SES.

Auth — capability auth, provider_name clerk

Stub

Rotation runbook lands when this provider ships in foundation. The Clerk verifier abstraction is already provider-agnostic; foundation 1C.2 swaps the credentials source from env to the resolver. The per-provider section here fills in alongside that work.

The expected shape (subject to change at landing time):

  • Credentials. {"secret_key": "sk_live_...", "publishable_key": "pk_live_...", "webhook_secret": "whsec_..."}.
  • Config. {"jwks_url": "https://..."} or equivalent issuer config.
  • Provider-side rotation. Clerk supports rolling secret keys via the Clerk dashboard; the previous secret remains valid for the configured grace period. Webhook secret rotation requires updating both Clerk's signing secret and the Console row in lockstep — coordinate the two PATCHes.

Future Cat A providers

The provider whitelist in migration 000015 (chk_psp_capability_provider) restricts capability/provider combinations today to email/ses, storage/aws_s3, auth/clerk. Adding a new provider is a deliberate migration that extends the whitelist and adds a per-provider section to this doc in the same PR.
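As an application-side illustration of what the whitelist permits today, the pairs can be mirrored in a lookup table. This is a sketch only: the authoritative rule is the chk_psp_capability_provider constraint in the migration, and allowedProviders and isAllowed are hypothetical names.

```go
package main

import "fmt"

// allowedProviders mirrors the capability/provider pairs this runbook lists
// for chk_psp_capability_provider. The database constraint stays authoritative.
var allowedProviders = map[string]map[string]bool{
	"email":   {"ses": true},
	"storage": {"aws_s3": true},
	"auth":    {"clerk": true},
}

// isAllowed reports whether the pair is on the whitelist. Indexing a missing
// capability yields a nil inner map, which safely reads as false.
func isAllowed(capability, provider string) bool {
	return allowedProviders[capability][provider]
}

func main() {
	fmt.Println(isAllowed("email", "ses"))  // whitelisted today
	fmt.Println(isAllowed("sms", "twilio")) // needs a migration first
}
```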

| Capability | Provider | Status |
| --- | --- | --- |
| sms | twilio (likely) | TBD when provider ships |
| video | daily_co (likely) | TBD when provider ships |
| ai.text | anthropic (likely) | TBD when provider ships |
| payment | stripe / netopia / euplatesc | TBD when provider ships |

What to do when something goes wrong

| Symptom | Likely cause | Mitigation |
| --- | --- | --- |
| Healthcheck shows status='error' immediately after rotation. | New credentials rejected by the provider, or the encryption blob was mangled in transit. | PATCH back to the old credentials (kept in your password manager during the cutover window). Then debug the new credentials against the provider directly (aws ses get-send-quota, etc.) before retrying. |
| Some instances pass healthcheck, others fail. | Partial rotation: one instance still has stale cache. | Invalidate is per-instance; can't be fanned out remotely. Wait out the TTL or restart the failing instance. |
| Console PATCH succeeds but resolver still returns the old impl after >5 min. | Cache TTL was overridden via Options.CacheTTL to a longer value, or cmd/api/main.go mis-wired the resolver. | Check the resolver wiring in cmd/api/main.go for providers.Options{CacheTTL: ...}. If a longer TTL is intentional, wait it out; otherwise fix the wiring. |
| White-label override row is broken; calls fail 502 instead of falling back to platform default. | Locked behavior: fail-loud on broken per-org override (see foundation 1C.2 → Locked decisions). | Either repair the override row (PATCH with valid credentials), or set its status='inactive' via Console to remove it from resolution. The resolver will then resolve to the platform default. |
| Healthcheck row last_error_at is hours old and status='active'. | Healthcheck cron isn't running, or it's running but skipping this row. | Verify the cron schedule (5 min staging / 1 min prod). Manually run cmd/check-providers and confirm the row's last_health_check_at updates. |
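The fail-loud behavior for a broken per-org override, and the effect of setting the override to inactive, can be sketched as a precedence rule. This is an illustration under assumptions, not the resolver's code: the row type, resolve function, and row IDs are hypothetical, and "broken" is modeled simply as any status other than 'active' or 'inactive'.

```go
package main

import (
	"errors"
	"fmt"
)

// row is a hypothetical projection of a platform_service_providers row.
type row struct {
	ID     string
	Status string // "active", "error", or "inactive"
}

// errBrokenOverride is what the caller would surface as a 502.
var errBrokenOverride = errors.New("per-org override is broken; failing loud instead of falling back")

// resolve picks the row to use for one call. override is nil when the org
// has no override row.
func resolve(override *row, platformDefault row) (row, error) {
	if override == nil || override.Status == "inactive" {
		// No override, or it was removed from resolution: use the platform default.
		return platformDefault, nil
	}
	if override.Status == "active" {
		return *override, nil
	}
	// Any other status is treated as broken: never fall back silently.
	return row{}, errBrokenOverride
}

func main() {
	platform := row{ID: "platform-ses", Status: "active"}
	overrides := []*row{
		nil,
		{ID: "org-ses", Status: "active"},
		{ID: "org-ses", Status: "error"},
		{ID: "org-ses", Status: "inactive"},
	}
	for _, o := range overrides {
		got, err := resolve(o, platform)
		fmt.Println(got.ID, err)
	}
}
```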