Credential Rotation
Operator runbook for rotating Cat A provider credentials in platform_service_providers. Adopt this flow whenever a credential at AWS, Clerk, or any future Curated Provider is being rotated. The failure mode it prevents: a hard cutover that orphans in-flight requests still using the old credentials while some Core API instances haven't yet picked up the new row.
SQL is illustrative
SQL fragments and JSON shapes in this document are examples meant to convey shape and intent — they're not authoritative reproductions of the production schema. The real migration lives at services/api/migrations/core/000015_platform_service_providers.up.sql.
Background
Cat A capabilities (email, storage, auth, future sms / video / ai.text / payment) resolve their provider impl per call through internal/core/providers. The resolver reads platform_service_providers on AdminPool, decrypts credentials via the platform AES key, instantiates the provider impl, and caches it per-instance with a short TTL.
| Property | Value |
|---|---|
| Resolution table | platform_service_providers |
| Resolver package | services/api/internal/core/providers |
| Cache TTL | DefaultCacheTTL = 5 * time.Minute (resolver.go) |
| Cache invalidation | Pull-based on updated_at; per-instance, no fan-out |
| Healthcheck | cmd/check-providers (deploy + cron) |
| Write surface | PATCH /v1/admin/platform-service-providers/{id} (Console superadmin) |
The cache TTL is the rotation propagation window across the fleet. Every Core API instance re-resolves at TTL boundary, swapping to the new credentials without restart.
Standard rotation flow
The flow is the same for every Cat A provider; the provider-specific details (where to generate the new credential, what JSON shape Console expects) live in the per-provider sections below.
1. Generate new credentials at the provider. See the per-provider section for exact steps. Keep the old credentials in your password manager until the cutover is fully verified; they're the rollback artifact.
2. Update the platform_service_providers row via Console. Superadmin endpoint:

       curl -X PATCH \
         -H 'Authorization: Bearer <superadmin-token>' \
         -H 'Content-Type: application/json' \
         -d '{ "credentials": { ... }, "config": { ... } }' \
         https://console.restartix.pro/v1/admin/platform-service-providers/<id>

   The Console encrypts the credentials payload via the platform AES key, writes it to credentials_encrypted, and bumps updated_at. The state change is audited with a full diff.
3. Wait for the cache-TTL convergence window. Currently ~5 minutes (DefaultCacheTTL in internal/core/providers/resolver.go). Every Core API instance re-resolves at TTL boundary, swapping to the new credentials without restart.
4. Confirm status='active' across the fleet.

       make check-providers

   Or run cmd/check-providers directly against the staging / prod DB. The healthcheck walks every row, optionally pings the provider with a no-op (e.g. SES GetSendQuota, S3 HeadBucket), and stamps status='error' + last_error on broken rows.
5. Revoke the old credentials at the provider. Only after step 4 confirms the new credentials work everywhere. Zero-downtime requires the provider to support two valid credential sets simultaneously; AWS IAM, Stripe, Twilio, SES, and Clerk all do.
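Steps 2 to 4 can be strung together in a small wrapper. The sketch below is illustrative, not the team's tooling: the function names, the SUPERADMIN_TOKEN environment variable, the body-file argument, and the 30-second margin are assumptions. The endpoint, the TTL, and make check-providers come from the flow above. It also assumes make check-providers exits non-zero when a row is broken, which this runbook does not state.

```sh
#!/bin/sh
# Sketch of steps 2-4 of the standard rotation flow (hypothetical helper names).

CONSOLE_URL="https://console.restartix.pro"
CACHE_TTL_SECONDS=300   # mirrors DefaultCacheTTL = 5 * time.Minute
MARGIN_SECONDS=30       # assumed safety margin, not from the runbook

# Step 2: PATCH the row. --fail makes curl exit non-zero on HTTP errors.
patch_provider_row() {
  row_id=$1
  body_file=$2
  curl --fail --silent --show-error -X PATCH \
    -H "Authorization: Bearer ${SUPERADMIN_TOKEN}" \
    -H 'Content-Type: application/json' \
    -d "@${body_file}" \
    "${CONSOLE_URL}/v1/admin/platform-service-providers/${row_id}"
}

# Step 3: how long to wait so every instance has crossed a TTL boundary.
convergence_wait_seconds() {
  echo $(( CACHE_TTL_SECONDS + MARGIN_SECONDS ))
}

# Steps 2-4 in order. Stops at the first failure, so the operator never
# reaches step 5 (revoking the old credentials) on a broken rotation.
rotate_provider_row() {
  patch_provider_row "$1" "$2" || return 1
  sleep "$(convergence_wait_seconds)"
  make check-providers
}
```

Usage would look like `rotate_provider_row <id> new-credentials.json`, with step 5 left as a deliberate manual action.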
Why we don't hot-cutover
Cache propagation is pull-based on TTL: until every instance re-resolves, some still hold the old impl. Revoking the old credentials the moment the row is updated would break in-flight requests on those instances. The ~5 min TTL window is the safety net; it's larger than the p99 request budget on any Cat A capability we ship today.
The Console PATCH path calls Resolver.Invalidate(capability, orgID) for the local instance immediately after the row write, so the writing instance picks up the change on the next call. Cross-fleet propagation still relies on TTL expiry — Invalidate is per-instance and can't be fanned out remotely.
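The race can be made concrete with a time gate. This is a minimal sketch, assuming the operator recorded the epoch second at which the PATCH returned. The function name and arguments are hypothetical, and the 300-second default mirrors DefaultCacheTTL. Passing the gate is necessary but not sufficient: the healthcheck in step 4 still has to confirm the new credentials work.

```sh
# Prints how many seconds remain before every instance is guaranteed to have
# crossed a TTL boundary since the row write (0 once the window has passed).
# An instance that resolved just before the write can hold the old impl for
# at most one full TTL, so the worst case is patched_at + ttl.
seconds_until_converged() {
  now=$1
  patched_at=$2
  ttl=${3:-300}   # mirrors DefaultCacheTTL; pass the override if CacheTTL differs
  remaining=$(( patched_at + ttl - now ))
  if [ "$remaining" -lt 0 ]; then
    remaining=0
  fi
  echo "$remaining"
}
```

For example, `seconds_until_converged "$(date +%s)" "$PATCHED_AT"` prints 0 once revocation is no longer blocked by cache staleness.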
Per-provider runbooks
Email — capability email, provider_name ses
AWS SES is the platform default email provider. Per-org overrides exist for clinics that send from their own verified sender domain on a separate AWS sub-account (available on either tenancy mode as a visual-branding customization).
Where to generate new credentials.
- Sign in to the AWS account that owns the SES identity (platform default = main AWS account; per-tenant override = clinic's dedicated sub-account).
- IAM Console → Users → select the SES IAM user (e.g. restartix-ses-sender).
- Security credentials → "Create access key" → use case "Application running outside AWS".
- Save the access key ID and secret access key. AWS shows the secret once.
The IAM user's policy must allow ses:SendEmail, ses:SendRawEmail, ses:GetSendQuota (used by the healthcheck), and ses:GetIdentityVerificationAttributes for the from_address domain.
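As a sketch of what such a policy could look like, the function below prints a policy document granting the four actions listed above. The function name is hypothetical, and Resource is left as "*" for brevity. The runbook does not say how the real policy is scoped to the from_address domain, so treat this as a starting shape rather than the production policy.

```sh
# Prints an IAM policy document with the four SES actions the runbook lists.
# Resource "*" is a simplification, not the production scoping.
print_ses_sender_policy() {
  cat <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ses:SendEmail",
        "ses:SendRawEmail",
        "ses:GetSendQuota",
        "ses:GetIdentityVerificationAttributes"
      ],
      "Resource": "*"
    }
  ]
}
EOF
}
```

The output can be saved to a file and attached to the user with `aws iam put-user-policy`, passing the file via `--policy-document file://...`.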
Console PATCH body shape.

    {
      "credentials": {
        "access_key_id": "AKIA...",
        "secret_access_key": "..."
      },
      "config": {
        "region": "eu-central-1",
        "from_address": "[email protected]",
        "configuration_set": "restartix-default",
        "endpoint_url": ""
      }
    }

config carries non-secret per-row settings:
| Field | Purpose |
|---|---|
| region | AWS region of the SES identity. IAM access keys are not region-scoped, so what must match is the region where from_address is verified. |
| from_address | Verified sender. Must already be a verified SES identity in region (1A.18 / 1E gate). |
| configuration_set | SES configuration set used for bounce / complaint event publishing. |
| endpoint_url | Empty in production; used by integration tests to point at a LocalStack SES endpoint. |
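Because from_address must already be verified in region, it can be worth checking that before sending the PATCH. The sketch below uses the AWS CLI's sesv2 get-email-identity command. The function name is hypothetical, and it assumes the CLI is configured for the account that owns the identity and that the identity is passed exactly as it is registered in SES (address or domain).

```sh
# Succeeds (exit 0) only if SES reports the identity as verified for sending
# in the given region. VerifiedForSendingStatus is a boolean in the
# GetEmailIdentity response; with --output json it prints as true or false.
from_address_is_verified() {
  identity=$1
  region=$2
  status=$(aws sesv2 get-email-identity \
    --email-identity "$identity" \
    --region "$region" \
    --query 'VerifiedForSendingStatus' \
    --output json) || return 1
  [ "$status" = "true" ]
}
```

A failing check here points at an SES verification problem, not at the credentials being rotated.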
Provider-side rotation specifics. AWS IAM allows two simultaneous access keys per user out of the box. The standard rotation:
- Create the new access key (the user now has two active keys).
- Run the Standard rotation flow above (steps 2–4) using the new key.
- After the soak period, deactivate the old key in IAM (status → "Inactive") for 24 h to confirm nothing else uses it.
- Delete the old key.
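The provider-side steps map onto a handful of AWS CLI calls. The wrappers below are a sketch: the function names are hypothetical, and the IAM user name and key IDs are passed in rather than assumed.

```sh
# Step 1: create the second access key. The JSON response carries the new
# AccessKeyId and SecretAccessKey; AWS does not show the secret again.
create_new_key() {
  aws iam create-access-key --user-name "$1"
}

# Before deactivating: when, and for which service, was the old key last used?
old_key_last_used() {
  aws iam get-access-key-last-used --access-key-id "$1"
}

# Step 3: deactivate (not delete) the old key so it can be re-enabled
# during the 24 h observation window.
deactivate_old_key() {
  aws iam update-access-key --user-name "$1" --access-key-id "$2" --status Inactive
}

# Step 4: delete the old key once nothing has complained.
delete_old_key() {
  aws iam delete-access-key --user-name "$1" --access-key-id "$2"
}
```

Deactivating before deleting keeps rollback cheap: the same update-access-key call with `--status Active` restores the old key during the observation window.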
Failure modes.
- status='error' with last_error containing InvalidClientTokenId or SignatureDoesNotMatch. The new credentials are wrong or were rotated at the provider before the row was updated. Mitigation: PATCH back to the old credentials, then debug the IAM key.
- status='error' with MessageRejected: Email address is not verified. The from_address in config is not a verified SES identity in the target region. SES verification is a per-identity, per-region prereq, not part of credential rotation. Verify in SES Console first.
- Healthcheck passes but real sends fail with Throttling. The account's SES sending quota in that region is below traffic. Not a rotation problem; raise the SES sending quota.
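The three failure modes can be told apart by substring-matching the error text. A minimal sketch, assuming last_error stores the provider's error code and message verbatim; the function name and the category labels are hypothetical.

```sh
# Maps a last_error string to one of the failure modes listed above.
classify_ses_error() {
  case $1 in
    *InvalidClientTokenId*|*SignatureDoesNotMatch*)
      echo "bad-credentials" ;;       # PATCH back to the old credentials
    *MessageRejected*"not verified"*)
      echo "unverified-identity" ;;   # verify from_address in SES first
    *Throttling*)
      echo "throttled" ;;             # raise the SES sending quota
    *)
      echo "unknown" ;;
  esac
}
```

Only the first category is a rotation problem; the other two call for SES-side fixes and do not warrant rolling the credentials back.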
Storage — capability storage, provider_name aws_s3
Stub
Rotation runbook lands when this provider ships in foundation. Migration of internal/integration/s3/ from env-config to the resolver is in scope for foundation 1C.2; the per-provider section here fills in alongside that work.
The expected shape (subject to change at landing time):
- Credentials. {"access_key_id": "AKIA...", "secret_access_key": "..."} for an IAM user scoped to the platform's S3 bucket(s).
- Config. {"bucket": "...", "region": "...", "use_path_style": false}.
- Provider-side rotation. Same AWS IAM dual-key pattern as SES.
Auth — capability auth, provider_name clerk
Stub
Rotation runbook lands when this provider ships in foundation. The Clerk verifier abstraction is already provider-agnostic; foundation 1C.2 swaps the credentials source from env to the resolver. The per-provider section here fills in alongside that work.
The expected shape (subject to change at landing time):
- Credentials. {"secret_key": "sk_live_...", "publishable_key": "pk_live_...", "webhook_secret": "whsec_..."}.
- Config. {"jwks_url": "https://..."} or equivalent issuer config.
- Provider-side rotation. Clerk supports rolling secret keys via the Clerk dashboard; the previous secret remains valid for the configured grace period. Webhook secret rotation requires updating both Clerk's signing secret and the Console row in lockstep, so coordinate the two PATCHes.
Future Cat A providers
The provider whitelist in migration 000015 (chk_psp_capability_provider) restricts capability/provider combinations today to email/ses, storage/aws_s3, auth/clerk. Adding a new provider is a deliberate migration that extends the whitelist and adds a per-provider section to this doc in the same PR.
| Capability | Provider | Status |
|---|---|---|
| sms | twilio (likely) | TBD when provider ships |
| video | daily_co (likely) | TBD when provider ships |
| ai.text | anthropic (likely) | TBD when provider ships |
| payment | stripe / netopia / euplatesc | TBD when provider ships |
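Until a migration extends the whitelist, a PATCH or insert for any other capability/provider pair is expected to be rejected. The check below mirrors the three pairs named above as a pre-flight guard. It is illustrative only: the function name is hypothetical, and the authoritative list is the chk_psp_capability_provider constraint in migration 000015.

```sh
# Succeeds only for the capability/provider pairs this runbook says are
# allowed today. Extend it in the same PR that extends the whitelist.
is_whitelisted_pair() {
  case "$1/$2" in
    email/ses|storage/aws_s3|auth/clerk) return 0 ;;
    *) return 1 ;;
  esac
}
```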
What to do when something goes wrong
| Symptom | Likely cause | Mitigation |
|---|---|---|
| Healthcheck shows status='error' immediately after rotation. | New credentials rejected by the provider, or the encryption blob was mangled in transit. | PATCH back to the old credentials (kept in your password manager during the cutover window). Then debug the new credentials against the provider directly (aws ses get-send-quota, etc.) before retrying. |
| Some instances pass healthcheck, others fail. | Partial rotation — one instance still has stale cache. | Invalidate is per-instance; can't be fanned out remotely. Wait out the TTL or restart the failing instance. |
| Console PATCH succeeds but resolver still returns the old impl after >5 min. | Cache TTL was overridden via Options.CacheTTL to a longer value, or cmd/api/main.go mis-wired the resolver. | Check the resolver wiring in cmd/api/main.go for providers.Options{CacheTTL: ...}. If a longer TTL is intentional, wait it out; otherwise fix the wiring. |
| White-label override row is broken; calls fail 502 instead of falling back to platform default. | Locked behavior — fail-loud on broken per-org override (see foundation 1C.2 → Locked decisions). | Either repair the override row (PATCH with valid credentials), or set its status='inactive' via Console to remove it from resolution. The resolver will then resolve to the platform default. |
| Healthcheck row last_error_at is hours old and status='active'. | Healthcheck cron isn't running, or it's running but skipping this row. | Verify the cron schedule (5 min staging / 1 min prod). Manually run cmd/check-providers and confirm the row's last_health_check_at updates. |
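For the last row of the table, staleness can be judged against the cron interval. A minimal sketch, assuming the operator has already converted the row's last_health_check_at to epoch seconds. The function name is hypothetical, and tolerating three missed runs before calling the row stale is an assumption, not a documented threshold.

```sh
# Succeeds when the last healthcheck is older than the tolerated number of
# missed cron runs. interval is 300 (staging) or 60 (prod) per the table.
healthcheck_is_stale() {
  now=$1
  last_check=$2
  interval=$3
  missed_runs_allowed=${4:-3}   # assumed tolerance
  [ $(( now - last_check )) -gt $(( interval * missed_runs_allowed )) ]
}
```

For example, `healthcheck_is_stale "$(date +%s)" "$LAST_CHECK_EPOCH" 60` succeeding on a prod row suggests the cron is not reaching that row.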
Related
- Foundation 1C.2 — Curated Providers (Cat A) + Provider Resolution — sub-phase design and locked decisions
- internal/core/providers package doc: runtime resolver design
- platform_service_providers migration: schema and RLS
- patterns.md → P50 Capability Convention: the broader convention this rotation flow sits inside
- glossary.md → Integration Categories (Cat A — Curated Provider) — canonical taxonomy
- key-rotation.md: orthogonal procedure for rotating the platform AES key that protects credentials_encrypted