Encryption Key Rotation Procedure
Operational runbook for rotating the application-level AES-256-GCM key used by
services/api/internal/core/crypto/. Companion to encryption.md.
SQL is illustrative
SQL fragments in this document are examples meant to convey shape and intent — they're not authoritative reproductions of the production schema. The real migrations live in services/api/migrations/core/.
Status
The keyring infrastructure that makes online rotation possible is shipped today (1A.3): version-stamped ciphertext, multi-version Keyring interface, dual-decrypt during rotation, crypto.Init accepting either InMemoryKeyring (Phase 1, all envs) or the kmsKeyring stub (Phase 2 trigger).
Not yet shipped:
- The direct crypto.kmsKeyring real implementation: Phase 2 work (per-data-key KMS calls + per-tenant key custody / BYOK). Phase 1 (current, all envs including production) runs InMemoryKeyring loaded from the KMS-envelope-protected restartix/{env}/encryption Secrets Manager secret; the keyring IS KMS-rooted in production, just at the SM envelope layer rather than per-data-key. See aws-infrastructure.md → Direct-KMS keyring + BYOK (Phase 2+).
- The re-encryption tool: landing with the first encrypted column under live data (today only organization_billing.tax_id_encrypted carries any rows, and only post-1B). Until that exists, rotation is "deploy with both versions in ENCRYPTION_KEYS" and there is no data to re-encrypt.
- Production observability (CloudWatch alarms for "no rotation in 95 days", PagerDuty escalation, automated rotation reminders). These land in Layer 12.
The procedure below describes the full target operation. Steps marked (planned) depend on the items above and cannot be executed today.
Frequency
- Quarterly (every ~90 days) once production is live.
- Immediate on suspected key compromise.
- Before deprovisioning any human who held access to a key.
What you'll touch
| Where | What |
|---|---|
| services/api/internal/core/config/config.go | Reads ENCRYPTION_KEYS (CSV version:hex,version:hex) and ACTIVE_ENCRYPTION_VERSION (int). |
| services/api/internal/core/crypto/ | Init, Encrypt, Decrypt, Keyring, LoadInMemoryKeyringFromEnv, KMSKeyring (stub today). |
| .env.local (dev) / AWS Secrets Manager restartix/{env}/encryption (staging+prod, KMS-envelope-protected, CMK provisioned at 1E.3) | The keys themselves. |
There is no cmd/tools/reencrypt/ yet, no crypto.NewVersionedEncryptor, no crypto.NeedsReEncrypt. The real entry points are crypto.Init(keyring) at startup and the package-level Encrypt / Decrypt.
Procedure
1. Generate the new key
openssl rand -hex 32
# → 64 hex chars; this is your new key V<N+1>.
Store it in the same place as the active key (Secrets Manager in staging/prod, password manager + .env.local in dev).
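If you prefer to generate or sanity-check the key from Go rather than openssl, a minimal sketch could look like the following. The helper names newKeyHex and validateKeyHex are hypothetical and are not part of the crypto package; the only facts taken from this runbook are the 32-byte key size and the 64-character hex encoding.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// keySize is the AES-256 key length in bytes (32 bytes = 64 hex characters).
const keySize = 32

// newKeyHex (hypothetical helper) returns a fresh random 256-bit key,
// hex-encoded the same way `openssl rand -hex 32` prints it.
func newKeyHex() (string, error) {
	key := make([]byte, keySize)
	if _, err := rand.Read(key); err != nil {
		return "", fmt.Errorf("read random bytes: %w", err)
	}
	return hex.EncodeToString(key), nil
}

// validateKeyHex (hypothetical helper) returns an error unless s decodes
// to exactly 32 bytes. It never echoes the key material in its errors.
func validateKeyHex(s string) error {
	raw, err := hex.DecodeString(s)
	if err != nil {
		return fmt.Errorf("key is not valid hex: %w", err)
	}
	if len(raw) != keySize {
		return fmt.Errorf("key is %d bytes, want %d", len(raw), keySize)
	}
	return nil
}
```

Running a check like validateKeyHex on a pasted value before storing it catches truncated copies, which would otherwise only surface when the service starts.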
2. Add it alongside the active key
The ENCRYPTION_KEYS env var carries every key the keyring should know about. Add the new version without removing the old one:
# Before — only V1 is loaded
ENCRYPTION_KEYS=1:<old_hex>
ACTIVE_ENCRYPTION_VERSION=1
# After step 2 — both versions loaded, V1 still active
ENCRYPTION_KEYS=1:<old_hex>,2:<new_hex>
ACTIVE_ENCRYPTION_VERSION=1
The keyring built by LoadInMemoryKeyringFromEnv decrypts blobs from any version listed in ENCRYPTION_KEYS; new encryptions still use the version pinned by ACTIVE_ENCRYPTION_VERSION.
Deploy. Verify the process started:
curl -s http://localhost:9000/health | jq .
3. Promote the new version to active
ENCRYPTION_KEYS=1:<old_hex>,2:<new_hex>
ACTIVE_ENCRYPTION_VERSION=2
Deploy. From this point on, every new Encrypt call seals under V2. Existing rows still carry the V1 version byte and decrypt fine because V1 is still in the keyring.
4. Re-encrypt existing rows (planned — Layer 2+)
The re-encryption tool will live at services/api/cmd/reencrypt/ and run as a one-shot job:
- Iterate every _encrypted BYTEA column registered in the tool's manifest.
- For each row whose blob[0] (the version byte) is not keyring.ActiveVersion(), decrypt → re-encrypt → UPDATE.
- Idempotent: a second run after success no-ops.
The tool ships when the first encrypted column carries production data (organization_billing.tax_id_encrypted is the canonical example today; future auth_secret columns will follow). Until then there is no production data sealed under any key version, so this step is a no-op.
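Since the tool does not exist yet, the following is only a sketch of the per-row decision it is described as making. rotateBlob is a hypothetical function, and the decrypt and encrypt parameters are stand-ins with assumed signatures for the package-level Decrypt and Encrypt; the database iteration and UPDATE are deliberately left out.

```go
package main

import "fmt"

// rotateBlob (hypothetical) returns the blob to store and whether it changed.
// A blob already on the active version is returned untouched, which is what
// makes a second run a no-op.
func rotateBlob(
	blob []byte,
	active byte,
	decrypt func([]byte) ([]byte, error),
	encrypt func([]byte) ([]byte, error),
) ([]byte, bool, error) {
	if len(blob) == 0 {
		return nil, false, fmt.Errorf("empty blob")
	}
	if blob[0] == active {
		return blob, false, nil // already on the active version
	}
	plaintext, err := decrypt(blob)
	if err != nil {
		return nil, false, fmt.Errorf("decrypt v%d blob: %w", blob[0], err)
	}
	fresh, err := encrypt(plaintext)
	if err != nil {
		return nil, false, fmt.Errorf("re-encrypt: %w", err)
	}
	// Guard against running with a keyring whose active version differs
	// from the one the caller expects.
	if len(fresh) == 0 || fresh[0] != active {
		return nil, false, fmt.Errorf("re-encrypted blob is not on v%d", active)
	}
	return fresh, true, nil
}
```

Returning the changed flag lets a caller skip the UPDATE for rows that are already current and report how many rows were actually rewritten.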
Verification SQL (once data exists):
SELECT count(*)
FROM organization_billing
WHERE tax_id_encrypted IS NOT NULL
AND get_byte(tax_id_encrypted, 0) <> 2; -- 0 ⇒ all rows are on V2
5. Remove the old key
After the soak period (24 h is the default; longer for irreversible operations), drop V1:
ENCRYPTION_KEYS=2:<new_hex>
ACTIVE_ENCRYPTION_VERSION=2
Deploy. Any blob still carrying \x01 as its version byte will fail Decrypt with ErrUnknownKeyVersion, so re-run the re-encryption tool first if the previous step's verification didn't return zero.
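One way to guard this step is a pre-flight comparison between the versions still present in the data and the versions the proposed ENCRYPTION_KEYS would load. The sketch below assumes you have already collected per-version row counts (for example from a query grouped on the version byte); missingVersions is a hypothetical helper, not an existing tool.

```go
package main

import "sort"

// missingVersions (hypothetical) returns, in ascending order, every version
// that still has rows but would have no key under the proposed keyring.
// An empty result means the old key can be dropped safely.
func missingVersions(rowsByVersion map[byte]int, proposed map[byte]bool) []byte {
	var missing []byte
	for v, n := range rowsByVersion {
		if n > 0 && !proposed[v] {
			missing = append(missing, v)
		}
	}
	sort.Slice(missing, func(i, j int) bool { return missing[i] < missing[j] })
	return missing
}
```

With three rows still on V1 and only V2 in the proposed keyring, the result contains version 1, which signals that re-encryption has to finish before the deploy.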
6. Audit-log the rotation
Today, audit rows are written by handler code via the internal/core/audit recorder. There is no handler that runs key rotation, so the only way to record it is to write a row directly against the AdminPool (acceptable because rotation is a deploy-adjacent operation):
-- Operator with platform_memberships role='superadmin' performs this against the AdminPool.
INSERT INTO audit_log (
id, organization_id, user_id,
action, entity_type, entity_id,
changes, action_context, created_at
) VALUES (
gen_random_uuid(),
NULL, -- platform-level event, no org scope
'<superadmin-user-uuid>',
'KEY_ROTATION',
'encryption_key',
NULL,
jsonb_build_object('old_version', 1, 'new_version', 2),
'compliance_maintenance',
NOW()
);
When the audit-log read API ships (Layer 1.13), the same row is queryable through GET /v1/audit-logs?action=KEY_ROTATION.
Emergency rotation (suspected key compromise)
Same procedure, compressed:
- Generate V<N+1>.
- Add it and immediately promote it active (skip the soak between steps 2 and 3 — both deploys back-to-back).
- Run re-encryption (when the tool exists).
- Remove the old key without waiting 24 h.
- Audit-log with action_context = 'security_incident' and a free-text incident ID.
- Review the audit log for entity_type = 'patient_profiles' (etc.) covering the suspected exposure window. Break-glass access pattern queries become available once the read API ships.
Rollback
If a rotation deploy regresses something, the old key is still loaded as long as you haven't run step 5. Revert ACTIVE_ENCRYPTION_VERSION to the previous version and redeploy. New encryptions go back to V1, and nothing needs re-encryption: any rows written under V2 during the window still decrypt because V2 stays in the keyring during rollback.
If you already removed the old key (step 5), rolling back means re-introducing it and also re-running the re-encryption tool with the old version as ACTIVE_ENCRYPTION_VERSION. Avoid being in this position by not running step 5 until you're confident.
Compliance pointers
- HIPAA §164.312(a)(2)(iv) — encryption at rest. AES-256-GCM with versioned keyring satisfies this; the periodic rotation cadence above satisfies §164.308(a)(8) evaluation.
- GDPR Article 32(1)(a) — encryption as a technical measure. Same helper, same cadence.
The encryption design (cached DEKs vs per-record envelope, version byte in the blob, no AAD in v1) is documented in decisions.md — read those before changing the wire format.
Related
- encryption.md: design and call-site usage of the crypto package
- decisions.md: why this shape and not envelope-per-record
- reference/rls-policies.md: orthogonal isolation layer; encryption protects bytes, RLS protects rows