Encryption Key Rotation Procedure

Operational runbook for rotating the application-level AES-256-GCM key used by services/api/internal/core/crypto/. Companion to encryption.md.

SQL is illustrative

SQL fragments in this document are examples meant to convey shape and intent — they're not authoritative reproductions of the production schema. The real migrations live in services/api/migrations/core/.

Status

The keyring infrastructure that makes online rotation possible is shipped today (1A.3): version-stamped ciphertext, multi-version Keyring interface, dual-decrypt during rotation, crypto.Init accepting either InMemoryKeyring (Phase 1, all envs) or the kmsKeyring stub (Phase 2 trigger).

Not yet shipped:

  • The real implementation of the direct crypto.kmsKeyring: Phase 2 work (per-data-key KMS calls + per-tenant key custody / BYOK). Phase 1 (current, all envs including production) runs InMemoryKeyring loaded from the KMS-envelope-protected restartix/{env}/encryption Secrets Manager secret, so the keyring is already KMS-rooted in production, just at the Secrets Manager (SM) envelope layer rather than per data key. See aws-infrastructure.md → Direct-KMS keyring + BYOK (Phase 2+).
  • The re-encryption tool — landing with the first encrypted column under live data (today only organization_billing.tax_id_encrypted carries any rows, and only post-1B). Until that exists, rotation is "deploy with both versions in ENCRYPTION_KEYS" and there is no data to re-encrypt.
  • Production observability (CloudWatch alarms for "no rotation in 95 days", PagerDuty escalation, automated rotation reminders). These land in Layer 12.

The procedure below describes the full target operation. Steps marked (planned) depend on the items above and cannot be executed today.

Frequency

  • Quarterly (every ~90 days) once production is live.
  • Immediate on suspected key compromise.
  • Before deprovisioning any human who held access to a key.

What you'll touch

| Where | What |
| --- | --- |
| services/api/internal/core/config/config.go | Reads ENCRYPTION_KEYS (CSV version:hex,version:hex) and ACTIVE_ENCRYPTION_VERSION (int). |
| services/api/internal/core/crypto/ | Init, Encrypt, Decrypt, Keyring, LoadInMemoryKeyringFromEnv, KMSKeyring (stub today). |
| .env.local (dev) / AWS Secrets Manager restartix/{env}/encryption (staging+prod, KMS-envelope-protected, CMK provisioned at 1E.3) | The keys themselves. |

There is no cmd/tools/reencrypt/ yet, no crypto.NewVersionedEncryptor, no crypto.NeedsReEncrypt. The real entry points are crypto.Init(keyring) at startup and the package-level Encrypt / Decrypt.

Procedure

1. Generate the new key

bash
openssl rand -hex 32
# → 64 hex chars; this is your new key V<N+1>.

Store it in the same place as the active key (Secrets Manager in staging/prod, password manager + .env.local in dev).
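
Before storing the key, a quick format check can catch a truncated or mangled copy-paste. The following is a minimal sketch, assuming a valid key is exactly 64 hex characters as produced by the command above; is_valid_key_hex is a hypothetical helper name, not something that exists in the repository.

```bash
# Hypothetical helper (not part of the repository).
# Succeeds only if the argument is exactly 64 hex characters (32 bytes).
is_valid_key_hex() {
  # Length check first: 32 bytes hex-encoded is 64 characters.
  [ "${#1}" -eq 64 ] || return 1
  # Reject any character outside 0-9, a-f, A-F.
  case "$1" in
    *[!0-9a-fA-F]*) return 1 ;;
  esac
  return 0
}

# Usage (never echo the key itself):
#   new_key="$(openssl rand -hex 32)"
#   is_valid_key_hex "$new_key" && echo "format OK"
```

The check only validates shape; it says nothing about how the key was generated or where it has been copied.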

2. Add it alongside the active key

The ENCRYPTION_KEYS env var carries every key the keyring should know about. Add the new version without removing the old one:

bash
# Before — only V1 is loaded
ENCRYPTION_KEYS=1:<old_hex>
ACTIVE_ENCRYPTION_VERSION=1

# After step 2 — both versions loaded, V1 still active
ENCRYPTION_KEYS=1:<old_hex>,2:<new_hex>
ACTIVE_ENCRYPTION_VERSION=1

The keyring built by LoadInMemoryKeyringFromEnv decrypts blobs from any version listed in ENCRYPTION_KEYS; new encryptions still use the version pinned by ACTIVE_ENCRYPTION_VERSION.
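
A pre-deploy sanity check on the two variables can catch the most likely operator mistake, which is pinning ACTIVE_ENCRYPTION_VERSION to a version that is not listed in ENCRYPTION_KEYS. The sketch below assumes the version:hex CSV format described above; check_keyring_env is a hypothetical helper, and its rules (the active version must be listed, no version may appear twice) are assumptions about what is sensible, not a description of what LoadInMemoryKeyringFromEnv actually validates.

```bash
# Hypothetical pre-deploy check (not part of the repository).
# $1 = ENCRYPTION_KEYS value, e.g. "1:<old_hex>,2:<new_hex>"
# $2 = ACTIVE_ENCRYPTION_VERSION value, e.g. "1"
check_keyring_env() {
  keys="$1"
  active="$2"
  seen=" "
  active_found=no
  # Split the CSV into one entry per line and walk the entries.
  while IFS= read -r entry; do
    [ -n "$entry" ] || continue
    case "$entry" in
      *:*) ;;
      *) echo "malformed entry (expected version:hex)" >&2; return 1 ;;
    esac
    version="${entry%%:*}"
    # Assumed rule: each version may be listed only once.
    case "$seen" in
      *" $version "*) echo "duplicate version: $version" >&2; return 1 ;;
    esac
    seen="$seen$version "
    if [ "$version" = "$active" ]; then
      active_found=yes
    fi
  done <<EOF
$(printf '%s\n' "$keys" | tr ',' '\n')
EOF
  if [ "$active_found" != yes ]; then
    echo "active version $active is not listed in ENCRYPTION_KEYS" >&2
    return 1
  fi
}

# Usage:
#   check_keyring_env "$ENCRYPTION_KEYS" "$ACTIVE_ENCRYPTION_VERSION" || exit 1
```

The same check applies unchanged before step 3 and before a rollback, since both only change which listed version is active.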

Deploy. Verify the process started:

bash
curl -s http://localhost:9000/health | jq .

3. Promote the new version to active

bash
ENCRYPTION_KEYS=1:<old_hex>,2:<new_hex>
ACTIVE_ENCRYPTION_VERSION=2

Deploy. From this point on, every new Encrypt call seals under V2. Existing rows still carry the V1 version byte and decrypt fine because V1 is still in the keyring.

4. Re-encrypt existing rows (planned — Layer 2+)

The re-encryption tool will live at services/api/cmd/reencrypt/ and run as a one-shot job:

  • Iterate every _encrypted BYTEA column registered in the tool's manifest.
  • For each row whose blob[0] (the version byte) is not keyring.ActiveVersion(), decrypt → re-encrypt → UPDATE.
  • Idempotent: a second run after success no-ops.

The tool ships when the first encrypted column carries production data (organization_billing.tax_id_encrypted is the canonical example today; future auth_secret columns will follow). Until then there is no production data sealed under any key version, so this step is a no-op.

Verification SQL (once data exists):

sql
SELECT count(*)
FROM organization_billing
WHERE tax_id_encrypted IS NOT NULL
  AND get_byte(tax_id_encrypted, 0) <> 2;        -- a count of 0 means every row is on V2
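
For spot-checking a single value outside the database, the same version-byte test can be sketched in shell. This assumes the blob was exported in PostgreSQL's hex bytea output format (a \x prefix followed by hex digits) and that the first byte is the key version, as described above. blob_key_version and blob_is_stale are hypothetical names, not functions in the repository or in the planned tool.

```bash
# Hypothetical helpers (not part of the repository or the planned tool).

# Prints the decimal key version stored in the first byte of a hex-encoded blob.
# Accepts "\x02a1b2..." (PostgreSQL hex output) or bare hex "02a1b2...".
blob_key_version() {
  case "$1" in
    '\x'*) hex="${1#??}" ;;   # strip the two-character \x prefix
    *)     hex="$1" ;;
  esac
  first_byte="$(printf '%s' "$hex" | cut -c1-2)"
  case "$first_byte" in
    [0-9a-fA-F][0-9a-fA-F]) printf '%d\n' "0x$first_byte" ;;
    *) echo "not a hex-encoded blob" >&2; return 1 ;;
  esac
}

# Succeeds if the blob was sealed under a version other than the active one.
# $1 = hex-encoded blob, $2 = active version (decimal)
blob_is_stale() {
  version="$(blob_key_version "$1")" || return 2
  [ "$version" -ne "$2" ]
}

# Usage:
#   blob_is_stale '\x01a1b2c3' 2 && echo "still on an old key version"
```

This reads only the version byte; it does not decrypt anything and does not need access to any key.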

5. Remove the old key

After the soak period (24 h is the default; longer for irreversible operations), drop V1:

bash
ENCRYPTION_KEYS=2:<new_hex>
ACTIVE_ENCRYPTION_VERSION=2

Deploy. Any blob still carrying \x01 as its version byte will fail Decrypt with ErrUnknownKeyVersion — re-run the re-encryption tool first if the previous step's verification didn't return zero.
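
Editing the CSV by hand is where a wrong version is most easily dropped. The sketch below derives the new ENCRYPTION_KEYS value from the old one and refuses to remove the active version. drop_key_version is a hypothetical helper, not part of the repository, and it assumes the version:hex CSV format shown above.

```bash
# Hypothetical helper (not part of the repository).
# Prints the ENCRYPTION_KEYS value with one version removed.
# $1 = current ENCRYPTION_KEYS, $2 = version to drop, $3 = active version
drop_key_version() {
  keys="$1"
  drop="$2"
  active="$3"
  # Dropping the active version would leave new encryptions with no key.
  if [ "$drop" = "$active" ]; then
    echo "refusing to drop the active version $active" >&2
    return 1
  fi
  result=""
  while IFS= read -r entry; do
    [ -n "$entry" ] || continue
    if [ "${entry%%:*}" = "$drop" ]; then
      continue
    fi
    # Re-join the surviving entries with commas.
    result="${result:+$result,}$entry"
  done <<EOF
$(printf '%s\n' "$keys" | tr ',' '\n')
EOF
  printf '%s\n' "$result"
}

# Usage:
#   drop_key_version "$ENCRYPTION_KEYS" 1 "$ACTIVE_ENCRYPTION_VERSION"
```

The guard covers only the active-version mistake; it cannot tell whether rows sealed under the dropped version still exist, so the verification query from step 4 remains the gate for this step.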

6. Audit-log the rotation

Today, audit rows are written by handler code via the internal/core/audit recorder. There is no handler that runs key rotation, so the only way to record it is to write a row directly against the AdminPool (acceptable because rotation is a deploy-adjacent operation):

sql
-- Operator with platform_memberships role='superadmin' performs this against the AdminPool.
INSERT INTO audit_log (
    id, organization_id, user_id,
    action, entity_type, entity_id,
    changes, action_context, created_at
) VALUES (
    gen_random_uuid(),
    NULL,                   -- platform-level event, no org scope
    '<superadmin-user-uuid>',
    'KEY_ROTATION',
    'encryption_key',
    NULL,
    jsonb_build_object('old_version', 1, 'new_version', 2),
    'compliance_maintenance',
    NOW()
);

When the audit-log read API ships (Layer 1.13), the same row is queryable through GET /v1/audit-logs?action=KEY_ROTATION.

Emergency rotation (suspected key compromise)

Same procedure, compressed:

  1. Generate V<N+1>.
  2. Add it and immediately promote it active (skip the soak between steps 2 and 3 — both deploys back-to-back).
  3. Run re-encryption (when the tool exists).
  4. Remove the old key without waiting 24 h.
  5. Audit-log with action_context = 'security_incident' and a free-text incident ID.
  6. Review the audit log for entity_type = 'patient_profiles' (etc.) covering the suspected exposure window — break-glass access pattern queries become available once the read API ships.

Rollback

If a rotation deploy regresses something, the old key is still loaded as long as you haven't run step 5 yet. Revert ACTIVE_ENCRYPTION_VERSION to the previous version and redeploy. New encryptions go back to V1, and nothing needs re-encryption: rows written under V2 during the short window still decrypt because V2 stays in the keyring during rollback.

If you already removed the old key (step 5), rolling back means re-introducing it and also re-running the re-encryption tool with the old version as ACTIVE_ENCRYPTION_VERSION. Avoid being in this position by not running step 5 until you're confident.

Compliance pointers

  • HIPAA §164.312(a)(2)(iv) — encryption at rest. AES-256-GCM with versioned keyring satisfies this; the periodic rotation cadence above satisfies §164.308(a)(8) evaluation.
  • GDPR Article 32(1)(a) — encryption as a technical measure. Same helper, same cadence.

The encryption design (cached DEKs vs per-record envelope, version byte in the blob, no AAD in v1) is documented in decisions.md; read those decisions before changing the wire format.