
Cross-Cutting Patterns Catalog

Every cross-cutting pattern that any feature in apps/docs/features/ depends on. This is the catalog: what each pattern is, where it applies, what schema/middleware/code it needs.

Why this exists. A phased plan that sequences features without first cataloging the patterns those features share leads to schema rewrites — every late-arriving feature surfaces a pattern that should have landed earlier. This doc fixes that by listing every pattern up front so the implementation order can be derived from dependencies, not from feature-name ordering.

How to read. Each pattern has: what it is, where it applies (which features need it), what infrastructure it requires, and what foundation work depends on it. Use this together with data-model.md (entities) and dependency-map.md (build order).

SQL is illustrative

SQL fragments in this document are examples meant to convey shape and intent — they're not authoritative reproductions of the production schema. The real migrations live in services/api/migrations/core/.


Pattern Index

Authorization & Isolation

Identity Model

Compliance & Audit

Data Modeling

Infrastructure

Operational

Commercial Model

Data Egress

Tenancy Sub-Structure

Frontend Performance

Capability & Integration Architecture

Media Pipelines


Authorization & Isolation

P1: Multi-Tenancy + RLS

What. Every tenant-scoped table carries organization_id UUID NOT NULL REFERENCES organizations(id) and enables RLS with policies that compare organization_id = current_app_org_id(). Data isolation is enforced at the database, not the application.

Where it applies. Every table containing tenant data — i.e., everything except platform-level tables (organizations, principals, humans, platform_memberships, permissions, system role templates in roles, patient_profiles, patient_caregivers).

Required infrastructure.

  • current_app_org_id() RETURNS UUID — reads app.current_org_id session var
  • current_app_principal_id() RETURNS UUID — reads app.current_principal_id
  • current_app_principal_type() RETURNS TEXT — reads app.current_actor_type
  • current_app_role() RETURNS TEXT — reads app.current_role
  • current_app_has_permission(resource TEXT, action TEXT) RETURNS BOOLEAN
  • Middleware that sets these session vars per request inside the connection's transaction
  • Index idx_{table}_org ON {table}(organization_id) on every tenant table

Foundation status. Implemented for the tables that exist (orgs, principals, humans, organization_memberships, roles, role_permissions, permissions, platform_memberships, audit_log, organization_domains).

Pre-Phase-3 requirement. RLS integration tests (already in 2.11) — every new tenant table must inherit this pattern verbatim.


P2: Two-Pool Database Architecture

What. Two pgxpool instances:

  • AdminPool — owns the schema, bypasses all RLS. Used only when superadmins act platform-wide, or when the system needs to run cross-tenant queries (org creation, audit archival, global content management, anonymised cross-tenant aggregate counters in Console).
  • AppPool — restricted role (restartix_app), RLS always enforced. Default for all tenant-scoped traffic.

Where it applies. Every request. Middleware decides which pool based on is_superadmin and the endpoint's nature.

Required infrastructure.

  • restartix_app PostgreSQL role with ALTER DEFAULT PRIVILEGES
  • OrganizationContext middleware that picks the pool, acquires the connection, and sets the three session vars in a transaction-scoped set_config(..., true) call
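
A minimal sketch of that step, assuming pgx v5; the function name and the split between pool selection and session-var binding are illustrative, not the production API. The load-bearing detail is the third set_config argument: true makes each setting transaction-local, so a pooled connection never carries one tenant's identity into the next request.

```go
import (
	"context"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
)

// beginWithOrgContext distills what OrganizationContext does after resolving
// the caller: open the request transaction on the chosen pool and bind the
// session vars to that transaction only.
func beginWithOrgContext(ctx context.Context, pool *pgxpool.Pool,
	orgID, principalID, role string) (pgx.Tx, error) {
	tx, err := pool.Begin(ctx)
	if err != nil {
		return nil, err
	}
	// is_local = true: the vars vanish at COMMIT/ROLLBACK, which is what the
	// RLS helpers (current_app_org_id() and friends) read downstream.
	if _, err := tx.Exec(ctx,
		`SELECT set_config('app.current_org_id', $1, true),
		        set_config('app.current_principal_id', $2, true),
		        set_config('app.current_role', $3, true)`,
		orgID, principalID, role); err != nil {
		_ = tx.Rollback(ctx)
		return nil, err
	}
	return tx, nil // handed to repos via the request context (TxFromContext)
}
```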

Foundation status. Implemented.

Implication for new code. Repos must always use TxFromContext(ctx) — never grab raw pool connections. Going around the middleware skips both RLS and the audit log hook (P10).


P3: Per-Org Permission RBAC

What. Authorization is per-org permission codes, not global role enums. Four authorization layers:

| Layer | Stored in | Managed by |
| --- | --- | --- |
| 1. Permission catalog | permissions (resource, action) | Migrations only |
| 2. System role templates | roles where is_system=TRUE AND organization_id IS NULL | Console |
| 3. Per-org cloned system roles | roles where is_system=TRUE AND organization_id=<org> | Clinic admin |
| 4. Per-org custom roles | roles where is_system=FALSE AND organization_id=<org> | Clinic admin |

Where it applies. Every endpoint, every RLS policy on tenant tables (current_app_has_permission(resource, action) in WITH CHECK and USING).

Required infrastructure.

  • Tables: permissions, roles, role_permissions, organization_memberships.role_id, platform_memberships
  • Middleware: RequirePermission(code) (canonical), RequireSuperadmin() (cross-tenant)
  • DB invariants: clear_organization_memberships_on_superadmin_grant trigger (one-hat rule); principal_is_human(uuid) CHECK on platform_memberships (superadmin-is-human invariant)
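
A sketch of the canonical middleware, assuming the P2 request transaction is already in the context. Note one deliberate divergence: the production signature takes a single permission code, split here into resource and action to feed the SQL helper directly.

```go
import "net/http"

// RequirePermission gates a route on a per-org permission. Sketch only;
// TxFromContext is the request-transaction accessor named in P2.
func RequirePermission(resource, action string) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			tx := TxFromContext(r.Context())
			var allowed bool
			err := tx.QueryRow(r.Context(),
				`SELECT current_app_has_permission($1, $2)`,
				resource, action).Scan(&allowed)
			if err != nil || !allowed {
				http.Error(w, "forbidden", http.StatusForbidden) // also audited (P10)
				return
			}
			next.ServeHTTP(w, r)
		})
	}
}
```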

Foundation status. Implemented; reserved permissions for Phase 2 only. Each new feature seeds its own permissions via migration.

Implication for every feature. Run /new-domain and /new-migration skills — they enforce the seeding + RLS gating pattern.


P4: Three RLS Scoping Variants

What. RLS isn't always organization_id = current_app_org_id(). Three variants exist:

  1. Org-scoped (default): WHERE organization_id = current_app_org_id(). Most tables.
  2. Human-scoped via patient_profile helper: WHERE patient_profile_id IN (SELECT current_human_patient_profile_ids()). Used for patient_profiles, patient_caregivers, and any table where a patient (a human) acts on their own data including dependents they manage.
  3. Dual-scope (global + org): WHERE organization_id IS NULL OR organization_id = current_app_org_id(). Used for platform-curated content visible to all orgs (exercises, exercise taxonomy, treatment_plans with global scope).

Required infrastructure.

  • current_human_patient_profile_ids() RETURNS SETOF UUID — returns the union of (patient_profiles rows where human_id = current_app_principal_id()) and (patient_profiles rows reachable via patient_caregivers). Patient access is human-only; agents and service accounts do not act as patients.

Where it applies.

  • Variant 2: patient_profiles, patient_caregivers, and any patient-facing read on appointments/forms/treatment_plans/patient_service_plans/etc. uses this helper as part of the SELECT policy.
  • Variant 3: exercises, exercise_categories, exercise_body_regions, exercise_equipment, treatment_plans (when organization_id IS NULL).

Foundation status. Variant 1 implemented. Variant 2 helper implemented at Layer 2.1 (current_human_patient_profile_ids() in services/api/migrations/core/000006_patient_identity.up.sql, with current_human_is_patient_at(org) as the org-membership equivalent). Variant 3 pattern not implemented (must land before exercise library / global treatment plans).


P5: Public-Resolve Endpoints

What. Endpoints that must work without authentication (organization_domains lookup by slug or domain, public booking availability, and POST /v1/public/bookings). They run on AdminPool — owner role, RLS bypassed — with the handler doing a tightly-scoped query (single-row equality match on slug or verified domain, allow-listed columns only). AppPool is not an option here: with no session vars set, it sees zero rows on every authenticated table, and there are no current_app_role() IS NULL carve-out policies.

Where it applies.

  • organizations + organization_domains — GET /v1/public/organizations/resolve looks up an org by slug or verified domain. The handler runs the lookup directly via AdminPool with allow-listed columns (name, slug, branding); RLS is not the gate.
  • services and calendars — public-read for the booking flow (when it ships) follows the same pattern: AdminPool query, handler-side column allow-listing, no public RLS policy.
  • appointments — public INSERT path for guest bookings (when it ships) runs via AdminPool with handler-validated input.

Required infrastructure.

  • Public routes skip the organization-scope middleware entirely
  • Each feature with public surface owns the column-allow-list in the handler — egress filtering at the application layer, not via RLS
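
A sketch of the handler shape, with hypothetical handler and type names; the point is that the column allow-list is a visible, reviewable line of code.

```go
import (
	"encoding/json"
	"net/http"
)

// ResolveOrganization: AdminPool, single-row equality match, explicit column
// allow-list. Names here are illustrative.
func (h *PublicHandler) ResolveOrganization(w http.ResponseWriter, r *http.Request) {
	slug := r.URL.Query().Get("slug")

	var resp struct {
		Name     string          `json:"name"`
		Slug     string          `json:"slug"`
		Branding json.RawMessage `json:"branding"`
	}
	// Only these columns ever cross the public boundary. Adding one is a
	// code-review event, unlike widening an RLS row policy.
	err := h.adminPool.QueryRow(r.Context(),
		`SELECT name, slug, branding FROM organizations WHERE slug = $1`,
		slug).Scan(&resp.Name, &resp.Slug, &resp.Branding)
	if err != nil {
		http.Error(w, "not found", http.StatusNotFound)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(resp)
}
```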

Why not RLS public policies? The previous design used current_app_role() IS NULL policies on AppPool with no session vars set. The implementation kept tripping on "what columns are safe to return on the public path?" — RLS gates row visibility, but every column on the row is then exposed. Moving public reads to AdminPool with handler-side column selection makes the egress shape explicit at code-review time. See data-classification.md and the comment block under organizations_select in 000002_tenancy_rbac.up.sql.

Foundation status. Implemented for organizations + organization_domains. The pattern is documented and reusable.


Identity Model

P6: Patient Profile Portability

What. A "patient profile" is a real-world human receiving care. Separated from humans (auth identity) so:

  • Demographics and universal health facts (DOB, blood_type, allergies, chronic_conditions, insurance_entries) are owned by the patient and travel with them across orgs.
  • Account-less patients are first-class — patient_profiles.human_id can be NULL.

Tables.

  • patient_profiles (no organization_id — RLS via P4 variant 2). human_id FK to humans(principal_id).
  • patients (per-org link: organization_id, patient_profile_id, consumer_id, profile_shared, last_used_at, deleted_at)

Where it applies. Anywhere a clinical record references "the patient" — appointments, forms, treatment_plans — the FK is to patient_profiles.id, not humans(principal_id) and not patients.id. (Some tables also keep patients.id for org-scoped business records like patient_service_plans.)

Required infrastructure.

  • current_human_patient_profile_ids() helper (P4)

Foundation status. Implemented at Layer 2.1/2.2 in services/api/migrations/core/000006_patient_identity.up.sql and internal/core/domain/patientprofiles/ + internal/core/domain/patients/.


P7: Caregiver / Account-less Patients

What. A human can act on behalf of multiple patient_profiles. Example: daughter (humans row, principal_id 42) manages bookings for her elderly father who has no email and no account (patient_profiles.human_id = NULL). When the father later creates an account, set patient_profiles.human_id = father_principal_id — all clinical history is preserved because everything references patient_profile_id, not human_id.

Tables. patient_caregivers (patient_profile_id, caregiver_human_id FK to humans(principal_id), relationship enum: self | parent | child | spouse | sibling | caregiver | other).

Where it applies. RLS on every patient-facing table uses current_human_patient_profile_ids() which UNIONs:

  1. patient_profiles rows where human_id = current_app_principal_id()
  2. patient_caregivers.patient_profile_id rows where caregiver_human_id = current_app_principal_id()

The function is human-only by design — agents and service accounts do not act as patients or caregivers.

Foundation status. Implemented at Layer 2.1 in services/api/migrations/core/000006_patient_identity.up.sql.


P8: Profile-Sharing Consent

What. When patients.profile_shared = FALSE (default), org staff see only patient_profiles.name. Field-level filtering at the application layer hides DOB, blood type, allergies, insurance from staff. When the patient signs the profile-sharing consent form, the API flips profile_shared = TRUE and staff see the full portable profile. The patient themselves always sees their own full profile.

profile_sharing is an org-scope, legal_basis = 'consent' (withdrawable) Tier B purpose in the unified consent ledger (P17). It registers in F3.5 and writes ledger rows with source = 'form' whose source_form_id FKs the signed F3 form instance.

Where it applies. Patient profile API responses to staff. Forms with profile_field_key (which read/write patient_profiles columns) remain available to the patient and their caregivers, but staff see only masked values until consent is granted.

Required infrastructure.

  • The consent form itself is just a regular signed form (P14 immutability + P10 audit log) with a known template
  • Automation rule (P28) creates the form during patient onboarding
  • A handler that listens for form.signed on the right template, inserts the consents ledger row (purpose_code = 'profile_sharing', source = 'form'), and flips patients.profile_shared
  • Withdrawal flips patients.profile_shared back to FALSE at the same clinic

Foundation status. Schema present (patients.profile_shared exists in spec). Ledger plumbing arrives with 1B.9; the form-driven hook arrives with F3.5. Application enforcement and automation not implemented.


P9: Specialist ↔ Human Linkage

What. specialists is a per-org table for healthcare providers. specialists.human_id is UNIQUE and nullable (FK to humans(principal_id)): a specialist can have a system login (set), or be a "calendar-only" entity managed by admin (NULL). When set, the human has role = specialist in the org and can access their schedule + patient appointments. Specialists are humans by domain definition; the FK target enforces this.

Where it applies. RLS on tables filtered by "specialist sees their own appointments": WHERE specialist_id IN (SELECT id FROM specialists WHERE human_id = current_app_principal_id()).

Foundation status. Designed. Not implemented.


Compliance & Audit

P10: Audit Logging

What. Every mutation (CREATE/UPDATE/DELETE) writes a synchronous row to audit_log. Failed requests (401/403/500) also log. The write is append-only — RLS on audit_log has no UPDATE/DELETE policies. audit_log is the single source of truth for compliance audit; there is no telemetry forwarding (the earlier P32 design that forwarded audit_log rows to a separate Telemetry compliance store has been rejected — see decisions.md → Why telemetry is PG + S3, not ClickHouse).

Schema. audit_log(id, organization_id, actor_id, actor_type, action, entity_type, entity_id, changes JSONB, ip_address INET, user_agent, request_path, status_code, action_context, break_glass_id, impersonation_id, request_id, created_at). actor_id FKs to principals(id); actor_type is denormalized for at-a-glance reads. Indexes on org+entity+time, actor+time, context, break_glass_id, impersonation_id, request_id, status+time.

AI provenance is a sibling table. audit_ai_provenance(audit_log_id PK FK → audit_log, model_version, inputs_hash, confidence) — one row per audit event that involved an AI model; purely human actions have no row here. Split out (per Layer 1.26) so AI-feature schema churn — prompt versions, tool-call inventories, model-output rationales — doesn't pollute the core audit table's compliance contract.

Where it applies. Every authenticated handler. Every mutation. Failed auth attempts.

Required infrastructure.

  • Audit middleware that wraps the request, captures method/path/status, computes before/after diff for UPDATE/DELETE, masks sensitive fields (P11), inserts the row before the response is returned
  • Three-tier retention job: hot (0-12 mo Postgres), warm (12 mo-6 yr S3 JSONL.gz), purge (6 yr+). Special cases: break-glass never deleted; GDPR-operation entries 7 yr; key rotation permanent.

Foundation status. Implemented at Layer 1.1. Table in services/api/migrations/core/000001_init.up.sql; recorder in services/api/internal/core/audit/; HTTP middleware writes one row per mutation + every 5xx + every 403 + Bearer-tokened 401s.

What audit_log does NOT capture. Two carve-outs:

  • Operational-metadata bumps (per CLAUDE.md). humans.last_activity, organization_memberships.last_used_at, patients.last_used_at activity tracking, and the universal set_updated_at trigger that bumps updated_at on every UPDATE — these are presence signals, not state transitions, and would generate noise without forensic value.
  • Notification dispatcher sends. The notifications table is the canonical record of what was sent to whom (recipient, channel, template, status, sent_at, retry counts) at strictly higher fidelity than an audit row would carry. audit_log captures the user/system action that queues the send (org-create, specialist clicks "resend", scheduled reminder run) — the dispatcher transitioning a row from pending → sent is operational mechanics, not a security event. Carve-out applies to transactional and marketing dispatches alike; Article 30 records-of-processing is satisfied by the notifications table itself. The single exception: an admin "send arbitrary message" surface (operator types a custom email and hits send) audits that operator action at the queueing point.

P11: Sensitive Data Redaction in Logs

What. Before writing to audit_log.changes or any structured log, mask values for keys matching: password, secret, token, api_key, apikey, authorization, cookie, session. Replace with [REDACTED].

Where it applies. Audit middleware (P10), structured slog handler. (Earlier "telemetry forwarding" applicability removed — telemetry no longer ingests audit_log rows.)

Required infrastructure.

  • Slog handler with key-pattern redaction (redact.SlogReplaceAttr from internal/shared/redact).
  • Audit helper that walks the changes diff and redacts before insert (internal/core/audit/redact.go, sharing the same key list).
  • Test that injects all sensitive keys and asserts redaction.
  • Pseudonymization helper (internal/shared/pseudonym.UserID) that SHA-256-hashes user UUIDs for any cross-tenant aggregate egress that needs it. Note: the Layer 2 Telemetry pipeline (P32) does NOT pseudonymize — readers are clinic-scoped. The helper is retained for any future genuinely cross-tenant aggregate surface.
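
A sketch of the slog hook in the spirit of redact.SlogReplaceAttr. The real handler's matching rules (exact vs. substring) live in internal/shared/redact, so treat the contains-based match below as an assumption.

```go
import (
	"log/slog"
	"strings"
)

var sensitiveKeys = []string{
	"password", "secret", "token", "api_key", "apikey",
	"authorization", "cookie", "session",
}

// replaceAttr plugs into slog.HandlerOptions.ReplaceAttr and masks any
// attribute whose key matches the sensitive-key list.
func replaceAttr(groups []string, a slog.Attr) slog.Attr {
	key := strings.ToLower(a.Key)
	for _, s := range sensitiveKeys {
		if strings.Contains(key, s) {
			a.Value = slog.StringValue("[REDACTED]")
			break
		}
	}
	return a
}

// Wiring:
//   logger := slog.New(slog.NewJSONHandler(os.Stdout,
//       &slog.HandlerOptions{ReplaceAttr: replaceAttr}))
```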

What NOT to log.

  • Raw PII values. Names, emails, phone numbers, dates of birth, addresses, IP addresses (beyond the dedicated audit_log.ip_address column), free-text patient input — even when their map key is innocuous. Log the ID instead and let auditors join.
  • Anything from a request body unrolled as structured fields. The audit pipeline owns the field-level diff and redacts it; ordinary log lines stick to IDs and error envelopes.
  • Encrypted ciphertext (BYTEA) or encryption keys, key versions, KMS responses.
  • Patient_person IDs in any log destined for a cross-tenant aggregate egress surface — pseudonymize via pseudonym.UserID first. (Telemetry's own ingest path is clinic-scoped and stores principal_id plain — see decisions.md → Why telemetry is PG + S3, not ClickHouse.)

If a value is not on the sensitive-key list but is still PII (e.g. a key called patient_email), the redactor will not catch it — it's the call site's responsibility not to log the value at all. This is enforced at PR review and via the audit-data-leaks skill sweep, not at runtime.

Foundation status. Implemented in services/api/internal/shared/redact/ (slog handler + tests covering every documented pattern), services/api/internal/core/audit/redact.go (audit JSONB walker), and services/api/internal/shared/pseudonym/ (cross-tenant aggregate-egress helper). Closed in Layer 1.5.


P12: Field-Level Encryption

What. Column-level AES-256-GCM encryption at the application layer for the narrow set of columns that earn it: credential material (auth_secret) and regulated identifiers (pii_regulated). Everything else is plaintext + layered defense (RLS + audit + at-rest disk encryption + encrypted backups + restricted DB access). Encryption key in AWS KMS in production, local key file in dev. Encrypt-on-write / decrypt-on-read at the repository layer (handlers see plaintext). Each ciphertext stamped with key version for rotation.

Rule (mechanically enforced). A column whose name ends in _encrypted MUST be BYTEA AND classified pii_regulated or auth_secret. A column classified pii_regulated MUST be BYTEA AND named *_encrypted. A column named *_hash whose class is auth_secret MUST be BYTEA. Enforced by services/api/cmd/check-classification — see decisions.md → Why most PII is plaintext.

Encrypted columns today.

  • organization_billing.tax_id_encrypted BYTEA (CUI for RO clinics — pii_regulated)
  • Future credentials and signing secrets (auth_secret) when integrations / outbound webhooks ship.

Plaintext columns (rely on the layered envelope). Patient phone, emergency-contact phone, names, emails, dates of birth, addresses, insurance entries, allergies, diagnoses, prescriptions, treatment notes. Industry-default in EHRs; protected by RLS, audit, at-rest disk encryption, encrypted backups, restricted DB access.

Where it applies. National identifiers and credential material across the whole platform. Column-level encryption for other PII columns is no longer the pattern — they get their protection from the layered envelope.

Required infrastructure.

  • internal/core/crypto/ package with Encrypt(plaintext) -> blob and Decrypt(blob) -> plaintext. The version travels inside the wire blob — [1-byte version][12-byte nonce][ciphertext+tag] — so the schema only needs one BYTEA column per encrypted field.
  • Multi-version Keyring interface so rotation is supported from day one (re-encrypt path + dual-decrypt window).
  • KMS-rooted protection on the keyring source. Phase 1 (current): the ENCRYPTION_KEYS value lives in the restartix/{env}/encryption Secrets Manager secret enveloped under a customer-managed KMS CMK, so the SM fetch at boot transparently goes through KMS. Phase 2 (deferred): direct kmsKeyring with per-data-key KMS calls + per-tenant key custody (BYOK).
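
A sketch of the encrypt half under the stated wire layout; the production helper in internal/core/crypto/ additionally routes the version byte through the Keyring, so key and version selection here are simplified.

```go
import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
)

// encrypt produces [1-byte version][12-byte nonce][ciphertext+tag].
func encrypt(key []byte, version byte, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key) // 32-byte key => AES-256
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize()) // 12 bytes for standard GCM
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	blob := append([]byte{version}, nonce...)
	return gcm.Seal(blob, nonce, plaintext, nil), nil // appends ct+tag to blob
}

// Decrypt reads blob[0] to pick the key version from the keyring, then opens
// with blob[1:13] as the nonce. That version byte is what makes rotation's
// dual-decrypt window work without extra schema columns.
```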

Foundation status. Implemented in services/api/internal/core/crypto/. Helper, in-memory keyring, multi-version support, tests, and config wiring landed in 1A.3. The customer-managed CMK that envelopes the SM-stored keyring is provisioned at 1E.3. The direct kmsKeyring is stubbed; real per-data-key KMS resolution is Phase 2. See reference/encryption.md for the full operator and developer guide.


P13: Soft Delete

What. Medical records are never hard-deleted. Add deleted_at TIMESTAMPTZ NULL to clinical tables. Repos default-filter WHERE deleted_at IS NULL. RLS policies optionally exclude soft-deleted rows unless caller has data.view_deleted permission.

Tables that need it (per specs).

  • appointments.deleted_at
  • patients.deleted_at
  • specialists.deleted_at
  • exercises.deleted_at
  • treatment_plans.deleted_at
  • (likely others as features ship)

Tables that do NOT need it (configuration data).

  • services, products, calendars, custom_fields, form_templates, pdf_templates, automation_rules, webhook_subscriptions, segments — these are admin-editable config; hard delete is fine.

Where it applies. Any table holding clinical PHI or that shows up in audit/legal review.

Required infrastructure.

  • Repository convention: every Get/List query filters by deleted_at IS NULL via softdelete.ActiveFilter. "Include deleted" lives in a separate repo method, never an optional bool argument.
  • softdelete.SoftDelete(ctx, table, id) performs the UPDATE. Hard-DELETE is forbidden by the absence of a DELETE policy on soft-deletable tables.
  • /new-migration skill template covers the column, partial index, and the soft-delete RLS variant.
  • Partial index pattern: CREATE INDEX idx_{table}_active ON {table}(organization_id) WHERE deleted_at IS NULL.
  • data.view_deleted permission gates the RLS SELECT carve-out for admin recovery.
  • GDPR anonymization sketched at internal/core/gdpr/Anonymize (column-level overwrite). Full orchestrator (cross-table erasure, audit logging, retention exemptions) lands in Layer 12.
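
A sketch of the two repo conventions. PatientRepo, Patient, and TxFromContext are assumed names; the real helpers are softdelete.ActiveFilter and softdelete.SoftDelete, whose exact signatures may differ.

```go
import (
	"context"

	"github.com/google/uuid"
)

// Get filters soft-deleted rows by default; "include deleted" lives on a
// separate method gated by data.view_deleted, never a bool argument here.
func (r *PatientRepo) Get(ctx context.Context, id uuid.UUID) (*Patient, error) {
	var p Patient
	err := TxFromContext(ctx).QueryRow(ctx,
		`SELECT id, patient_profile_id FROM patients
		  WHERE id = $1 AND deleted_at IS NULL`, // softdelete.ActiveFilter
		id).Scan(&p.ID, &p.PatientProfileID)
	if err != nil {
		return nil, err
	}
	return &p, nil
}

// Delete is an UPDATE; a hard DELETE fails anyway because the table carries
// no DELETE policy.
func (r *PatientRepo) Delete(ctx context.Context, id uuid.UUID) error {
	_, err := TxFromContext(ctx).Exec(ctx,
		`UPDATE patients SET deleted_at = now()
		  WHERE id = $1 AND deleted_at IS NULL`, id)
	return err
}
```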

Foundation status. Implemented in services/api/internal/shared/softdelete/ and services/api/internal/core/gdpr/. Helper, RLS variant, data.view_deleted permission seed, and /new-migration + /new-domain template updates landed in Layer 1.4. First column lands when Layer 2 ships patients.deleted_at.


P14: Append-Only / Immutability

What. Two distinct sub-patterns:

14a. Append-only tables. No UPDATE or DELETE policies. Once inserted, the row is permanent.

  • audit_log (RLS enforces)
  • break_glass_sessions (UPDATE only on the close path: closed_at IS NULL → NOT NULL; the row itself is otherwise immutable. DELETE policy absent)
  • custom_field_versions, form_template_versions, pdf_template_versions, treatment_plan_versions, segment_versions — version history is immutable
  • automation_executions — execution audit trail
  • webhook_events — delivery log

14b. Immutable-after-state-transition. A row becomes immutable once a status flag is set.

  • forms — once status = 'signed', all UPDATE/DELETE return 409 Conflict (enforced at handler layer; RLS still allows the row to be selected)
  • Signed PDFs — embedded signature is part of the file; the row stays mutable for metadata, but the file URL doesn't change

Where it applies. Compliance / legal-evidence rows (audit, signatures, version history).

Required infrastructure.

  • 14a: RLS policies that omit UPDATE/DELETE
  • 14b: Handler-level guard returning 409 Conflict; service-level guard preventing mutation in repos
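
A sketch of the 14b guard; handler and repo names are hypothetical, the error-body shape illustrative, and the routing assumes Go 1.22 net/http.

```go
import "net/http"

// Update refuses to touch a signed form before any write is attempted. The
// same check is mirrored in the service layer so no path around the handler
// can mutate a signed row.
func (h *FormHandler) Update(w http.ResponseWriter, r *http.Request) {
	form, err := h.repo.Get(r.Context(), r.PathValue("id"))
	if err != nil {
		http.Error(w, "not found", http.StatusNotFound)
		return
	}
	if form.Status == "signed" {
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusConflict) // 409: signed forms are immutable
		w.Write([]byte(`{"error":"form_signed_immutable"}`))
		return
	}
	// ... normal update path ...
}
```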

P15: Break-Glass Platform Elevation

What. Time-bound, audited, narrow-scope grant for platform staff (superadmin / support_engineer) to access identifiable cross-tenant patient data. The processor boundary is the default; this pattern is the documented exception path. Every audit_log row written inside an elevated session carries break_glass_id = <session.id> + action_context = 'break_glass'. Reads are also logged (unlike normal requests) on every protected route — the gate's whole point.

The earlier draft of this pattern called it "Break-Glass Emergency Access" with a clinical-staff dual-approval shape. That conflated two different concerns: clinical "I need to see Patient X right now even though I'm not their assigned specialist" (which is a clinic-internal access pattern, see P16 impersonation) vs. platform "I need to see this clinic's data to investigate a support ticket" (which is a controller/processor boundary concern). Foundation 1B.11 lands the latter; the former is not a foundation primitive — clinics manage staff access to their own patients via per-org RBAC, not break-glass.

Schema. break_glass_sessions(id, principal_id, organization_id, scope, reason_category, reason_text, reason_ref NULL, opened_at, expires_at, closed_at, closed_by_principal_id NULL). scope IN ('patient_list', 'patient_detail', 'audit_full', 'cross_org_lookup', 'org_management') — granular by design; patient_list does NOT cover patient_detail, the principal must re-elevate. org_management is added in 1B.11.x for Console writes against the clinic's HR surface — see P55. reason_category IN ('support_ticket', 'security_incident', 'dsar_routing', 'fraud_investigation', 'platform_engineering'). CHECK length(btrim(reason_text)) >= 10 AND expires_at <= opened_at + INTERVAL '4 hours'. Default expiry 1h, max 4h.

Active-session uniqueness. Partial unique (principal_id, organization_id, scope) WHERE closed_at IS NULL — same-(principal, org, scope) re-elevation returns the existing row instead of inserting a duplicate. Catches double-clicked elevation modals without spamming audit / notifications.

Lazy expiry finalize. A row with closed_at IS NULL AND expires_at < NOW() is closed on the admin pool the first time Service.ActiveFor reads it — closed_at = expires_at, closed_by_principal_id = NULL (system-finalized at the natural-end moment). Keeps the unique index honest without requiring a sweeper cron.

Authorization. Pure-Go platform permissions (principal/platform_permissions.go) — PlatformPermBreakGlass{PatientList,PatientDetail,AuditFull,CrossOrgLookup,Manage}. Superadmin holds everything via IsSuperadmin; support_engineer holds patient list + detail + audit full (cross-org lookup + manage stay superadmin-only). DB-side: REVOKE INSERT/UPDATE/DELETE on break_glass_sessions from restartix_app (mirrors audit_log + notifications) — service-layer permission check is the load-bearing gate, REVOKE is the floor.

Audit attribution. Middleware RequireBreakGlass(scope) calls set_app_break_glass_session_id(<session.id>) after match; audit_log_insert reads the GUC into audit_log.break_glass_id. Every audit row written downstream — handler-emitted AND trigger-cascaded — picks up the linkage automatically with no per-call-site plumbing.

Notification. Always-on email + (future) in-app banner to every clinic admin at the target org when a session opens. Sent via 1A.18 notify.Send with idempotency key <session_id>:<admin_principal_id> so retries dedup at the notification layer. The session row is the artifact; if email transport hiccups, the dispatcher retries and the Clinic admin banner reads break_glass_sessions directly.

Foundation status. Shipped in Foundation 1B.11. The RequireBreakGlass(scope) middleware, elevation/close/list endpoints, audit attribution, and notification fan-out all closed. Console route gating + Clinic admin banner UI light up alongside their respective surfaces in 1C.1 / 1C.2.


P16: Patient Impersonation (Clinic-Internal)

What. Time-bound, audited, clinic-internal grant for clinic staff to act on a patient's behalf — assisted form fill, accessibility help, language barriers, troubleshooting. Per-clinic counterpart to break-glass (P15). Lives entirely within one clinic's controllership scope; this is not a controller/processor concern — clinics trust staff they've hired, the audit + transparency mechanism is what makes staff actions on patient data reviewable. Every audit_log row written inside an active session carries impersonation_id = <session.id> + action_context = 'impersonation'.

Authorship semantics — simple, locked. Every mutation inside an active session attributes to the staff principal at the data layer AND the audit layer. The audit row carries impersonation_id; consumers that want "who really did this" follow the link. No data-layer acting_as_patient_id rebind — the alternative split-author model (forms appearing patient-authored at data layer + staff in audit log) was considered and dropped at design time. Reasoning: every Layer 2+ feature with an "author" column would have had to remember coalesce(acting_as_patient_id, current_principal_id), and one missed call site would leak staff names into patient-facing records. Foundation discipline argues against the cross-cutting invariant. See Foundation 1B.13 decisions.

Schema. patient_impersonation_sessions(id, staff_principal_id, target_patient_id, organization_id, reason, opened_at, expires_at, closed_at, closed_by_principal_id NULL). target_patient_id FKs patients(id) (per-org row); organization_id denormalized for RLS efficiency, mirroring patient_subscriptions. CHECK length(btrim(reason)) >= 10 AND expires_at <= opened_at + INTERVAL '4 hours'. Default expiry 1h, max 4h. The session row IS the artifact — no parallel "IMPERSONATE event in audit_log" representation; the audit chain is impersonation_id → patient_impersonation_sessions(id), not a self-FK.

Active-session uniqueness. Partial unique (staff_principal_id, organization_id) WHERE closed_at IS NULL — one impersonation at a time per staff member per clinic. Locked design ("one thing at a time"): the staff member must close the current session before opening another for a different patient. Distinct from break-glass's (principal, org, scope) shape because impersonation has no scope vocabulary.

Lazy expiry finalize. Mirrors break-glass. A row with closed_at IS NULL AND expires_at < NOW() is closed on the admin pool the first time Service.ActiveFor reads it — closed_at = expires_at, closed_by_principal_id = NULL (system-finalized at the natural-end moment).

Authorization. Per-org RBAC permission patients.impersonate, seeded by 000013 with admin + customer_support grants by default (specialist deliberately excluded — clinical role, not service role). DB-side: AppPool + RLS WITH CHECK (staff_principal_id = current_app_principal_id() AND organization_id = current_app_org_id() AND current_app_has_permission('patients', 'impersonate')). Same write-side pattern as consents / organization_invites — the opening principal IS an authenticated org member with the right RLS context, so the standard request-tx pattern works. Distinct from break-glass's AdminPool-with-REVOKE because platform staff (break-glass openers) lack tenant membership; impersonation openers don't.

Audit attribution. Middleware RequireImpersonation(svc, paramName) calls set_app_impersonation_session_id(<session.id>) after match; audit_log_insert reads the GUC into audit_log.impersonation_id. The function reads BOTH current_app_break_glass_id() AND current_app_impersonation_id() unconditionally so a future legitimate compounding case writes both columns correctly without another schema change — runtime cross-context exclusion (below) prevents the case from arising today.

Cross-context exclusion. Bidirectional: impersonation.Service.Open rejects when the principal already has an active break-glass session for the same (principal × org), and breakglass.Service.Open rejects symmetrically. One elevated session at a time per principal × org. Compounding contexts are forbidden at the service layer (cheap pre-INSERT check); the mutual-exclusion invariant means only one of audit_log.{break_glass_id, impersonation_id} is non-NULL on any given row in production.

Patient transparency. Patient sees their own session history at every clinic via RLS self-read on patient_impersonation_sessions (target_patient_id IN (current_human_patient_profile_ids() → patients)), surfaced through GET /v1/me/patient-impersonation-sessions. Foundation-tier scope is session metadata only (when, who, why, how long). Per-action drill-down ("what entities they touched") is a deferred concern; if/when it ships, it lands as the patient_account_activity projection table populated by triggers — not by giving patients SELECT on audit_log. Audit_log stays staff/forensic-only.

Foundation status. Shipped in Foundation 1B.13. Schema, RLS, GUC, redefined audit_log_insert, RequireImpersonation middleware, open/close/list/get endpoints, patient /v1/me/... endpoint, cross-context guard (both directions), per-principal rate limit, 14 RLS integration tests all closed. Production consumer routes (F3 forms, F5 appointments) wire RequireImpersonation when those features ship; foundation-tier acceptance is the schema + middleware + audit-attribution end-to-end. Clinic admin oversight UI (1C.2) + patient access-history UI (1C.3) consume the shipped backend when those slices ship.


P17: Unified Consent Ledger

What. A single append-on-grant ledger records every consent event across two scopes:

  • Platform scope (organization_id IS NULL): platform_terms (contract basis), platform_privacy_notice (legitimate-interest, informational acceptance). Accepted once per principal at sign-up; applies across all orgs the patient ever joins. Non-withdrawable except by account deletion.
  • Org scope (organization_id set): org_terms, org_privacy_notice, marketing_email, marketing_sms, analytics, ai_processing, plus form-driven (Tier B) medical purposes (telemedicine, video_recording, biometric_capture, treatment_specific_*) registered at F3.5. Per-clinic, per-purpose — consent at Clinic A does not extend to Clinic B.

Each purpose carries a legal_basis (contract | legitimate_interest | consent | legal_obligation | vital_interest). The withdrawable flag is derived: only consent-basis purposes accept patient-initiated withdrawal. Withdrawing org_terms at a clinic cascades withdrawal of every org-scope consent at that org and triggers patients.deleted_at at that clinic. Withdrawing platform_terms requires account deletion (triggers GDPR erasure in F11.1).

This codifies the controller/processor split: the clinic is data controller for org-scope consents and clinical records; the platform is processor. See decisions.md → Why clinic is controller, platform is processor.

Sources. Each ledger row carries a source discriminator: signup_checkbox | self_toggle | form | staff_action | api. Tier B medical consents are source = 'form' rows whose source_form_id FKs to the signed F3 form (provenance inherited through the immutable form row, not duplicated in the ledger).

Schema. Three tables — see data-model.md Area 15:

  • consent_purposes(code, scope, name, description, legal_basis, withdrawable, ...) — catalog
  • consent_purpose_versions(id, purpose_code, organization_id NULL, version, body_translations, ...) — versioned text; organization_id set = clinic override (only valid for org-scope purposes; assembled by 1B.10's privacy notice template flow for org_privacy_notice)
  • consents(id, organization_id NULL, patient_profile_id, purpose_code, purpose_version, source, source_form_id NULL, granted_at, granted_by_principal_id, withdrawn_at, withdrawn_by_principal_id, ...) — the ledger; append on grant, UPDATE-only-on-withdraw

Where it applies. Sign-up onboarding (platform consents accepted atomically with patients row creation), patient settings UI (self-toggle for marketing_* / analytics / ai_processing), every form-gated workflow (Tier B), staff-on-behalf-of-patient flows (CS rep flips marketing when patient calls in — gated by consents.manage), every appointment / video / telerehab feature that calls RequireConsent(purpose) middleware.

Re-consent middleware. Before serving any non-public route, current_required_consent_versions(principal_id, organization_id) returns the (purpose, version) pairs the user hasn't accepted yet (always includes platform purposes; includes the org's purposes when org-scoped). Non-empty result returns 412 consent_required; the client renders a blocking modal that posts back to the consent endpoint.
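
A sketch of that gate. The SQL helper name comes from the paragraph above; the types and response envelope are illustrative.

```go
import (
	"encoding/json"
	"net/http"
)

type missingConsent struct {
	Purpose string `json:"purpose"`
	Version int    `json:"version"`
}

func RequireCurrentConsents(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		tx := TxFromContext(r.Context()) // request transaction from P2
		rows, err := tx.Query(r.Context(),
			`SELECT purpose_code, version FROM current_required_consent_versions(
			     current_app_principal_id(), current_app_org_id())`)
		if err != nil {
			http.Error(w, "consent check failed", http.StatusInternalServerError)
			return
		}
		var missing []missingConsent
		for rows.Next() {
			var m missingConsent
			if err := rows.Scan(&m.Purpose, &m.Version); err == nil {
				missing = append(missing, m)
			}
		}
		rows.Close()
		if len(missing) > 0 {
			w.Header().Set("Content-Type", "application/json")
			w.WriteHeader(http.StatusPreconditionFailed) // 412 consent_required
			json.NewEncoder(w).Encode(map[string]any{
				"error": "consent_required", "missing": missing})
			return
		}
		next.ServeHTTP(w, r)
	})
}
```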

Foundation status. Ships in Foundation 1B.9. Privacy notice templates (1B.10) feed org_privacy_notice versions. Tier B medical purposes register in F3.5.


Data Modeling

P18: Versioning + Snapshots

What. Mutable templates have version history. When the template changes, the previous version is snapshotted to a _versions table. Instances reference the version they were created against, so editing a template never retroactively changes existing instances.

Tables that follow this pattern.

  • custom_fields → custom_field_versions ↔ snapshotted into forms.fields JSONB
  • form_templates → form_template_versions ↔ snapshotted into forms.fields JSONB at form creation
  • pdf_templates → pdf_template_versions ↔ referenced by appointment_documents.pdf_template_version
  • treatment_plans → treatment_plan_versions (full sessions+exercises JSONB) ↔ referenced by patient_treatment_plans.treatment_plan_version
  • segments → segment_versions (rule history)

Versioning workflow.

  1. Create template — version = 1, published = TRUE. Archive to _versions table.
  2. Edit — published = FALSE (draft state). Forms still reference the last published version.
  3. Publish — increment version, archive new state to _versions, published = TRUE. New instances use the new version.
  4. Rollback — copy a _versions row back to the main table as a draft.

Required infrastructure.

  • Convention: _versions table is append-only, has (parent_id, version) UNIQUE
  • Repository helper: "get latest published version of X"
  • Snapshot logic at instance creation

P19: Custom Fields + Profile Fields

What. Two parallel pools of patient/specialist/appointment/org attributes:

Custom fields (custom_fields + custom_field_values).

  • Org-scoped (org defines which fields exist for each entity_type).
  • entity_type: patient | specialist | appointment | organization.
  • field_type: text | textarea | select | date | checkbox | radio | number | email | phone.
  • system_key (nullable, immutable) — stable identifier for fields the system depends on (PDF templates, exports, integrations reference system_key not key/label).
  • Versioned (P18).
  • Values stored in custom_field_values keyed by (custom_field_id, entity_type, entity_id).

Profile fields (patient_profiles columns).

  • Cross-org portable, lives on the patient_profile.
  • Hardcoded canonical columns: date_of_birth, sex, phone, occupation, residence, blood_type, allergies, chronic_conditions, emergency_contact_name, emergency_contact_phone, insurance_entries (JSONB array).

Form fields can reference either.

  • custom_field_id → reads/writes custom_field_values
  • profile_field_key (one of: date_of_birth | sex | occupation | residence | blood_type | allergies | chronic_conditions | emergency_contact_name | insurance_entries) → reads/writes patient_profiles columns directly

A form field has at most one. A field with neither is a form-only field (value stored in forms.values only).

Required infrastructure.

  • Form auto-fill resolver: at form-instance creation, pre-populate values from custom_field_values + patient_profiles
  • Form save handler: write back to both stores when relevant fields change

P20: Dual-Scope (Global + Org) Tables

What. Some tables hold both platform-curated content and org-specific content in the same physical table, distinguished by organization_id IS NULL (global) vs organization_id = X (org-scoped).

Tables.

  • exercises, exercise_categories, exercise_body_regions, exercise_equipment
  • treatment_plans (with three scopes: global, org, custom-per-patient via created_for_patient_id)

RLS pattern.

```sql
CREATE POLICY x_select ON x FOR SELECT USING (
    organization_id IS NULL                                     -- global visible to all
    OR organization_id = current_app_org_id()                   -- own org's
);

CREATE POLICY x_modify ON x FOR ALL USING (
    organization_id IS NOT NULL                                 -- can't modify globals via RLS
    AND organization_id = current_app_org_id()
    AND current_app_has_permission('x', 'manage')
);
```

Globals are managed via AdminPool (bypasses RLS).

Cloning. When a clinic clones a global into their org, translations (P21) are baked into the canonical columns at clone time so the clone is fully independent.


P21: Translation (UI + Content)

What. Two distinct translation layers, both driven by organizations.language_code (ISO 639-1).

21a. UI translation — next-intl (or equivalent) on the three frontends. Static UI strings, status enum display names, date/time/number formatting. Locale derived from org.

21b. Content translation — JSONB column on platform-curated tables only.

Tables that get translations JSONB NOT NULL DEFAULT '{}'.

  • exercises (name, description, video_url, video_thumbnail_url)
  • exercise_instructions (title, content, image_url)
  • exercise_contraindications (condition_name, description)
  • exercise_categories, exercise_body_regions, exercise_equipment (name)
  • treatment_plans where organization_id IS NULL (name, description)
  • treatment_plan_sessions for global plans (name, description)

Read pattern.

```sql
SELECT
  COALESCE(translations->$lang->>'name', name) AS name,
  COALESCE(translations->$lang->>'description', description) AS description
FROM exercises
WHERE organization_id IS NULL;
-- $lang from organizations.language_code
```

What does NOT need translations.

  • Org-scoped content (services, forms, custom fields, automations, PDF templates) — created in org's language directly
  • Status enums — handled by frontend i18n
  • Patient-entered data — language-neutral

21c. Locale-aware sorting. User-input text columns (patient names, specialist names, clinic names, free-text fields) MUST sort with locale-aware collation, not codepoint order. Romanian diacritics (ă, â, î, ș, ț) all live above U+0080 — ORDER BY name and JS .sort() place "Ștefan" after "Z" instead of after "S".

  • Server (canonical, per CLAUDE.md → Production Scale "sorts always server-side"): ORDER BY name COLLATE "ro-x-icu" for Romanian-resolved requests, parameterized by the caller's locale. Bind collation per query rather than relying on column-level defaults — the same column gets sorted from multiple locales and a static default would be wrong for half of them.
  • Client (only when sorting an already-loaded slice): Intl.Collator(locale).compare. Plain .sort() and < on user-input text are forbidden.

No such sorts ship in Layer 1. The first one lands with the patient list (Layer 2.4); wire the collation in then — there's nothing to gate against today.


P22: Money / Currency

What. Money stored as DECIMAL(10,2) with a separate currency TEXT DEFAULT 'RON' (ISO 4217). No money PostgreSQL type. No floating-point.

Tables. services.base_price + currency, service_specialists.custom_price, service_plans.total_price + currency, products.price + currency, calendars.override_price.

Required infrastructure.

  • Helper for arithmetic in the API layer (Go's decimal package)
  • No multi-currency transactions in v1 — each org operates in a single currency, no auto-conversion
  • Display formatting handled by frontend i18n (P21a)
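
A sketch with shopspring/decimal; the doc pins "Go's decimal package" without naming one, so the library choice is an assumption. The rule being illustrated: no float64 anywhere, round to two places only at the DECIMAL(10,2) boundary.

```go
import "github.com/shopspring/decimal"

// lineTotal computes price * qty + VAT without ever touching float64.
func lineTotal(basePrice string, qty int64, vatRate string) decimal.Decimal {
	price := decimal.RequireFromString(basePrice) // e.g. "150.00" from services.base_price
	subtotal := price.Mul(decimal.NewFromInt(qty))
	vat := subtotal.Mul(decimal.RequireFromString(vatRate)) // e.g. "0.19"
	return subtotal.Add(vat).Round(2) // round once, at the storage boundary
}
```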

P23: Timezone-Aware Scheduling

What. All timestamps stored in UTC (TIMESTAMPTZ). Bookable slots are computed in a canonical scheduling timezone resolved per slot from a fixed chain of fallbacks. Weekly hours and date overrides are interpreted in that resolved timezone, not in the patient's browser timezone.

Resolution chain (per slot / per appointment).

  1. locations.timezone — for an in-person slot at a physical location, the location's IANA timezone wins. A London branch is London time regardless of which specialist is on duty.
  2. specialists.scheduling_timezone — for remote / telerehab slots (no location_id), or when the specialist explicitly works in a timezone different from the location's (e.g., a remote-employed specialist).
  3. organization_settings.default_timezone — org-wide fallback. The clinic's "house" timezone.
  4. Platform default — Europe/Bucharest for the RO launch. Hardcoded fallback if all of the above are NULL.

The first non-NULL value in the chain wins. Each layer is optional except the platform default.

Where it applies. Specialist availability computation, calendar bookable slots, appointment display, override windows, recurring slot generation.

Required infrastructure.

  • Postgres TIMESTAMPTZ everywhere (already standard).
  • Per-layer timezone columns: locations.timezone, specialists.scheduling_timezone, organization_settings.default_timezone — all IANA strings (VARCHAR(64)), all nullable.
  • A single ResolveSchedulingTimezone(locationID, specialistID, orgID) helper in the scheduling package — every call site goes through it; no ad-hoc fallback chains.
  • Patient sees slots in their browser's local timezone (display-only conversion); the canonical computation timezone is the resolved one above.
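
The resolution itself is a first-non-NULL walk; a sketch below, with row loading left to the real helper, which takes IDs rather than pre-fetched columns.

```go
// resolveSchedulingTimezone picks the first configured layer; every value is
// an IANA zone name (locations.timezone, specialists.scheduling_timezone,
// organization_settings.default_timezone).
func resolveSchedulingTimezone(locationTZ, specialistTZ, orgDefaultTZ *string) string {
	for _, tz := range []*string{locationTZ, specialistTZ, orgDefaultTZ} {
		if tz != nil && *tz != "" {
			return *tz
		}
	}
	return "Europe/Bucharest" // platform default for the RO launch
}

// Usage: loc, err := time.LoadLocation(resolveSchedulingTimezone(l, s, o)),
// then compute weekly hours / overrides in loc and store results as TIMESTAMPTZ.
```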

Cross-references. P40 defines where locations.timezone lives and why locations don't fragment RLS. The "single true availability per specialist across locations" invariant lives there as well — a specialist can't be in two places at once, even across locations.


P24: Polymorphic Entity References

What. A few tables reference different parent tables based on a discriminator column.

Examples in spec.

  • custom_field_values(entity_type TEXT, entity_id UUID) — entity_type is one of patient | specialist | appointment | organization, and entity_id references the appropriate table's id.
  • exercise_tags(tag_type TEXT, tag_id UUID) — tag_type is one of category | body_region | equipment.

Tradeoff. No DB-level FK enforcement; integrity at application layer. Avoids three separate junction tables.

Convention.

  • Always pair (entity_type, entity_id) or (tag_type, tag_id) together
  • Validate entity_type matches the entity's declared type at write time
  • Index on (entity_type, entity_id) for fast lookup

P25: JSONB Conventions

What. JSONB used for: snapshots (immutable), flexible config, query-able semi-structured data.

Tables and JSONB columns (from specs).

| Table | Column | Purpose | Indexed |
| --- | --- | --- | --- |
| forms | fields | snapshot of template fields | |
| forms | values | submission data | GIN |
| forms | files | file references | |
| form_template_versions | fields | snapshot at publish | |
| treatment_plan_versions | sessions_snapshot | full plan structure | |
| treatment_plans | condition_tags | TEXT[] for library filtering | GIN |
| segments | rules | rule definitions | GIN |
| segment_versions | rules | rule history | |
| custom_field_values | value | TEXT actually, not JSONB — flat string | |
| custom_fields | options | select/radio/checkbox options | |
| appointment_documents | metadata | PDF generation metadata | |
| patient_profiles | insurance_entries | array of policies | |
| pdf_templates | editor_state | block editor structure | |
| pdf_templates | layout_config | page settings | |
| pdf_templates | components_used | list of component names | |
| automation_rules | trigger_config | event-specific config | |
| automation_rules | conditions | rule conditions | |
| automation_rules | actions | ordered action list | |
| automation_executions | actions_executed | per-action results | |
| webhook_events | payload | delivered JSON body | |
| audit_log | changes | before/after diff | |
| calendar_specialists | override_weekly_hours | per-calendar hours | |
| exercises | translations | language-keyed (P21) | |

Convention.

  • Use jsonb_path_ops GIN index for tables that need fast JSONB containment queries (segment evaluation, form filtering)
  • Schema-validate JSONB at the application layer, not via Postgres CHECK constraints (too rigid)
  • Snapshot JSONB is immutable — never UPDATE in place

P26: UUIDv7 Primary Keys

What. Every table's primary key is UUID NOT NULL PRIMARY KEY DEFAULT gen_random_uuid(). Go side generates UUIDv7 via uuid.NewV7() for time-ordered inserts (better b-tree locality than v4). The DB default is gen_random_uuid() (v4) as a safety net for direct inserts.
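
With github.com/google/uuid the Go side is a one-liner; a sketch, noting that NewV7 can fail (entropy), so a v4 fallback keeps inserts flowing at the cost of ordering.

```go
import "github.com/google/uuid"

// newID prefers UUIDv7 for insert locality; the DB default gen_random_uuid()
// (v4) stays as the safety net for direct inserts.
func newID() uuid.UUID {
	id, err := uuid.NewV7()
	if err != nil {
		return uuid.New() // v4 fallback: still a valid PK, just unordered
	}
	return id
}
```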

Where it applies. Every table.

Foundation status. Implemented for tables that exist. CLAUDE.md mandates it. The feature spec schemas in apps/docs/features/ use BIGSERIAL/BIGINT PKs throughout — these are out of date and need to be updated to UUID before implementation.

Implication. All FK references in the data model are UUID. Sequence ID-based logic in spec (e.g., idx_audit_org_entity_time patterns) translates directly.


Infrastructure

P27: File Storage

What. All user-uploaded files and generated documents go to S3. Files are accessed via short-lived signed URLs (15-minute read, 5-minute write expiry). Bucket has Block Public Access enabled at the bucket level. Direct URLs are never public.

S3 key structure.

```
s3://{bucket}/
├── {org_id}/
│   ├── uploads/forms/{form_id}/field_{custom_field_id}/{uuid}.{ext}
│   ├── signatures/specialists/{specialist_id}/signature.png
│   ├── documents/{document_id}/{uuid}.pdf
│   └── templates/logos/logo.png
└── audit-archive/{org_id}/{year}/{month}.jsonl.gz
```

Cross-tenant isolation.

  • Every operation prepends {org_id}/ to the key
  • Path traversal blocked: .. in paths rejected
  • orgScopedKey() helper validates the prefix matches the authenticated org
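
A sketch of that guard; orgScopedKey is the helper named above, the body here is illustrative. The traversal check runs before path.Join, which would otherwise silently collapse the .. segments.

```go
import (
	"fmt"
	"path"
	"strings"

	"github.com/google/uuid"
)

func orgScopedKey(orgID uuid.UUID, rest string) (string, error) {
	// Reject traversal before Join cleans it away.
	if strings.Contains(rest, "..") {
		return "", fmt.Errorf("path traversal rejected: %q", rest)
	}
	key := path.Join(orgID.String(), rest)
	// Belt and braces: the final key must still live under the org prefix.
	if !strings.HasPrefix(key, orgID.String()+"/") {
		return "", fmt.Errorf("key escapes org prefix: %q", key)
	}
	return key, nil
}
```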

Where it applies.

  • Form file uploads (P19 file fields)
  • Specialist signature images
  • Generated PDFs (reports, prescriptions, certificates, invoices)
  • Org branding assets (logos)
  • Audit archive (P10 warm tier)
  • Exercise videos (alternative to Bunny Stream — exercises.video_provider enum)

Required infrastructure.

  • internal/integration/s3/ package with Upload, GetSignedURL, Delete — all org-scoped
  • Bucket policy denying public access + denying unencrypted uploads
  • MIME type validation (Content-Type header + magic bytes)
  • File size limits per field (10 MB default; 250 MB hard max)
  • Cleanup on form/org deletion

Foundation status. Implemented at 1A.8 (internal/integration/s3/). Required before forms with file uploads (Layer 4) and before document generation (Layer 7). Real bucket provisioning + bucket-policy application live in 1E.3. See reference/file-storage.md for the full convention.


P28: Internal Event Bus

What. Internal pub/sub for lifecycle events. When something interesting happens (patient.onboarded, appointment.first_booked, form.signed, treatment_plan.completed, etc.), code publishes an event to an in-process bus. Two consumer types listen:

  • Automation engine — matches events to enabled automation_rules and executes actions
  • Webhook dispatcher — matches events to webhook_subscriptions and delivers signed payloads

Event catalog (from automations spec).

  • Patient: patient.onboarded, patient.first_login, patient.profile_completed
  • Appointment: appointment.first_booked, appointment.booked, appointment.before_start, appointment.started, appointment.completed, appointment.canceled
  • Service plan: service_plan.enrolled, service_plan.session_started, service_plan.session_completed, service_plan.completed
  • Treatment plan: treatment_plan.created, treatment_plan.assigned, treatment_plan.activated, treatment_plan.session_completed, treatment_plan.completed, treatment_plan.expired
  • Form: form.created, form.completed, form.signed
  • Time: schedule.daily, schedule.weekly, schedule.date_reached

Required infrastructure.

  • In-process channel-based event bus (Go)
  • Event payload standardization (event_type, entity_type, entity_id, organization_id, occurred_at, data JSONB)
  • Both automations and webhooks consume the same catalog independently
  • Time-based events (schedule.*) need a cron / scheduler component
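
A sketch of the publish-after-commit convention; service and type names are illustrative, and the envelope fields mirror the standardized payload above.

```go
import (
	"context"
	"time"
)

func (s *PatientService) Onboard(ctx context.Context, in NewPatient) error {
	tx := TxFromContext(ctx)
	// ... INSERT the patients row inside tx ...
	if err := tx.Commit(ctx); err != nil {
		return err // nothing published; consumers never see a rolled-back row
	}
	// Publish strictly AFTER commit (architecture/events.md convention).
	s.bus.Publish(events.Event{
		EventType:      "patient.onboarded",
		EntityType:     "patient",
		EntityID:       in.ID,
		OrganizationID: in.OrgID,
		OccurredAt:     time.Now().UTC(),
	})
	return nil
}
```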

Foundation status. Implemented at Layer 1.9 (internal/core/events/). Required before automations (Layer 8) and webhooks (Layer 8). Every feature publishes its own events from day one — see architecture/events.md for the catalog and the publishing convention (publish AFTER the DB transaction commits, never inside one).


P29: Webhook Delivery

What. Org-scoped subscriptions to events. Delivery via HTTPS POST with HMAC-SHA256 signature. Retry with exponential backoff. Each delivery attempt logged in webhook_events. Idempotency keys prevent duplicate processing on retries.

Schema.

  • webhook_subscriptions(id, uid UUID, organization_id, url, description, events TEXT[], signing_secret, is_active, created_by, created_at, updated_at)
  • webhook_events(id, organization_id, subscription_id, event_type, payload JSONB, status, attempts, last_attempt_at, last_status_code, last_error, next_retry_at, created_at)

Signing. signing_secret is generated server-side at creation (32 random bytes hex-encoded, prefixed whsec_). Each delivery includes X-Webhook-Signature: t=<timestamp>,v1=<HMAC_SHA256(timestamp + body, secret)>.
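
The signature computation, as a sketch; receivers recompute the same MAC over the received timestamp and body, compare with hmac.Equal, and use the timestamp to reject replays.

```go
import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"time"
)

// sign builds the X-Webhook-Signature value: t=<unix>,v1=<hex MAC>.
func sign(secret string, body []byte, now time.Time) string {
	ts := strconv.FormatInt(now.Unix(), 10)
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte(ts)) // HMAC over timestamp + body, per the contract above
	mac.Write(body)
	return fmt.Sprintf("t=%s,v1=%s", ts, hex.EncodeToString(mac.Sum(nil)))
}
```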

Payload contract.

  • Metadata + IDs only, no PHI in webhook bodies (clinic decides what data to fetch via API after receiving the event)
  • Standardized envelope: {event_type, event_id, organization_id, occurred_at, resource_type, resource_id}

Required infrastructure.

  • Worker process or job queue for async delivery
  • Retry policy: e.g., 1m, 5m, 30m, 1h, 6h, 24h then exhausted
  • Dead-letter queue (status = exhausted) for inspection

Foundation status. Not implemented.


P30: Slot Hold System

What. When a patient selects a time slot during booking, the slot is held in Redis for ~10 minutes while they complete the booking form. Prevents double-booking without a database lock. SSE streams real-time slot availability to other connected patients viewing the same calendar.

Where it applies. Booking flow on every calendar.

Required infrastructure.

  • Redis client (already present)
  • Hold key pattern: hold:{calendar_id}:{specialist_id}:{slot_start_iso} → {client_id, expires_at}
  • TTL configurable per-calendar (calendars.cooldown_minutes-related but separate)
  • Hold release on timeout, on cancellation, or on successful booking
  • SSE endpoint for live availability
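
A sketch with go-redis (the client library is an assumption): SET NX plus a TTL is the whole mechanism. The first writer owns the hold, and expiry releases it even if the client vanishes mid-booking.

```go
import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

type Holds struct{ rdb *redis.Client }

// Acquire returns true iff this client now holds the slot.
func (h *Holds) Acquire(ctx context.Context, calendarID, specialistID,
	slotStartISO, clientID string, ttl time.Duration) (bool, error) {
	key := fmt.Sprintf("hold:%s:%s:%s", calendarID, specialistID, slotStartISO)
	return h.rdb.SetNX(ctx, key, clientID, ttl).Result()
}

// Release on cancellation or successful booking; TTL expiry covers timeout.
func (h *Holds) Release(ctx context.Context, key string) error {
	return h.rdb.Del(ctx, key).Err()
}
```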

Foundation status. Not implemented. Lands at Layer 5 (Scheduling) when the booking flow ships — see dependency-map.md § 5.4. The Redis client itself is already in Foundation 1A.


P31: Connected Account Credentials per Org (Cat B)

What. A clinic can connect its own external accounts (Google Calendar, Slack, HubSpot, Salesforce, ...) and the platform stores the resulting OAuth tokens / API keys per-org, encrypted (P12), in organization_integrations(id, organization_id, title, integration_service_id, credentials_encrypted, ...). The platform calls these services on behalf of the clinic, with credentials the clinic authorized — see glossary.md → Connected Account (Cat B). Available services come from a platform-defined catalog (integration_services).

Where it applies. Cat B integrations only — services where the clinic owns the account and the credentials. Examples: pushing appointments into the clinic's Google Calendar, posting alerts into the clinic's Slack workspace, syncing leads to the clinic's HubSpot.

Where it does NOT apply. Platform-curated providers (Cat A) — SES, Daily.co, Twilio, Anthropic, Stripe, etc. — where the platform owns the credentials. Those live in env / Secrets Manager + the foundation platform_service_providers resolution table for per-tenant brand-isolation overrides; that's a separate concern handled by Foundation 1C.2 (Curated Providers), not this pattern. See glossary.md → Integration categories for the canonical Cat A vs Cat B distinction.

Required infrastructure.

  • integration_services catalog table — platform-defined list of available Cat B services (name, OAuth/API-key shape, scopes, etc.)
  • organization_integrations rows — per-org connection, FK to integration_services, encrypted credentials
  • Per-org repo that decrypts on demand and refreshes OAuth tokens
  • No platform-default fallback — Cat B is opt-in per clinic; if no connection exists, the feature is unavailable for that org

Foundation status. Foundation 1C.5 (Connected Accounts — Cat B) — conceptual framing captured; schema design pending in a dedicated discussion chat. The schema sketch above is illustrative — the canonical design (catalog shape, per-service config shape, OAuth refresh semantics, multi-account-per-service support) lands when 1C.5 implements. The first OAuth-based clinic integration is the first CONSUMER. (For Cat A, the parallel pattern is Foundation 1C.2; Layer 6.5 Daily.co is a consumer of 1C.2, not 1C.5 — see dependency-map.md § 6.5.)


P32: Telemetry Ingest Pipeline

What. A separate Go service (services/telemetry/) that ingests two streams from the Patient Portal — pose-frame batches (binary float32 + gzip) and video-engagement events — computes aggregates server-side at session_end, persists them via events.Bus into Core API's existing Postgres, and stores per-session replay landmarks as gzipped binary blobs in S3. There is no separate ClickHouse, no separate compliance Postgres, no audit_log forwarding — the earlier P32 design has been rejected. See decisions.md → Why telemetry is PG + S3, not ClickHouse for the rationale.

Where it applies. The Patient Portal's exercise-session flow (video playback + optional MediaPipe pose tracking). No other surface ingests to telemetry.

Hot-path auth. Short-lived signed session token issued by Core API at exercise-session start (claims: principal_id, org_id, exercise_session_id, exp). Telemetry API verifies signature only — no Clerk JWT verification per pose-frame batch. HS256 today; Ed25519 reachable via the swap-point interface.
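A hedged sketch of the verifier shape: HMAC-SHA256 over a base64url JSON claims blob, standard library only. The real token format is an implementation detail behind the SignedSessionToken swap point; everything here except the claim names is an assumption.

go
package telemetryauth

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"strings"
	"time"
)

type SessionClaims struct {
	PrincipalID       string `json:"principal_id"`
	OrgID             string `json:"org_id"`
	ExerciseSessionID string `json:"exercise_session_id"`
	Exp               int64  `json:"exp"`
}

// Verify checks signature + expiry only; no Clerk round-trip per batch.
func Verify(token string, key []byte, now time.Time) (*SessionClaims, error) {
	payload, sig, ok := strings.Cut(token, ".")
	if !ok {
		return nil, errors.New("malformed token")
	}
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(payload))
	want := base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
	if !hmac.Equal([]byte(want), []byte(sig)) {
		return nil, errors.New("bad signature")
	}
	raw, err := base64.RawURLEncoding.DecodeString(payload)
	if err != nil {
		return nil, err
	}
	var c SessionClaims
	if err := json.Unmarshal(raw, &c); err != nil {
		return nil, err
	}
	if now.Unix() >= c.Exp {
		return nil, errors.New("expired")
	}
	return &c, nil
}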

Aggregator. Telemetry API computes form_score / ROM / rep_count from accumulated landmarks at session_end and publishes via events.Bus. A Core API subscriber writes the aggregates into Postgres (pose_session_metrics, pose_rep_metrics monthly partitioned per P41, media_session_metrics, media_buffering_events monthly partitioned) and updates patient_exercise_logs.video_watch_percentage / pose_accuracy_score.

Replay storage. S3 blob at s3://restartix-telemetry/{org_id}/{session_id}.bin.gz containing the full landmark stream. Lifecycled standard → IA → Glacier → expire. Replay is fetch-by-session-id via Core API (signed URL), not a queryable store.

Reads. None on Telemetry API. All reads (specialist dashboards in Clinic app, patient progress in Portal, anonymised cross-tenant counters in Console) flow through Core API → Postgres / S3. Single source of authentication, RLS, audit, classification, and per-org permission.

Consent. Two named per-purpose flags using the foundation per-purpose consent ledger (1B.9): analytics (gates media events) and biometric (gates pose ingest). No 0–3 ladder.

Required infrastructure.

  • Separate Go service with its own Fargate task, Cat F service-account principal for callbacks
  • Three typed ingest endpoints (POST /v1/pose/frames, POST /v1/media/events, POST /v1/sessions/{id}/end)
  • Signed-token issuance helper in Core API; signed-token verifier in Telemetry API
  • Swap-point interfaces: AggregateStore, AggregateQuery, SessionBuffer, ReplayBlobStore, LandmarkCodec, SignedSessionToken — repository pattern enforces no direct PG/S3 access from handlers
  • CI guard cmd/check-telemetry-bounds rejecting direct imports if the abstraction drifts

Foundation status. Out of scope for foundation. Foundation primitives this stack relies on — events.Bus (1C.3), Cat F service-accounts (1.24), per-purpose consent (1B.9), data classification (1A) — are all in place. Layer 2 ships the actual service. Full design in /telemetry/index.md and /telemetry/api.md.

Cross-references. P10 (audit log — separate concern; Telemetry does NOT consume audit_log). P28 (events.Bus — Telemetry publishes aggregation events here). P39 (data classification — Telemetry's PG aggregate columns + S3 blob class enter the registry). P41 (range-partitioned event tables — pose_rep_metrics and media_buffering_events follow this shape). P47 has a ClickHouse equivalent at Tier 3 (a mandatory org_id equality predicate guard) for the day a column store joins the stack.


Operational

P33: State Machines

What. Several entities have explicit lifecycle state machines, enforced at the handler layer.

State machines in spec.

  • appointments.status: booked → upcoming → confirmed → inprogress → done | cancelled | noshow
  • forms.status: pending → in_progress → completed → signed (signed is terminal/immutable)
  • treatment_plan_status: draft → pending_approval → active → paused → completed | cancelled | expired
  • patient_session_completions.status: in_progress | completed | partial | skipped
  • webhook_events.status: pending → delivered | failed | exhausted
  • automation_executions.status: success | partial_failure | failure | skipped
  • domain_status (already implemented): pending → verified | failed
  • service_plans.status and patient_service_plans.status: active | completed | expired | cancelled

Required infrastructure.

  • Per-entity state-transition validator in the service layer
  • Audit-log entry on every state change
  • Backward transitions explicitly disallowed where the spec requires it (e.g., signed is terminal for forms)
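A table-driven validator is one natural shape for the service-layer check. The forms machine below comes from the spec list above; the helper names are not authoritative.

go
package lifecycle

import "fmt"

// formTransitions encodes pending → in_progress → completed → signed;
// signed has no outgoing edges, making it terminal/immutable.
var formTransitions = map[string][]string{
	"pending":     {"in_progress"},
	"in_progress": {"completed"},
	"completed":   {"signed"},
	"signed":      {}, // terminal
}

// ValidateTransition rejects any edge the entity's table does not declare,
// which also rules out backward transitions by construction.
func ValidateTransition(table map[string][]string, from, to string) error {
	for _, next := range table[from] {
		if next == to {
			return nil
		}
	}
	return fmt.Errorf("illegal transition %s → %s", from, to)
}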

P34: API Contract Conventions

What. Cross-feature consistency for HTTP responses, pagination, filtering, errors.

Conventions (set once, in Layer 1.6 + 1.7). Full reference at reference/api-conventions.md and reference/error-envelope.md.

  • Error envelope — {error: {code, message, fields?}}. Stable snake_case codes; fields only on 422. Implemented at internal/shared/httputil. (Layer 1.6.)
  • Pagination — {data: [...], pagination: {page, limit, total}}. Default limit=50, max limit=500, page is 1-indexed. Helper: internal/shared/apiquery.ParsePage + NewEnvelope.
  • Filtering — flat query params (?status=active&owner_id=...). No nested or DSL syntax. Each handler reads the keys it cares about.
  • Sorting — ?sort=name,-created_at. Comma-separated, - prefix means desc. Allow-listed per endpoint via apiquery.ParseSort(r, []string{...}) — non-allowlisted fields return 422.
  • Idempotency keys — Idempotency-Key header (4–128 ASCII letters/digits/-/_). Cached per (org_id, path, key) for 24 hours. Only 2xx responses cached. Opt-in middleware at internal/core/idempotency.
  • API versioning — /v1/... prefix; major version in path; new major version = new prefix.
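For orientation, the two envelope shapes rendered as Go structs; illustrative only, not a reproduction of internal/shared/httputil.

go
package httpshapes

// ErrorEnvelope mirrors {error: {code, message, fields?}}.
type ErrorEnvelope struct {
	Error struct {
		Code    string            `json:"code"`             // stable snake_case
		Message string            `json:"message"`
		Fields  map[string]string `json:"fields,omitempty"` // only on 422
	} `json:"error"`
}

// Page mirrors {data: [...], pagination: {page, limit, total}}.
type Page[T any] struct {
	Data       []T `json:"data"`
	Pagination struct {
		Page  int `json:"page"`  // 1-indexed
		Limit int `json:"limit"` // default 50, max 500
		Total int `json:"total"`
	} `json:"pagination"`
}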

Required infrastructure.

  • Shared error helper in httputil — landed in 1.6.
  • Pagination + sort helpers in internal/shared/apiquery — landed in 1.7.
  • Idempotency-Key middleware in internal/core/idempotency — landed in 1.7.
  • OpenAPI generation — spec-first via oapi-codegen (Go) + openapi-typescript (frontend). Source of truth: apps/docs/openapi.yaml. Re-run make openapi (Go) and pnpm openapi (TS) after editing. Drift test at internal/core/server/openapi/spec_test.go enforces sync.

Foundation status. Standardized in Layers 1.6 and 1.7. First Layer 2 endpoints will exercise pagination/sort/filter against real data; Idempotency-Key gets first real use in Layer 6 (booking + appointment creation).


P35: Activity Timestamps

What. Best-effort "last activity" / "last seen" timestamps. Different from audit log — these are display metadata, not compliance data. Acceptable precision: minutes.

Columns.

  • humans.last_activity TIMESTAMPTZ — bump on every authenticated request from a human principal.
  • organization_memberships.last_used_at TIMESTAMPTZ — bump on every staff request scoped to that org. Works for any principal type — humans today, agents and service accounts when those middlewares ship.
  • patients.last_used_at TIMESTAMPTZ — bump on every patient request scoped to that org. Patients are not memberships post-1.26, so the org-scoped bump splits by session shape: staff sessions write the membership column, patient sessions write the patient column.
  • organizations.last_activity_at — intentionally not stored; derived from MAX(organization_memberships.last_used_at) WHERE organization_id = ?.

Where it applies. Console "inactive humans" view, per-human "memberships across orgs," admin "last login," potential auto-cleanup jobs.

Required infrastructure.

  • Middleware-side bump (simple) or Redis-buffered batch update (kinder to DB); resolved in favour of the middleware-side throttled writer (see Foundation status)
  • Documented as best-effort, NOT a substitute for audit log

Foundation status. Implemented at Layer 1.11 (internal/core/activity/); patient-session split landed at Layer 1.26. humans.last_activity, organization_memberships.last_used_at, and patients.last_used_at are all bumped by a middleware-side throttled writer (~once per minute per key, in-process cache); the org-scoped path picks the right column from Subject.IsPatientSession. organizations.last_activity_at is intentionally NOT stored — derived from MAX(organization_memberships.last_used_at) WHERE organization_id = ? when a UI surface needs it. See reference/activity-tracking.md for the full convention.
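The throttle shape, sketched under the assumption of a mutex-guarded in-process map; the real writer in internal/core/activity/ may differ.

go
package activity

import (
	"sync"
	"time"
)

type Throttle struct {
	mu   sync.Mutex
	last map[string]time.Time
	min  time.Duration // ~1 minute per key
}

func NewThrottle(min time.Duration) *Throttle {
	return &Throttle{last: make(map[string]time.Time), min: min}
}

// ShouldBump reports whether this key's column is due for a write; the
// caller fires the UPDATE only when true, keeping DB load bounded.
func (t *Throttle) ShouldBump(key string, now time.Time) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if now.Sub(t.last[key]) < t.min {
		return false
	}
	t.last[key] = now
	return true
}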


P36: Reserved Columns Inventory

What. Columns reserved on existing tables for future features so they're a column-not-table addition later.

Reserved columns in spec. Catalog with current write status lives at architecture/reserved-columns.md — single source of truth, kept in sync as columns light up. Don't duplicate the per-column status here; it drifts.

Foundation status. Inventory pattern adopted at Layer 1.12. The /new-migration skill calls out the inventory when scaffolding tables that need reserved columns.


Commercial Model

P37: Snapshot-on-Subscribe

What. When a tenant subscribes to a tier (or attaches an add-on, buys a usage pack), the tier's entitlements and limits are snapshotted onto the subscription rather than referenced by FK. Tier edits never modify existing subscribers' rows. This is the pricing-grandfathering invariant: a customer who signed up at Pro / 490 RON / 1,000 patients keeps those terms even after Pro is raised to 590 RON / 2,000 patients.

Where it applies. Every per-tenant attachment to a versioned platform-wide catalog where renegotiation must be explicit:

  • organization_subscriptions → tiers (the canonical case)
  • organization_subscription_entitlements snapshots tier_entitlements at subscribe time
  • organization_subscription_limits snapshots tier_limits at subscribe time
  • patient_service_plans snapshots service_plans.sessions_total etc. at enrollment (Area 3) — same pattern, predates this catalog entry

Why not just FK to the live tier. Three reasons:

  1. Legal. Silent contract change after acceptance is fraught — RO and EU consumer law gives standing to complain.
  2. Accounting. Auditors need to know "what did this customer have between dates X and Y" without time-traveling the catalog.
  3. Operational. Pricing experimentation is unsafe if every change ripples to existing customers — you'd be too scared to publish.

Required infrastructure.

  • A _versions table on the catalog (tier_versions for tiers), append-only (P14a), with full JSONB snapshots of entitlements+limits+metadata at each publish.
  • A snapshot table per derived dimension (organization_subscription_entitlements, organization_subscription_limits) populated at INSERT of the subscription.
  • A subscription column tier_version INT NOT NULL recording which catalog version was snapshotted.
  • An override layer (organization_subscription_overrides) for sales-granted exceptions that compose on top of the snapshot without modifying it.
  • Resolver: at request time, read snapshot rows + active overrides; never join to tiers/tier_entitlements directly.
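The resolver's read path, reduced to its essence. Types and names here are hypothetical; the invariant (never join the live catalog) is the point.

go
package subscription

// Entitlements maps entitlement code → enabled, as read from the
// snapshot and override tables — never from tiers/tier_entitlements.
type Entitlements map[string]bool

// Resolve composes the subscribe-time snapshot with sales-granted
// overrides; the live catalog can change freely without touching either.
func Resolve(snapshot, overrides Entitlements) Entitlements {
	out := make(Entitlements, len(snapshot))
	for code, enabled := range snapshot {
		out[code] = enabled
	}
	for code, enabled := range overrides { // overrides compose on top
		out[code] = enabled
	}
	return out
}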

Cross-references. tiers-and-subscriptions.md § Catalog tables / Per-org subscription tables. P14b (immutable-after-state-transition) for the same shape applied to forms and signed PDFs. P18 (versioning + snapshots) is the more general data-modeling parent of this pattern.

Foundation status. Schema designed; lands at Layer 1.20. Resolver lives at Layer 1.22 (callable infrastructure with stub semantics until first feature consumer arrives).


P38: Entitlement-Based Regulatory Gating

What. Clinical/regulated code reads only from a typed organization_entitlements table — never from plan / subscription / entitlement tables directly. The plan engine writes entitlements (projecting organization_subscription_entitlements where entitlements.regulated = TRUE), but the regulated code paths never reach back into the plan engine. This decouples the SaMD verification scope (IEC 62304) from the commercial pricing engine.

Why this shape. If clinical code read organization_subscription_entitlements directly, the plan engine + the subscription state machine + every billing-related table change would sit inside the regulated-software boundary. Every Pro→White-Label pricing experiment would trigger regulatory re-validation. The organization_entitlements projection compresses the regulated read surface to a small, stable, typed table whose only writers are the plan engine (one direction, audited) and superadmin (manual override path).

Where it applies. Every clinical / SaMD-scope entitlement that needs a per-tenant on/off:

  • telerehab_enabled — gates treatment-plan creation, exercise prescription, telerehab portal flows
  • treatment_plans_enabled — finer-grained sub-toggle for clinics that want telerehab without the full plan model
  • video_consultations_enabled — gates Daily.co / WebRTC clinical video integration
  • pose_estimation_enabled — gates camera-based measurement (likely Class IIa)
  • New regulated entitlements add columns; the table is typed and queryable for compliance audits.

Required infrastructure.

  • organization_entitlements table with one BOOLEAN column per regulated entitlement, all default FALSE (fail-closed).
  • RLS: SELECT for org members; no UPDATE policy on AppPool — only AdminPool (superadmin) writes. This is the trust boundary.
  • SQL helper current_app_has_org_entitlement(entitlement_code TEXT) callable from RLS policies and Go code.
  • Subject.HasOrgEntitlement(code) helper backed by a per-request load of the row.
  • RequireOrgEntitlement(code) middleware sugar (or inline principalCtx.HasOrgEntitlement checks).
  • Plan-engine projection service SubscriptionService.RecomputeOrgEntitlements(orgId) that runs on subscription/override mutations and writes entitlements one-way.
  • Audit on entitlement changes uses action_context = 'org_entitlement_change' for distinct retention/alerting.

The single load-bearing rule. Regulated Go code reads principalCtx.HasOrgEntitlement(code). Never principalCtx.HasTierEntitlement(code) and never the subscription tables. The reverse direction (tier engine writing entitlements) is fine and expected. This rule is what keeps the SaMD scope decoupled from commercial logic.
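A hedged sketch of the middleware sugar. The Subject plumbing is stylised; the fail-closed default is the pattern's requirement.

go
package middleware

import "net/http"

// Subject stands in for the request principal; the real helper is
// Subject.HasOrgEntitlement backed by a per-request row load.
type Subject interface {
	HasOrgEntitlement(code string) bool
}

func RequireOrgEntitlement(code string, subjectFrom func(*http.Request) Subject) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			// Fail-closed: a missing row and a FALSE column both deny.
			if !subjectFrom(r).HasOrgEntitlement(code) {
				http.Error(w, "entitlement required", http.StatusForbidden)
				return
			}
			next.ServeHTTP(w, r)
		})
	}
}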

Manual override. Superadmin can write organization_entitlements directly via Console UI without going through the plan engine. Use cases: incident kill-switch, pre-plan-engine state (Layer 1 ships before any regulated entitlement appears in a plan), per-org regulatory exceptions (e.g. a US clinic blocked from pose_estimation_enabled until HIPAA review). The plan-engine projection is convergent — manual disables get re-enabled on the next projection unless the underlying entitlement is removed from the org's plan.

Cross-references. middleware-composition.md § Regulated boundary, org-settings.md § organization_entitlements, P3 (per-org permission RBAC) which composes with this pattern at the request gate.

Foundation status. Schema lands at Layer 1.19. SQL helper + middleware + Subject extension land at Layer 1.22. Tier-engine projection service lands at Layer 1.20 as a no-op (no regulated entitlements in seeded tiers yet); activates when the first regulated entitlement appears in a tier.


Data Egress

P39: Egress Data Classification

What. Every column the platform stores is classified once in the registry at architecture/data-classification.md — a class (what kind of data it is) and a list of allowed egress targets (where it may flow outside the tenant). Egress paths consult the runtime helper before constructing a payload; they never hand-build the field list. Default is block: a column missing from the registry, or with no matching egress target, cannot leave the tenant.

Where it applies. Every code path that pushes data outside the tenant boundary:

  • bulk_export — GDPR Art. 20 patient portability exports
  • analytics_internal — placeholder for any future genuinely cross-tenant aggregate egress (Telemetry P32 itself does NOT consume this target — readers are clinic-scoped, see decisions.md → Why telemetry is PG + S3, not ClickHouse)
  • webhook_egress — outbound webhooks (P29)
  • marketing_email — Layer 8 marketing campaigns
  • support_export — break-glass support exports
  • ai_clinical_drafting, ai_admin_summarization — AI placeholders, light up when the first AI feature ships

The single load-bearing rule. Egress code calls classification.AllowedFor(table, target) or classification.Filter(record, target) and uses the result to build the payload. It never references columns by name without going through the helper. This rule is what keeps "what can leave" auditable from a single doc rather than a code review across every egress site.
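What an egress site looks like when it obeys the rule. The Registry interface stands in for the real classification package; AllowedFor's signature comes from the infrastructure list below, everything else is assumed.

go
package export

// Registry abstracts the runtime classification helper.
type Registry interface {
	AllowedFor(table, target string) []string
}

// ProjectForEgress keeps only registry-approved columns. A column missing
// from the registry never appears in the allowed list: default is block.
func ProjectForEgress(reg Registry, table, target string, row map[string]any) map[string]any {
	out := make(map[string]any)
	for _, col := range reg.AllowedFor(table, target) {
		if v, ok := row[col]; ok {
			out[col] = v
		}
	}
	return out
}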

Required infrastructure.

  • Markdown registry at apps/docs/architecture/data-classification.md — one row per (table, column), source of truth.
  • CI check at services/api/cmd/check-classification parsing migrations + registry — fails the build if a schema column is missing from the registry, the registry references a non-existent column, or a class/target name is undefined. Wired into make check and the GitHub Actions PR pipeline.
  • Runtime helper at services/api/internal/shared/classification/ parsing the registry once at startup. Exposes AllowedFor(table, target) []string and Filter(record any, target string) any. Process refuses to start if the registry is malformed.
  • Every new migration adds a registry row per new column in the same PR. The /new-migration skill reminds the author at scaffolding time.

Cross-references. P11 (sensitive data redaction in logs) — same instinct in a different place: never hand-build payloads that may contain sensitive fields. P32 (Telemetry ingest) classifies its PG aggregate columns + S3 blob class in the registry. P29 (webhook delivery) consumes webhook_egress. The AI-first ADR's Hook 3 (decisions.md → Why no AI-first architecture) calls out classification as one of the three foundation hooks AI requires.

Foundation status. Implemented at Layer 1.25. Registry, CI check, and runtime helper all land in the same layer. First runtime consumers are Layer 8 (webhooks, marketing email).


Tenancy Sub-Structure

P40: Locations as Logistics Layer

What. A clinic (organization) may operate at one or more physical locations (branches / sites). Locations are a logistics layer on top of org-scoped tenancy, not a second scoping dimension. The org remains the trust boundary, the controller, and the RLS scope; locations partition appointments, schedules, and availability for operational use without fragmenting permissions, consents, or patient identity.

Stays org-wide (does not fragment by location).

  • Patient identity and patient_profiles — one record per clinic; the patient may book at any of that clinic's locations.
  • Consents — the consent ledger is per-(patient, org, purpose). A patient consents to "this clinic," not "this branch."
  • organization_entitlements — telerehab / video / pose flags are org-wide. No per-location entitlement fragmentation in v1.
  • Org-level subscription, billing, tiers — one org, one subscription.
  • Audit log — no audit_log.location_id column. Audit rows reach a location via the entity they reference (appointment, calendar, etc.).
  • RBAC — no per-location permission scoping. All staff in the org see all locations. Per-location restrictions can be added later via a permission-scope filter; intentionally deferred until a real customer requires it.
  • RLS — no current_app_location_ids() helper. Org-scoping (organization_id = current_app_org_id()) remains the only RLS dimension.

Partitions by location.

  • Specialist availability: specialist_weekly_hours.location_id NULL-able (NULL = remote / telerehab availability).
  • Schedule overrides: specialist_schedule_overrides.location_id NULL-able.
  • Specialist↔location mapping: specialist_locations(specialist_id, location_id) many-to-many — a specialist can practice at multiple locations.
  • Calendars: calendars.location_id NULL-able. Some calendars are location-pinned ("Centru initial assessment"); telerehab calendars are org-level.
  • Appointments: appointments.location_id NULL-able. NULL = remote / telerehab session; set = in-person at that location.

The single hard invariant: one true availability per specialist.

A specialist physically cannot be in two places at once. The data model enforces this with a DB-level exclusion constraint, not application logic:

sql
-- Illustrative; real shape lives in the migration when specialist_weekly_hours ships.
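-- The equality members (WITH =) in a gist EXCLUDE constraint assume the
-- btree_gist extension is enabled.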
ALTER TABLE specialist_weekly_hours ADD CONSTRAINT no_specialist_overlap
  EXCLUDE USING gist (
    specialist_id WITH =,
    day_of_week WITH =,
    tstzrange(start_time, end_time, '[)') WITH &&
  );

The same constraint shape applies to specialist_schedule_overrides for date-bound entries. The implication: locations partition the labels on a specialist's availability, not the availability itself. The booking engine (F4) computes free slots from the union of all the specialist's commitments across all locations + remote, then filters by the requested location at display time — never the other way around.

Org with zero locations — a pure-telerehab clinic operates without any physical premises. All appointments carry location_id = NULL. The booking UI skips the location picker; no "Virtual" placeholder row is created. Schema must explicitly allow 0 rows per org.

Org with one location — UI auto-picks the only active location at booking time; no picker shown. Schema-wise indistinguishable from the multi-location case.

Where it applies. locations table + every clinical table that references a physical place: specialists, specialist_weekly_hours, specialist_schedule_overrides, specialist_locations, calendars, appointments, future rooms if/when in-person resource booking ever ships.

Required infrastructure.

  • locations table — see data-model.md § Area 1.
  • locations.manage permission seeded for the admin system role template.
  • ResolveSchedulingTimezone() helper (see P23) — the location's timezone is the first link in the resolution chain.
  • Address fields are structured (address_line1, address_line2, city, county, postal_code, country), never freeform single-line — only chance to do this right before clinic-side data accumulates.

What we deliberately did NOT do.

  • No current_app_location_ids() RLS helper. Org-scoping is the only RLS dimension.
  • No location_entitlements table. Entitlements are org-wide.
  • No per-location billing / pricing differentiation. One subscription per org.
  • No services_per_location table — the service catalog is org-wide. If a clinic ever needs "spinal MRI only at Centru," that's a future ADR, not v1.
  • No inter-location transfer workflow — point the next appointment at the other location.

Foundation status. Schema lands at Layer 1B.14. Specialists, calendars, appointments take their location_id columns when they ship (Layer 2 — F4 / F5).

P41: Range-Partitioned Event Tables

What. Tables that record events (one row per occurrence, append-only, time-ordered, multi-year retention) are range-partitioned monthly on their primary timestamp column from day one. Tables that record state (mutable rows, queried by entity ID, lifecycle bound to other entities) stay flat.

The heuristic: events get partitioned, state doesn't. A row representing a happening (an audit entry, a webhook delivery attempt, a notification send, an AI inference, a sensor reading) is partitioned. A row representing a thing (an appointment, a patient profile, a treatment plan, a consent) is not — even if it has a created_at column.

Where it applies (today and as features land).

  • Today: audit_log and audit_ai_provenance (range by month on created_at / audit_log_created_at). See foundation 1A.15.
  • When the feature ships: webhook delivery log (Area 13 / F8), notification send log, AI agent action log, session/login events, patient measurement timeseries (sensor data from telerehab features). Each of these gets its partitioning baked in at table creation, never retrofitted.

Where it does NOT apply.

  • appointments, forms, treatment_plans, prescriptions, consents, patients, principals — state records. Volume is bounded by patient × visit × clinic; queried by ID, not by time range; retention is "forever" but row count grows linearly with active patients, not with mutations. A flat table with the right indexes carries these for the platform's lifetime.

The four design choices, fixed for every event table.

  1. PK includes the partition key. Postgres requires every unique index on a partitioned table to include the partition key. So audit_log PK is (id, created_at), not (id). id remains logically unique; the timestamp participates only at the storage layer.
  2. FKs to a partitioned event table are composite. Any sibling table that references an event row carries both the FK column and the partition timestamp, FK'd as (parent_id, parent_created_at) → parent(id, created_at). The sibling typically partitions on the same window so the parent + child handoff together at archive time. Application code captures both via INSERT ... RETURNING id, created_at.
  3. No DEFAULT partition. A missed rollover surfaces as INSERT failures rather than silently piling rows into a default that later blocks attaching new ranges. Loud failure is preferable to silent fall-through, especially for audit-grade tables where gaps are themselves a compliance finding.
  4. Migration seeds the current month only; the rollover cron extends the runway. Long pre-seeded windows mask cron failure during staging soak — by seeding minimally, the rollover (cmd/audit-partition-roll, default -ahead=3) is exercised in real environments. Same principle generalises to any cron-fed fixture.

Required infrastructure (per-table, when adding a new event table).

  • Add the table to audit.partitionedTables (or build a sibling registry if the rollover needs different cadence).
  • Mirror the audit_log shape: PARTITION BY RANGE (timestamp_column), composite PK, monthly children with _YYYY_MM suffix, range ['YYYY-MM-01', '(YYYY-MM+1)-01').
  • Wire into the staging scheduler's existing CronJob, not a new one.
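The rollover's core step reduces to string assembly; a hedged sketch, not cmd/audit-partition-roll itself.

go
package partition

import (
	"fmt"
	"time"
)

// ChildDDL returns the CREATE TABLE for the monthly child covering t,
// using the _YYYY_MM suffix and [first-of-month, first-of-next-month) range.
func ChildDDL(parent string, t time.Time) string {
	start := time.Date(t.Year(), t.Month(), 1, 0, 0, 0, 0, time.UTC)
	end := start.AddDate(0, 1, 0)
	child := fmt.Sprintf("%s_%04d_%02d", parent, start.Year(), int(start.Month()))
	return fmt.Sprintf(
		"CREATE TABLE IF NOT EXISTS %s PARTITION OF %s FOR VALUES FROM ('%s') TO ('%s')",
		child, parent, start.Format("2006-01-02"), end.Format("2006-01-02"),
	)
}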

What we deliberately did NOT do.

  • No pg_partman (yet). The rollover is ~50 lines of Go; pg_partman adds a dep + an extension to the SOUP inventory for value the in-house cron already delivers. Re-evaluate if a future event table needs sub-monthly partitions or retention-driven detach automation.
  • No hash partitioning by org_id. Hash-by-tenant fits enterprise SaaS with a few huge tenants; this platform targets small-to-medium clinics where RLS handles isolation and partitioning would add complexity without payoff.
  • No partition on state tables (appointments, forms, etc.). See "Where it does NOT apply" above. The threshold to revisit: a state table approaching ~50M rows where retention sweeps would DELETE >10M at once. None of the platform's state tables hit that under realistic projections.

Foundation status. audit_log + audit_ai_provenance partitioned at Layer 1A.15. Future event tables follow the same pattern at their creation migration.


Frontend Performance

The Next.js apps (clinic, portal, console) talk to the Core API on the server side — server components fetch data inside RSC, server actions call the API on form submit. At fleet scale (5–15 Next.js instances × 5–10 Core API instances) the boring infrastructure layers — HTTP keep-alive, connection pooling, and stale-while-revalidate caching — dominate the latency budget. Three patterns lock the conventions in.

P42: Server-Side Response Caching with Scope-Keyed Tags

What. GET endpoints in packages/api-client that read slow-moving data opt into Next.js's unstable_cache by passing a cacheTags array. The tag string is used as BOTH the cache key AND the invalidation tag — encoding the data scope (platform / org / user). Server actions that mutate the underlying data invalidate via updateTag() from next/cache. The tag taxonomy lives in packages/api-client/src/cache-tags.ts and is the single source of truth for both sides.

Why unstable_cache, not Next.js's fetch tags

Next.js's built-in cache: "force-cache" + next: { tags } mechanism hashes the entire request headers object into the cache key — only traceparent / tracestate are excluded. Every Clerk JWT regeneration changes the Authorization header → different cache key → cache writes happen but hits never do. Verified live in foundation 1D: the fetch-tag approach produced 5+ on-disk entries for the same URL with zero hits across consecutive requests. unstable_cache is the higher-level API where the cache key is OUR choice — using the data-scope tag string instead of request properties.

Next.js 16: updateTag not revalidateTag from server actions

Next.js 16 split the API: updateTag(tag) is the server-action variant (read-your-own-writes); revalidateTag(tag, profile) is the cache-purge variant for ISR / route handlers. Always use updateTag from server actions so the acting principal's next read sees the new state.

The invalidation contract is two calls — invalidate AND refresh()

Cache invalidation alone (updateTag, revalidatePath) marks the cache as stale for the next read. It does NOT trigger the currently loaded page to re-render. Without refresh(), the user sees stale data until they manually navigate or reload.

After every server action that mutates user-visible state, the contract is:

  1. Invalidate the cache layer for the affected data — updateTag(CacheTags.x(...)) if tagged, revalidatePath(...) if path-cached but untagged
  2. refresh() from next/cache — tells the current client to re-fetch the route's RSC payload immediately

refresh() is the new Next.js 16+ function, explicitly designed for this case (the type definition: "useful as dynamic data can be cached on the client which won't be refreshed by updateTag"). It's a no-op when called after a navigation that's about to happen anyway (e.g. redirect()), so the safe rule is: always call refresh() after invalidation in server actions that don't redirect(). Discovered live: the patient consents grant action invalidated correctly but the page kept showing the old version until manual refresh — refresh() was the missing call.

The four tag namespaces — and what they're safe to share.

| Namespace | Visible to | Invalidated by |
|---|---|---|
| platform:{resource} | Every caller, every tenant | Superadmin Console mutation |
| org:{id}:{resource} | All staff at one org | Org-admin server action at that org |
| me:{principal_id} | One principal | Same principal's own server action |
| org:{id} | All staff at one org (broad) | Anything that fans out across an org |

Tag mismatch = cross-tenant data leak. Tagging a per-org response with platform:* would serve org A's data to org B on the next read. The taxonomy enforces scope; new endpoints pick the namespace whose visibility matches what the response contains.

When to use which.

| Endpoint shape | Tag | Why |
|---|---|---|
| Read-only catalog, slow-moving (plan catalog, legal templates, role definitions) | platform:{resource} | Read by every tenant; invalidated by superadmin only — cache hit ratio approaches 1. |
| Per-tenant state read on every page (org details, member list, branding) | org:{id}:{resource} | Read by all staff at one org; invalidated by org-admin actions. |
| Per-user state (my consents, my profile) | me:{principalId} | Invalidated by the same principal's own mutations. |
| Hot mutable state (today's appointments, current chat, in-flight booking) | NO tag — keep cache: "no-store" | Freshness cost of stale read > latency win. |

Required infrastructure.

  • packages/api-client/src/cache-tags.ts — CacheTags.{org,orgResource,platform,me} builders. Hand-rolling tag strings drifts; always go through these.
  • ApiClient.request() accepts { cacheTags?: string[] } — when present on a GET, wraps the fetch in unstable_cache(fn, tags, { tags, revalidate: false }). Default stays cache: "no-store".
  • Mutations: server actions call updateTag() (Next.js 16+) with the matching builder before returning.

Discipline.

  • Read site and write site go through the SAME CacheTags.x() builder. A typo on either side means cache miss forever or stale read forever (Next.js's tag match is exact string).
  • Declare tags for endpoints in packages/api-client/src/client.ts, never inline at the call site. Centralizes the choice and keeps the convention auditable.
  • The /new-domain slash command template includes the tagging step — every new resource decides up front whether it qualifies.

Canonical wired example. getOrganization(id) in packages/api-client/src/client.ts tags with org:{id}:summary — the same scope identifier the Core API service uses as its Redis P45 key. The Console updateOrganizationAction calls updateTag(CacheTags.orgResource(id, "summary")) after a successful update; the Core API Service.Update path invalidates the matching Redis key. Both layers stay in sync from one server-action call.

Cache safety. Per-org cached endpoints REQUIRE P47: URL ≡ Scope Guard on the route group. RLS hides the first mis-scoped request, but the Next.js cache then serves that response to every future caller of the same URL. The middleware closes the gap at the route layer.

Additional wired examples.

  • listLegalDocumentTemplates + listLegalDocumentStaleOrgs (platform scope) — Console template management; demonstrates platform:* tagging + invalidation from publishLegalDocumentTemplateAction. Low real-world cache value (Console superadmin scope, ~3 users), kept as a simple platform-scope reference.
  • listConsentPurposes (platform OR org scope, depending on organizationId) — Portal-critical. Read by every patient on every consent gate check; this is the high-traffic case at Phase 2 scale.
  • listPlatformConsentPurposeVersions (platform scope) — Console superadmin-only counterpart for the platform-legal-docs editor.

Live-verified at 1D. Hit + refresh on the canonical example = 1 GET + 0 GET (cache hit); publish + refresh = 1 POST + 1 GET (invalidated, repopulates) + 0 GET (cached again). Run against a production-built Console.

Foundation status. Pattern + scaffolding + canonical example in place at Layer 1D. Existing endpoints get tagged organically as features touch them; new endpoints decide via /new-domain.

P43: Tuned undici Dispatcher

What. Next.js's fetch() delegates to undici. Undici's default keepAliveTimeout is ~4s — at fleet scale (multiple Next.js instances under bursty traffic) idle moments cycle TCP connections to Core API, causing TIME_WAIT pile-up and reconnect-latency spikes. We install a tuned global Agent in each app's instrumentation.ts (keepAliveTimeout: 30s, connections: 64 per origin). 30s is well under the Core API's IdleTimeout: 60s, so the client never tries to reuse a connection the server just closed.

Required infrastructure.

  • packages/api-client/src/dispatcher.ts — installCoreApiDispatcher().
  • apps/{clinic,portal,console}/instrumentation.ts — Next.js register() hook, gated on NEXT_RUNTIME === "nodejs". Edge runtime cannot install a dispatcher (sandbox).
  • undici direct dep on packages/api-client (SOUP row, Critical risk).

Foundation status. Wired at Layer 1D. Same dispatcher tuning ships to AWS — the Next.js apps run on ECS Fargate in the Node runtime.

P44: Connection Pooling via pgbouncer

What. The Core API talks to Postgres through pgbouncer in transaction pool mode. Locally pgbouncer runs as a docker-compose service on port 6432; on AWS it runs as an ECS Fargate service alongside the application services, reached over the private subnet. Migrations bypass pgbouncer (they use session-scoped pg_advisory_lock); runtime traffic always goes through it.

Why. At fleet scale (5–10 Core API instances × DB_POOL_MAX=25 × 2 pools = 250–500 conns), Postgres's default max_connections=100 + ~10MB RSS per conn breaks down. pgbouncer multiplexes many app connections onto a small set of backend connections. RDS Proxy is the AWS-native alternative; it pins on prepared statements (which pgx uses by default), defeating the multiplexing benefit — see aws-infrastructure.md for the full trade-off.

Compatibility.

  • Transaction pool mode is compatible with our setup because RLS session vars are scoped to the request transaction (set_config(..., true)), not to the connection — see feedback_rls_session_vars_via_ctx_conn.
  • Session-mode features (advisory locks, LISTEN/NOTIFY, SET, temp tables) are NOT used in this codebase. New code MUST NOT introduce them — code review enforces.
  • pgx's default QueryExecModeCacheStatement (named prepared statements) is supported by pgbouncer 1.21+ via protocol-level prepared statement tracking. max_prepared_statements = 200 in pgbouncer.ini; without it, prepared statements would pin connections.

Required infrastructure.

  • services/api/deploy/pgbouncer/pgbouncer.ini + userlist.txt — local config. Production uses scram-sha-256 + auth_query (see deployment doc).
  • services/api/docker-compose.yml pgbouncer service — edoburu/pgbouncer:v1.25.1-p0.
  • DATABASE_URL / DATABASE_APP_URL point at port 6432; DATABASE_DIRECT_URL points at 5432 for migrations.
  • Makefile migrate-up / migrate-down use DATABASE_DIRECT_URL.

Foundation status. Local docker-compose + Makefile wiring at Layer 1D. AWS ECS Fargate deployment lands with the AWS migration (see aws-infrastructure.md).

P45: Redis-Backed Query Cache (Cache-Aside)

What. Repository read paths in Core API wrap their DB queries with cache.Aside from internal/core/cache. The first request populates Redis with the JSON-marshalled response; subsequent requests at the same data scope read from Redis and skip Postgres entirely. Write paths invalidate by calling cache.Invalidate(ctx, redis, key) listing every cache key whose response could now be stale.

Why this layer matters even though P42 already caches at Next.js.

Next.js's unstable_cache is per-Next.js-process. With N Next.js instances behind a load balancer, each instance maintains its own cache — first request to each instance still hits Core API. Redis at Core API is shared across the entire fleet: the first request from any Next.js instance populates the cache for all of them.

At 10k concurrent Portal users across one clinic:
  Without Redis: 10 Next.js instances → 10 Postgres queries (one per instance)
  With Redis:    10 Next.js instances → 10 Core API calls →
                  1 Postgres query (the very first), 9 Redis hits

The two layers compose:

  • Next.js cache hit: 0 Core API calls
  • Next.js miss + Redis hit: 1 Core API call, 0 Postgres queries
  • Next.js miss + Redis miss: 1 Core API call, 1+ Postgres queries

Cache key namespace mirrors P42. Use the same scope identifiers. cache.Platform("plan-catalog") produces platform:plan-catalog — grep for that string and find every read AND every invalidation across both layers (CacheTags in packages/api-client/src/cache-tags.ts, key builders in services/api/internal/core/cache/).

| Helper | Produces | Use for |
|---|---|---|
| cache.Platform(parts...) | platform:{joined} | Read responses identical for every caller |
| cache.OrgResource(orgID, parts...) | org:{orgID}:{joined} | Read responses identical for all staff at one org |
| cache.Me(principalID, parts...) | me:{principalID}:{joined} | Caller-bound state (use sparingly — high mutation rate often makes caching net-negative) |

When to wrap a read. All of these must hold:

  • Response is identical across multiple concurrent callers at the same scope (clinic branding for every patient at that clinic; platform plan catalog for everyone).
  • Read:write ratio > ~10:1 (data changes infrequently relative to read frequency).
  • Stale reads up to TTL are acceptable. If users would notice 5-min-old data, lower the TTL or don't cache.

Invalidation rule. Each write path knows the specific keys it produces. After a successful mutation, call cache.Invalidate(ctx, redis, key1, key2, ...) listing every key whose response could now be stale. There is no broad-pattern invalidation at this layer — SCAN + DEL to evict "all org:X:* keys" blocks Redis under load, and tracking sets add complexity. Broad invalidation lives at the Next.js layer (P42 tag taxonomy). This layer is per-key-explicit by design.

TTL choice. Pick the shorter of:

  • The longest staleness users would tolerate
  • Twice the typical write interval for the scope

5 min is a reasonable default for slow-moving catalog data. Shorter (30s–1min) for "list of available time slots" style data where freshness still matters. Longer (1h+) only for genuinely immutable platform data.

Required infrastructure.

  • services/api/internal/core/cache/cache.go — Aside, Invalidate, Platform, OrgResource, Me.
  • The service struct holds redis *redis.Client. Constructor accepts it. Tests pass nil to disable caching (helper degrades to a direct DB call — same code path).
  • The cache is a performance layer, not a correctness layer. All Redis errors are logged but never propagated; the underlying fetch is always authoritative. If Redis is down, the request proceeds against Postgres at full cost.
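The aside shape, sketched with a local generic stand-in for cache.Aside (whose real signature lives in internal/core/cache). Note how a nil client degrades to the direct fetch, the same property the tests rely on.

go
package organization

import (
	"context"
	"encoding/json"
	"time"

	"github.com/redis/go-redis/v9"
)

// aside is a hypothetical stand-in for cache.Aside: try Redis, fall back
// to the fetch, populate on the way out. Redis errors degrade silently to
// the direct DB call; the underlying fetch is always authoritative.
func aside[T any](ctx context.Context, rdb *redis.Client, key string, ttl time.Duration, fetch func(context.Context) (T, error)) (T, error) {
	if rdb != nil {
		if raw, err := rdb.Get(ctx, key).Bytes(); err == nil {
			var v T
			if json.Unmarshal(raw, &v) == nil {
				return v, nil
			}
		}
	}
	v, err := fetch(ctx)
	if err != nil {
		return v, err
	}
	if rdb != nil {
		if raw, err := json.Marshal(v); err == nil {
			rdb.Set(ctx, key, raw, ttl) // best-effort; real code logs failures
		}
	}
	return v, nil
}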

Discipline.

  • Cache scope must match data sensitivity. Tagging a per-org response with platform:* would serve org A's data to org B on the next read. Use the helper whose namespace matches what the response contains.
  • Write site lists every affected key explicitly. Don't try to "be clever" with SCAN/DEL at this layer.
  • If two cache scopes (Next.js unstable_cache and Redis here) cache the same data, both invalidations must fire from the write path — Next.js updateTag AND cache.Invalidate. Otherwise one layer serves stale.

Cache safety (load-bearing). Per-org cached endpoints inherit a security risk that doesn't exist before caching: the FIRST authorized B-member request populates org:B:* in Redis; an A-member request with URL=B then hits the cache and reads B's data. RLS only protected the first call. The fix lives at the route layer — see P47: URL ≡ Scope Guard.

Canonical wired example. organization.Service.GetByID wraps s.repo.FindByID with cache.Aside keyed by org:{id}:summary. Read on every Console org-detail page render and (in the future) every Clinic dashboard / Portal "my clinic" page. Invalidated from Service.Update via cache.Invalidate(orgSummaryKey(id)). The route group enforces URL ≡ scope.

Additional wired examples.

  • organization/service.go — ResolveBySlug and ResolveByDomain wrap their DB queries with cache.Aside. Hit by every proxy.ts resolution from every Next.js app on every cold-load. Platform scope because resolve responses aren't tenant-scoped — they're platform reads about which tenant a request belongs to. Invalidated from Update (slug/domain change).
  • consents/service.go — ListPurposesWithLatestForOrg wraps its two-query fanout with cache.Aside. Routes between platform-wide (orgID == nil) and per-org keys based on caller scope. Invalidated from PublishPlatformPurposeVersion (DEL platform key only — per-org entries refresh via TTL because broad invalidation would require SCAN/DEL).

Foundation status. Helper package + canonical example (getOrganization) + two additional wired examples (Resolve*, ListPurposesWithLatestForOrg) + URL≡scope middleware at Layer 1D. New domains decide via /new-domain whether to wrap their reads, and per-org domains MUST mount RequireURLOrgMatchesScope on their route group whether or not they cache today.

P46: Portal Hybrid Architecture (Server Shell + Client for Live Bits)

What. The Patient Portal is the highest-traffic app in the platform (target: 10k+ concurrent at Phase 2). Its rendering split is intentional: server-render the shell + slow-moving data; client-render the live interactive bits. This is not a per-feature decision — it's a per-data-type one, and it's the same answer for every Portal feature.

Why hybrid, not pure server or pure client.

| Pure server (RSC everything) | Pure client (SPA) | Hybrid (this) |
|---|---|---|
| ✓ Best TTFB (HTML streamed with data) | ✗ Worst TTFB (JS load → fetch waterfall) | ✓ Best TTFB for the shell |
| ✓ Auth handled server-side, no token plumbing | ✗ Token round-trips to every API call | ✓ Auth server-side for the shell |
| ✓ Cache shared across users via P42 + P45 | ✗ Each user has their own cache | ✓ Server reads share cache; client reads stay private |
| ✗ Re-renders fire fresh fetches on every nav | ✓ Stale-while-revalidate UX feels instant | ✓ Server for navs that change page; SWR for live updates |
| ✗ Bad fit for live data (slot grid, chat, presence) | ✓ Built for it | ✓ Client owns those bits |
| ✗ Server fleet scales with traffic | ✓ Server fleet stays small | ✓ Bounded server load |

The data-type decision matrix for Portal.

| Data type | Renders | Cache layer | Why |
|---|---|---|---|
| Page shell (layout, header, navigation) | Server (RSC) | unstable_cache me:{id}:shell (short TTL) | Fast TTFB, auth + tenant resolution server-side |
| Clinic branding / profile | Server (RSC) | unstable_cache org:{id}:branding + Redis org:{id}:branding | Identical for every patient of one clinic, both layers compose |
| Platform legal docs (terms, privacy) | Server (RSC) | unstable_cache platform:legal-* + Redis platform:legal-* | Highest cache hit ratio in the system |
| My profile, my appointments, my consents | Server (RSC) | unstable_cache me:{id}:* | Private to caller; per-user cache acceptable |
| Service catalog / available specialists | Server (RSC) | unstable_cache org:{id}:services + Redis | Slow-moving; rendered on browse pages |
| Available time slots (booking) | Client (SWR) | Browser cache + Redis at API (short TTL ~30s) | Race conditions matter; needs near-live freshness |
| Live chat / video presence | Client (WebSocket) | No cache, persistent connection | Must be live |
| Notifications / unread counts | Client (SWR or SSE) | Short browser cache + auto-poll | Stale-while-revalidate UX |
| In-progress booking / consent flow form state | Client (React state) | None | Local, ephemeral, never shared |
| Mutations (booking submit, exercise complete, consent grant) | Server actions | Invalidate matching tags via updateTag + cache.Invalidate | Single round-trip, both cache layers stay in sync |

Tooling decisions.

  • Server-rendered data — handled by the existing api-client + P42 + P45 stack. No new tooling.
  • Client-rendered live data — use SWR (swr package). Lightweight, focused on stale-while-revalidate, the simplest possible match for "show cached data immediately, then revalidate in background." Don't install until the first feature needs it; install with a SOUP entry at that point. Don't install React Query — it's heavier and the mutation/invalidation surface duplicates what server actions + updateTag already do for us.
  • Push channels (live chat, video presence, slot updates) — use WebSocket via Daily.co for video/chat (already in the stack), SSE (Server-Sent Events) from Core API for one-directional pushes (slot updates, notifications). Skip polling at scale; SSE is cheaper.

Discipline.

  • A new Portal feature that adds a "live" interaction starts with the question: what's the right cache layer? If the data type is in the matrix above, follow it. If not, add it to the matrix in the same PR — the matrix is the single source of truth for Portal data architecture.
  • Don't migrate server-rendered things to client just because "it would feel snappier." Server-rendering with cache hits IS snappy and shares cache across users; client-side fetches duplicate work per user.
  • Don't render live data server-side just because everything else is server. A 30-second-stale slot grid causes booking conflicts.

Cross-app application.

| App | Same hybrid? | Notes |
|---|---|---|
| Portal | Yes — the canonical case | Sized for 10k+ concurrent, biggest cache wins |
| Clinic | Same patterns, lighter scale | 10–50 staff per clinic; cache hit rate per org:{id}:* key very high; minimal client-side work needed (most actions are CRUD via server actions) |
| Console | Server everywhere | 1–5 superadmins. No client-side caching investment justified. What we have suffices. |

Foundation status. Pattern documented at Layer 1D. SWR install + first wired client-cache example land with the first Portal F-feature that needs them. Don't install or scaffold ahead.

P47: URL ≡ Scope Guard (Cache-Safe Per-Org Routes)

What. Every per-org route group mounts middleware.RequireURLOrgMatchesScope("id") (or the equivalent guard for body / query-string tenant IDs). For non-superadmin callers, the guard returns 403 when the URL parameter naming a tenant resource doesn't match the principal's CurrentOrganizationID (set from X-Organization-ID). Superadmins bypass.

Why it's load-bearing. Without the guard, a non-superadmin member of org A can request URL=B with X-Organization-ID=A and the handler still reaches the service. RLS hides the row at the DB layer and the request 404s — that's the protection on the first request. As soon as caching wraps the handler (P42 at Next.js, P45 at Core API), the FIRST authorized B-member request populates org:B:* in cache; an A-member request with URL=B then hits the cache and reads B's data. RLS only protected the first call. The cache propagates the leak to every subsequent caller of the same URL.

Where it applies. Every per-org route group whose URL parameter names a tenant resource that could be cached now or in the future. Apply preemptively even on uncached routes — cost is one map lookup per request, benefit is that adding caching later is mechanical. The same rule applies if a future endpoint receives a tenant ID via request body or query string instead of URL — extract the equivalent guard for that surface.

Why the route layer, not the cache layer. RLS protects DB rows. P42 / P45 sit above the DB. The only place to close the gap once and for all is before the cache is consulted — i.e., the route layer.

Required infrastructure.

  • middleware.RequireURLOrgMatchesScope(urlParam) — rejects URL/header mismatch with 403 for non-superadmin callers.
  • Route group declares which URL parameter names the tenant ID.
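A hedged sketch of the guard; the Scope type and the param accessor stand in for the real principal plumbing and router.

go
package middleware

import "net/http"

// Scope carries what the guard needs from the request principal.
type Scope struct {
	CurrentOrganizationID string // set from X-Organization-ID
	IsSuperadmin          bool
}

func RequireURLOrgMatchesScope(urlParam string, scopeFrom func(*http.Request) Scope, paramFrom func(*http.Request, string) string) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			s := scopeFrom(r)
			// Reject before any cache or handler work runs: this is what
			// keeps a later cache layer from replaying another tenant's
			// response. Superadmins bypass.
			if !s.IsSuperadmin && paramFrom(r, urlParam) != s.CurrentOrganizationID {
				http.Error(w, "forbidden", http.StatusForbidden)
				return
			}
			next.ServeHTTP(w, r)
		})
	}
}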

Cross-references. P42 and P45 — both rely on this guard for cache safety. CLAUDE.md "Frontend Performance Standards" lists URL ≡ scope as a load-bearing rule alongside P42–P46.

Foundation status. /v1/organizations/{id} route group mounts this middleware (Layer 1D). New per-org route groups MUST mount it — code review enforces.

P48: Server Data Flow to Client Components (No useState(serverProp))

What. Client components that receive server-fetched data as props MUST NOT seed local useState from those props without a sync mechanism. useState(initialFoo) only honours the initializer on mount; subsequent server re-renders pass new prop values that the component silently ignores. This breaks the entire P42/P45 + refresh() invalidation chain — the action fires, the cache invalidates, the server re-renders with fresh data, the client component keeps showing the snapshot from mount time.

The anti-pattern.

tsx
// ❌ Anti-pattern: trail freezes at mount, ignores subsequent re-renders.
function ConsentsTrail({ initialTrail }: Props) {
  const [trail, setTrail] = useState<Consent[]>(initialTrail);
  // setTrail is called from optimistic updates, never from refresh.
  // After router.refresh() / refresh() from `next/cache`, the page
  // re-renders server-side with new initialTrail, but useState
  // returns the original mount-time value.
}

The fix — three options, in order of canonical-ness.

  1. Use the prop directly when there's no client-side mutation between renders:
    tsx
    function Component({ items }: { items: Item[] }) {
      return <ul>{items.map(...)}</ul>;
    }
  2. useOptimistic (React 19) for true optimistic updates — showing the change before the server confirms, with automatic revert on failure:
    tsx
    const [items, addOptimistic] = useOptimistic(serverItems, (current, update) => [...]);
  3. useServerSyncedState from @workspace/ui/hooks/use-server-synced-state — when the component needs writable local state for post-action edits AND must stay reactive to server re-renders:
    tsx
    const [items, setItems] = useServerSyncedState<Item[]>(serverItems);

The hook in option (3) wraps the canonical React-docs pattern ("Storing information from previous renders" — hold the previous server value in state, compare during render, sync synchronously if it changed). No useEffect, no flicker, no react-hooks/set-state-in-effect lint trip. Drop-in useState API.

Don't roll your own useState + useEffect sync — it works, but trips the lint rule and adds a one-render flicker.

Live discovery. The Portal consents-trail.tsx had useState<Consent[]>(initialTrail). The user accepted a new consent version via the re-consent modal — server action invalidated correctly, router.refresh() fired, server re-rendered the page with the v2-accepted trail — but the client component kept showing v1 because useState ignored the new initialTrail prop. Fixed by switching to useServerSyncedState (the hook's foundational use case).

Why this is foundation-level. It silently neutralises every other invalidation pattern we shipped. A future agent reading P42/P45/refresh() and applying them faithfully on the server can still produce the "stale UI after action" symptom if the receiving client component freezes its state at mount. The contract has to be enforced at both ends.

Discipline.

  • Server-rendered data passed as a prop is read-only data flow. Never copy it into useState.
  • Local state on a client component is for things the USER controls (form inputs, UI toggles, dialog open state) — not for server data.
  • Optimistic updates use useOptimistic, not useState.
  • Code review: any useState(propName) where propName is server-fetched data is a bug.

Cross-references. P42 — the cache invalidation chain only works end-to-end if the receiving component honours prop changes. CLAUDE.md "Frontend Performance Standards" calls this out as the receiving end of the invalidation contract.


Capability & Integration Architecture

P50: Capability Convention

What. A "capability" is the platform-internal name for a bounded responsibility — send an email, render a PDF, run an AI completion, deliver a webhook — defined as one Go interface with a switchable implementation. Every capability composes the same cross-cutting concerns (permission, quota, provider resolution, audit, metering, error classification) through a small set of wrap helpers in internal/core/capabilities. The convention is that the capability call site picks one of four wrap helpers — one per implementation-strategy category — and the helper's middleware stack handles the rest.

The four implementation-strategy categories. The category is decided by who pays for the call and whether it crosses a trust boundary, not by the provider's name. See glossary.md → Integration Categories for the user-facing category definitions.

| Category | Examples | Wrap helper |
|---|---|---|
| Cat A metered | email (SES), SMS (Twilio), video (Daily.co), AI text (Anthropic) | WrapMeteredProvider |
| Cat A unmetered | auth (Clerk), storage (S3) | WrapProvider |
| Cat C outbound | webhook delivery to clinic-configured URL | WrapOutbound |
| Internal Library | PDF render, signing, encryption | WrapInternal |

(Cat B Connected Account, Cat D Inbound Webhook, Cat E Internal Event do not appear in this matrix because they are not synchronous capability calls — Cat B credentials are platform-resolved on a separate per-org code path, Cat D is a route handler the dispatcher delivers INTO, Cat E is the events.Bus and has its own pattern P28.)

The locked wrapper stack ordering. For Cat A metered, top-down (outer → inner):

classifyErrors → meterAfterSuccess → requirePermission → enforceQuota → resolveProvider → auditCall → inner provider call

The ordering is load-bearing:

  • requirePermission first — block unauthorised callers before doing any work.
  • enforceQuota second — block over-quota callers before paying for provider resolution.
  • resolveProvider third — every Cat A call resolves through 1C.2's platform_service_providers table per (org, capability).
  • auditCall BEFORE the provider call — failed calls are auditable with status code; the audit row is committed even when the provider call rejects.
  • Inner call.
  • meterAfterSuccess — failures don't burn quota.
  • classifyErrors outermost — every error returned to the caller is one of the typed sentinels (ErrTransient, ErrPermanent, ErrProviderUnavailable, ErrPermissionDenied, ErrQuotaExceeded, ErrUnauthenticated).

Cat A unmetered drops enforceQuota and meterAfterSuccess. Cat C outbound drops everything except auditCall + classifyErrors (permission is checked once at subscription create time; quota and resolve don't apply to outbound webhooks). Internal Library drops everything except classifyErrors (permission/audit happen at the calling layer above).

Functional composition, not struct decorators. Helpers wrap a capabilities.Op[Req, Resp] (a func(ctx, req) (resp, error)) and return a new Op of the same shape. The stack is fixed per category — no runtime customisation needed; struct decorators would speculate against unknown future flexibility.
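
To make the shape concrete, a minimal sketch: Op matches the description above, while the gate internals and the principal ctx accessor are illustrative assumptions, not the foundation's actual code.

go
// capabilities.Op is the shape every wrap helper consumes and returns.
type Op[Req, Resp any] func(ctx context.Context, req Req) (Resp, error)

// requirePermission is one gate in the stack: check, then delegate inward.
func requirePermission[Req, Resp any](perm principal.Permission, next Op[Req, Resp]) Op[Req, Resp] {
    return func(ctx context.Context, req Req) (Resp, error) {
        var zero Resp
        subj, ok := principal.SubjectFromContext(ctx) // assumed ctx accessor
        if !ok || !subj.HasPermission(perm) {
            return zero, ErrPermissionDenied
        }
        return next(ctx, req)
    }
}

// A Cat A metered op composes to match the locked ordering:
//
//  op := classifyErrors(
//      requirePermission(perm,
//          enforceQuota(
//              resolveProvider(
//                  auditCall(
//                      meterAfterSuccess(inner))))))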

Principal-type-agnostic. The wrapper stack treats all principal types identically: humans, agents, service-accounts (Cat F), and the system principal flow through the same principal.Subject.HasPermission / Subject.Limit reads. Audit attribution carries actor_id + actor_type; permissions / quotas / metering operate on org scope without consulting actor type. This property is essential — accidentally hardcoding "human" assumptions in any wrapper would break Cat F service-account API calls and autonomous-agent calls when those flows light up. See glossary.md → Principal-type-agnostic primitive.

Test-double convention: Fake{Capability} in same package as interface. Every capability ships a hand-written fake (sync.Mutex + captured slice + FailNext field for failure injection) in the production package — NOT _test.go — so the integration test suite can compose a real consumer against the fake. notify.FakeChannel is the canonical reference; the capabilities README walks the shape. Mocks (gomock, mockery) are explicitly rejected — verbose, brittle, and a field rename becomes a multi-file edit.
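
The shape, sketched for a hypothetical sender capability (notify.FakeChannel is the canonical reference; every name below is illustrative):

go
// Hypothetical capability types; the Fake lives beside the interface in the production package.
type SendRequest struct{ To, Body string }
type SendResult struct{ ID string }

type FakeSender struct {
    mu       sync.Mutex
    Sent     []SendRequest // captured calls, asserted on by integration tests
    FailNext error         // when non-nil, the next call returns it once
}

func (f *FakeSender) Send(ctx context.Context, req SendRequest) (SendResult, error) {
    f.mu.Lock()
    defer f.mu.Unlock()
    if err := f.FailNext; err != nil {
        f.FailNext = nil // one-shot failure injection
        return SendResult{}, err
    }
    f.Sent = append(f.Sent, req)
    return SendResult{ID: fmt.Sprintf("fake-%d", len(f.Sent))}, nil
}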

Foundation status.

  • The four wrap helpers (WrapMeteredProvider, WrapProvider, WrapOutbound, WrapInternal) ship at 1C.1 with permission and quota gates fully wired against principal.Subject and error classification fully wired.
  • resolveProvider is a no-op forwarder until 1C.2 ships the platform_service_providers resolver and registers it via capabilities.SetResolveFunc.
  • meterAfterSuccess is a no-op forwarder until 1C.7 ships the metering store and registers it via capabilities.SetMeterFunc.
  • auditCall is a hook the capability supplies via MeteredProviderOptions.AuditFunc — the convention exposes the seam, individual capabilities (e.g. ai.text in 1C.8) shape their own audit row through the existing audit.Recorder.
  • The CI guard cmd/check-capabilities enforces the convention at every wiring site (foundation skeleton allow-lists notify; 1C.2 tightens once Cat A capabilities migrate to the resolver).

The notify.Channel exception (load-bearing carve-out). notify.ChannelAdapter follows the capability shape (interface + impl + Fake) but is not wired through the wrap helpers. The notification dispatcher is an async-outbox pattern, not a synchronous capability call:

  • Producer side: notify.Service.Send is called from a request handler that already cleared its own permission and quota gates; running them again inside the channel adapter would double-gate the same authorisation.
  • Consumer side: the dispatcher polls notification_deliveries on a background goroutine — no Subject in ctx, no per-call audit row (the rows themselves are the audit trail), and the retry / dead-letter state machine handles error classification with semantics specific to outbox delivery.
  • Meter side (future): platform email volume is metered against the dispatched delivery count, hooked into the dispatcher's success path directly — not into the producer's Send call.

This is the only exception to P50, and any future async-outbox capability (e.g. SMS dispatcher, future webhook outbox redesign) inherits the same carve-out — synchronous capability calls go through the wrap helpers; asynchronous outboxes do not.

Cross-references. P28 Internal Event Bus — events.Bus is the in-process pub/sub for Cat E and is orthogonal to capabilities (events are publish-and-forget; capabilities return values). P29 Webhook Delivery — Cat C outbound delivery is the reference consumer for WrapOutbound once 1C.4 ships. P31 Connected Account Credentials per Org — Cat B is per-org credentialed and carved out from Cat A (different resolution path). P38 Entitlement-Based Regulatory Gating — RequireOrgEntitlement and the wrap helpers compose at the route layer (entitlement gates the route, capability wraps the call).

P51: Code-First Registries with Generated Documentation

A small set of values defined in code with multiple documentation consumers should follow a single pattern: declare in code with rich metadata, expose a runtime registry, dump the registry as JSON / Markdown, and have the docs include the dump rather than duplicate the list. The registry is the source of truth; every other view is read-only.

When to adopt. A registry is the right fit when (a) the values are defined in code anyway (event types, permission codes, entitlement codes, capability names) and (b) the values surface in more than one documentation context (architecture overview, feature spec, generated webhook docs, automation trigger UI, external API contract). Two consumers is enough — one consumer can hand-edit; three almost always drift.

The shape (1C.3 reference implementation: events).

  1. Code-side: each owner declares its entries in a colocated file (internal/core/{domain}/events.go) and registers via init(). The registry package (internal/core/events) owns Register, Lookup, All, and the entry struct (EventDef{Name, ResourceType, Description, Layer, Payload, DeprecatedAt, ReplacedBy}).
  2. Dump binary: cmd/dump-{registry-name}-registry blank-imports the owner packages so init() fires, then serialises the runtime registry. Two formats by convention: -format=json for tooling and -format=md for docs.
  3. Generated artifact: committed at apps/docs/{section}/_generated/{registry}-catalog.md. Marked with an "Auto-generated; do not edit" header. Regenerated via make {registry}-docs.
  4. Doc include: VitePress's <!--@include: ./_generated/{registry}-catalog.md--> injects the generated table into the narrative doc. The narrative explains the conventions; the table is the dump.
  5. CI guard: cmd/check-{registry}-registry validates two invariants: (a) every code-side declaration site has a registry entry (no drift between declaration and registration), and (b) the committed generated file matches the freshly-generated form (no drift between registry and docs). Wired into make check.
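
A condensed sketch of step 1, with simplified signatures (the real EventDef carries the full field set above; the owner package and event name are hypothetical):

go
// internal/core/events — the registry owns Register / Lookup / All.
var registry = map[string]EventDef{}

func Register(def EventDef) {
    if _, dup := registry[def.Name]; dup {
        panic("events: duplicate registration: " + def.Name) // fail fast at boot
    }
    registry[def.Name] = def
}

func All() []EventDef { // stable order so the -format=md dump diffs cleanly
    defs := make([]EventDef, 0, len(registry))
    for _, d := range registry {
        defs = append(defs, d)
    }
    sort.Slice(defs, func(i, j int) bool { return defs[i].Name < defs[j].Name })
    return defs
}

// A hypothetical owner package declares and registers via init():
func init() {
    events.Register(events.EventDef{
        Name:         "appointment.created",
        ResourceType: "appointment",
        Description:  "Emitted after an appointment row commits.",
    })
}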

Why "registered without a publisher" is OK. A domain may register an event whose first publisher hasn't shipped yet, or an entitlement code whose first plan hasn't been seeded. The registry catalogs intent; the publisher / consumer ships when the feature ships. The reverse — a publisher / consumer with no registry entry — is drift and fails the build. Same direction as P28's events catalog.

Future adopters. Likely fits — per-org permission codes (currently hand-listed in apps/docs/reference/rbac-permissions.md), entitlement catalog (currently a hand-edited grid in tiers-and-subscriptions.md), automation triggers (will need this when F8 ships), webhook events (1C.4 reuses the same registry as 1C.3). Don't preemptively migrate any of these in this PR — adopt when the next docs-drift bug surfaces. Foundation discipline applies: pattern is documented; concrete adoption rides the work that needs it.

Cross-references. P28 Internal Event Bus — first adopter; defines events.Type as the registered name. P39 Column-Level Data Classification — same shape (registry + CI guard) at the column-classification layer; classification predates this pattern but follows the same discipline.


P52: Inbound Webhook Convention

Every Cat D inbound webhook handler — /webhooks/{provider} for any provider RestartiX receives signed pushes from (Stripe, Daily.co, Clerk Svix, SES SNS, Google Calendar push, …) — runs the same five-step flow. Verify, dedup, mutate, mark, emit. Foundation 1C.6 ships the dedup table + repo helpers (internal/core/inboundwebhooks/dedup); each provider's verify + parse + handler lives in internal/integration/{provider}/inbound/ and lands with the F-tier consumer that needs it.

The flow (locked, every handler):

  1. Verify the signature using the per-provider helper. Return 401 Unauthorized on mismatch. The helper exposes Verify(req *http.Request, secret []byte) error (or equivalent shape; signature schemes don't share enough structure to abstract). No fall-through, no logging the raw signature header — these are forgery attempts and the audit row is enough.
  2. Dedup-check by calling dedup.WasProcessed(ctx, provider, eventID). If true, return 200 OK immediately — the provider treats this as ack and stops retrying. No state mutation, no event emission, no audit row. Re-deliveries are an expected protocol behaviour, not an anomaly.
  3. Mutate state via the domain service, inside the request's tx. The inbound router opens one even though there's no principal context — it attaches an AdminPool tx as the default for AdminPool-shaped writes; per-provider handlers that need to update tenant-scoped tables resolve the org from the payload and bind a principal-scoped tx the way the standard request middleware would.
  4. Mark-processed by calling dedup.MarkProcessed(ctx, provider, eventID) in the same tx as the mutation. Both commit together — if step 3 rolls back, dedup does too, and the provider's next retry re-enters at step 2 cleanly.
  5. Emit a Cat E Internal Event via the 1C.3 events registry describing the inbound effect (e.g. appointment.recording_available, payment.received, auth.user_synced). The standard fan-out (audit, notification dispatcher, outbound webhook dispatcher, automations engine) consumes the event downstream — handlers do not call audit/notify/outbound directly.
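
Condensed into a sketch for a hypothetical provider "acme": the tx wrapper, bus call, and Event fields are assumptions, while Verify, Parse, and the dedup helpers follow the shapes described above.

go
func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    ctx := r.Context()
    body, _ := io.ReadAll(r.Body)
    r.Body = io.NopCloser(bytes.NewReader(body)) // let Verify re-read the raw body

    if err := inbound.Verify(r, h.secret); err != nil { // 1. verify signature
        w.WriteHeader(http.StatusUnauthorized)
        return
    }
    evt, err := inbound.Parse(body)
    if err != nil {
        w.WriteHeader(http.StatusBadRequest)
        return
    }
    if done, _ := h.dedup.WasProcessed(ctx, "acme", evt.ID); done { // 2. dedup-check
        w.WriteHeader(http.StatusOK) // ack so the provider stops retrying
        return
    }
    err = h.inTx(ctx, func(ctx context.Context) error { // hypothetical tx wrapper
        if err := h.svc.Apply(ctx, evt); err != nil { // 3. mutate via the domain service
            return err
        }
        return h.dedup.MarkProcessed(ctx, "acme", evt.ID) // 4. mark, same tx
    })
    if err != nil {
        w.WriteHeader(http.StatusInternalServerError) // provider retries; re-enters at step 2
        return
    }
    h.bus.Publish(ctx, "payment.received", evt) // 5. emit the Cat E internal event
    w.WriteHeader(http.StatusOK)
}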

Why dedup before mutate, not after. A pre-mutate check makes the cheap-path (re-delivery) cheap. A post-mutate-only check would let two concurrent re-deliveries each open a tx, each mutate, and rely on the dedup PK collision at COMMIT to roll one back — wasted work, doubled audit churn, and the mutation might be expensive (PDF re-render, S3 fetch). Pre-check + atomic-mark-in-same-tx is the correct shape.

Why all handlers, not "where it matters." Every provider retries on non-2xx — not retrying is the protocol violation, not the norm. A handler that processes the same event_id twice will produce double state mutations + duplicate Cat E emissions + duplicate downstream notifications. There's no provider where "skip dedup, it'll be fine" holds. Foundation discipline: invariant in convention + CI guard, no opt-out without a documented exception.

The CI guard (cmd/check-inbound-webhooks). Walks every handler reachable from the /webhooks/ router group and asserts: (a) calls a *Verify function before processing, (b) calls dedup.WasProcessed before processing, (c) calls dedup.MarkProcessed after the state update, (d) emits at least one event via the events bus on success. With zero handlers today the guard is a no-op pass; the first F-tier consumer is the first non-trivial run. Allow-list for documented exceptions in cmd/check-inbound-webhooks/allowed-exceptions.md — initially empty.

Per-provider package shape (locked).

internal/integration/{provider}/inbound/
  ├── verify.go    // Verify(req *http.Request, secret []byte) error
  ├── parse.go     // Parse(body []byte) (Event, error)  -- typed event extraction
  └── handler.go   // mounted on /webhooks/{provider}; runs the standard flow

Mount under a sibling router group at /webhooks/ — auth-naked from the JWT side because verification happens at step 1. CSRF doesn't apply (no session). Rate-limited per-provider via 1A.13's existing infrastructure (default inbound_webhook policy: 100 req/sec/provider; sustained breaches indicate runaway loops or attacks).
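
Sketched, with hypothetical middleware and handler names:

go
r.Route("/webhooks", func(r chi.Router) {
    r.Use(ratelimit.Policy("inbound_webhook")) // 1A.13 policy: 100 req/sec/provider
    r.Post("/stripe", stripeInbound.Handler)   // each handler runs verify → dedup → mutate → mark → emit
    r.Post("/daily", dailyInbound.Handler)
})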

Per-org inbound tokens. For Cat B providers that push notifications (Google Calendar, Microsoft 365), each per-org connection registers a push channel; the provider echoes back a token we generated. Token lives in organization_integrations.config.push_channel.token (GIN-indexed JSONB scan) — not clinic-facing, purely internal routing. Distinct from Cat C signing secrets (clinic-managed, client-visible at create time).

Cross-references. P28 Internal Event Bus — step 5 emits via the bus. P41 Event-Shaped Tables Partition Monthly — inbound_webhook_dedup is partitioned per the rule. P39 Column-Level Data Classification — dedup columns classified system_metadata. Reference: Inbound Webhook Guide — engineer-facing how-to.


P53: AI Capability Provenance and Variable-Cost Metering

Every AI provider call records (a) one audit_log row + one audit_ai_provenance row in the same transaction so the FK + lifecycle hold without coordination; (b) one or more usage_records rows whose cost_cents snapshots the active row in ai_model_pricing_history AT CALL TIME so closed-period billing reconstruction stays accurate even after pricing edits; (c) a metering reservation reconciled via the deferred Settle / Cancel handle returned from MeterStore.BeginReservation — the wrap layer reserves a max-cost ceiling upfront, the inner runs the call, the inner Settles with actual usage entries (split per direction for input/output token billing) on success or Cancels on partial-failure paths.

Why same-tx for audit + provenance. audit_ai_provenance.audit_log_id is FK'd to audit_log(id, created_at). Writing them in separate transactions creates two failure modes: (1) audit commits, provenance fails → audit row carries no provenance evidence even though the call ran AI; (2) audit rolls back, provenance was never written → still consistent but only because the path bailed. Same-tx eliminates window (1) entirely. The audit.RecordWithProvenance API is the only path that writes audit_ai_provenance; the SQL function audit_log_insert returns the inserted row's (id, created_at) via OUT parameters so the recorder threads them into the provenance INSERT on the same tx.

Why a deferred reservation, not Reserve→Record. The WrapMeteredProvider shape (Reserve known units → inner → Record same units) works for fixed-cost capabilities (one email = one unit). It breaks for AI because actual cost is unknown until the provider responds — LLM streaming reports input + output token counts at Stream.Close(), transcription reports actual audio_seconds at result time, embeddings reports tokens consumed in the response body. The deferred shape (BeginReservation(maxUnits) → inner → Settle(actuals) or Cancel()) reserves a ceiling upfront so a runaway-cost ceiling exists on the quota gate, then refunds the unused portion at settle so the org's monthly counter reflects real usage. The wrap layer auto-Cancels on inner failure so a panicking provider doesn't leak quota; the inner is responsible for Settle on success.

Per-direction SettleEntry for split pricing. LLM providers price input tokens and output tokens at different rates; the same usage event therefore writes two usage_records rows — one for input_tokens (cost = input_tokens × cost_per_input_unit_cents) and one for output_tokens (cost = output_tokens × cost_per_output_unit_cents). The Reservation.Settle API takes a list of SettleEntry so the inner can emit both in one call without re-opening the reservation. Other capabilities (embeddings, transcription) emit one entry; the contract handles both shapes uniformly.

Cost snapshots, not lookups. usage_records.cost_cents stores the integer cents the platform charged AT CALL TIME, computed by the AI capability impl just before Settle by reading the active row in ai_model_pricing_history. Foundation's aimodels.Repository.LookupActivePricing(ctx, modelID, at) is the canonical lookup. Storing the snapshot rather than recomputing at billing time means closed-period invoices reconstruct identically months later, even after pricing changes. The historical pricing chain in ai_model_pricing_history (partial unique on (model_id) WHERE effective_to IS NULL) keeps the trail.
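
From inside a hypothetical LLM impl the pieces compose roughly as follows; BeginReservation / Settle / Cancel and LookupActivePricing are the contract above, while SettleEntry and the pricing field names are assumptions.

go
// Reserve a ceiling, call the provider, settle actuals per direction with snapshotted pricing.
res, err := store.BeginReservation(ctx, orgID, "ai.llm", maxTokens)
if err != nil {
    return nil, err // quota gate: over-ceiling callers stop before any provider spend
}
out, err := callProvider(ctx, req)
if err != nil {
    res.Cancel(ctx) // in production the wrap layer auto-Cancels on inner failure
    return nil, err
}
pricing, err := aiRepo.LookupActivePricing(ctx, modelID, time.Now())
if err != nil {
    res.Cancel(ctx)
    return nil, err
}
err = res.Settle(ctx, []metering.SettleEntry{ // two usage_records rows, split pricing
    {Unit: "input_tokens", Quantity: out.InputTokens,
        CostCents: out.InputTokens * pricing.CostPerInputUnitCents},
    {Unit: "output_tokens", Quantity: out.OutputTokens,
        CostCents: out.OutputTokens * pricing.CostPerOutputUnitCents},
})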

The WrapMeteredAI helper. Mirror of WrapMeteredProvider minus the atomic Reserve+Record middleware — replaced with meterDeferred which Begins a Reservation, attaches it to ctx via capabilities.ReservationFromContext, runs inner, auto-Cancels on inner failure. The five AI capability seams (internal/core/ai/{llm,embeddings,transcription,vision,classification}) wrap their inner ops with this helper; foundation ships interface declarations + Fakes only, so the metering contract is locked before the first F-tier consumer wires real Anthropic / OpenAI / Voyage / Deepgram providers.

The CI guard (cmd/check-ai-models). Walks every package in the repo and asserts: any non-test file that imports one of the five AI capability seams MUST also import internal/core/domain/aimodels so it has the registry + pricing-snapshot path available. Foundation has zero such consumers (Fakes live in *_test.go); first F-tier AI feature is the first non-trivial run. Allow-list initially empty; entries require an architectural rationale.

Cross-references. P50 Capability Convention — WrapMeteredAI is the AI-shaped sibling. Foundation 1C.7 metering primitives (Reserve / Refund / Record) are the fixed-cost cousins. Reference: AI Models Registry — engineer-facing how-to.


P54: Pessimistic Edit Locks

Two staff editing the same record (a treatment plan, a form draft, an org settings page) don't trample each other's work — the first to open a detail page acquires a TTL'd Redis lock; subsequent openers see a 409 with the holder principal id and a "Take over" affordance. Per-org Redis keys, per-principal identity, explicit takeover, audited. Generalises the appointment-slot Redis hold pattern as a reusable internal/core/locks/ primitive consumed by every detail-page surface.

The flow (locked, every detail page that opts in):

  1. Detail page mount. Frontend hook calls POST /v1/organizations/{id}/locks/{resource}/{resource_id}. On free → 201 with {holder, acquired_at, expires_at, self: true}. On held by another → 409 with {holder_principal_id, acquired_at} so the UI renders the read-only banner without a follow-up GET.
  2. Heartbeat. Hook PATCHes the same path every ~45s. Lua-scripted: extends TTL only if the current value's principal_id still matches the caller. Mismatch → 409 lock_lost (typical after a takeover by another staff member). The TTL itself is 120s — one dropped heartbeat is recoverable, two consecutive drops let the lock auto-release.
  3. Mutate. PATCH/POST/DELETE on the underlying resource is gated by locks.RequireLockHeld(svc, resourceType, paramName) middleware. Caller-doesn't-hold → 409 lock_held_by_other with current holder. The lock check is cheap (one Redis GET) and runs after the permission gate, before the handler.
  4. Release. Hook DELETEs on unmount. Idempotent — succeeds whether the caller held the lock, never held it, or lost it to a takeover. The TTL is the actual safety net; release is a UX optimisation. Best-effort navigator.sendBeacon on pagehide for tab-close cleanup.
  5. Takeover. Second user clicks "Take over" → POST acquire with {take_over: true}. The Redis key is overwritten, a LOCK_TAKEOVER audit row is emitted with prior + new holders. Original holder's next heartbeat fails with lock_lost; their UI surfaces a recovery banner ("Your edit lock was taken — your unsaved changes are below, copy what you need").

Redis key shape. lock:{org_id}:{resource_type}:{resource_id}. Org-scoped namespacing keeps cross-tenant blast radius zero by construction — even if a future change ever produces non-UUID IDs. Value is JSON {principal_id, acquired_at}. Default TTL 120s, default heartbeat cadence 45s (well below TTL with safety margin for one dropped heartbeat).
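
The heartbeat's compare-and-extend, sketched against go-redis (the Lua body is illustrative; the key and value shapes follow the convention above):

go
var heartbeatScript = redis.NewScript(`
local raw = redis.call('GET', KEYS[1])
if not raw then return 0 end
if cjson.decode(raw).principal_id ~= ARGV[1] then return 0 end
redis.call('PEXPIRE', KEYS[1], ARGV[2])
return 1
`)

// Heartbeat extends the TTL only while the caller still holds the lock.
func (s *Service) Heartbeat(ctx context.Context, key, principalID string) error {
    held, err := heartbeatScript.Run(ctx, s.rdb, []string{key}, principalID, 120_000).Int()
    if err != nil {
        return err
    }
    if held == 0 {
        return ErrLockLost // surfaces to the hook as 409 lock_lost
    }
    return nil
}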

Why per-principal, not per-tab/session. Same staff member with two tabs open does NOT lock themselves out — both tabs heartbeat against the same key, both can release. The version column on the underlying table catches self-races between tabs (defense-in-depth). Per-tab tokens add a multi-holder data structure for marginal benefit; the version column is cheaper.

Why TTL'd Redis, not a row in Postgres. Lock state is high-churn (heartbeat every 45s) and ephemeral (auto-expires within 2 min of tab close). A Postgres row would burn write IO on every heartbeat and need a separate sweeper to clean abandoned locks. Redis is shared across the Core API fleet (P44 pgbouncer transaction-mode + P45 cache-aside compose with the lock store on the same Redis), atomic via Lua scripts, native TTL.

Why audit takeover, not acquire/release/heartbeat. Takeover is the security-significant event — a state transition where one staff member's edit session is forcibly ended by another. Acquire/release/heartbeat are operational metadata covered by the CLAUDE.md "operational-metadata bumps are exempt" rule (auditing them generates noise without forensic value — an attacker who compromised a session would generate identical bumps). Takeover audit row uses the LOCK_TAKEOVER action verb (extending the open-ended audit.Action const) with EntityType: "edit_lock" and a Before/After payload carrying both holders.

Why lock.write_blocked is logged but not audited. The mutate-guard middleware can't write an in-tx audit row — the request tx rolls back on 409, taking the audit row with it. Today the middleware logs at slog.Warn with caller_principal_id, holder_principal_id, org_id, resource_type, resource_id — enough forensic signal for log-aggregation queries. Promoting to a proper audit row waits on a recorder.RecordOutOfTx API that opens an isolated AdminPool tx (similar to the failure-path audit middleware's recordFailure).

Defense-in-depth: keep version columns. Locks prevent the COMMON case of concurrent edit. They can race in rare paths (Redis hiccup, expired-mid-save, takeover during in-flight request). Lockable tables also carry a version INTEGER NOT NULL DEFAULT 1 column that the repo checks on UPDATE — that's the correctness layer. Lock = UX, version = correctness. The version column is added in the same PR that adopts the lock for a given resource type (in-place edit of the original CREATE TABLE migration per CLAUDE.md "Migrations are editable pre-production").

Resource-type registry. Domains that opt in register at process boot:

go
// services/api/internal/core/domain/treatmentplans/locks.go
func init() {
    locks.RegisterResource(locks.ResourceDef{
        Type:        "treatment_plan",
        Permission:  principal.PermTreatmentPlansUpdate,
        Description: "Treatment plan editor (clinical, multi-tab)",
    })
}

The HTTP handler validates the URL resource segment against the registry — unknown types return 400 (prevents lock-bombing arbitrary strings as a storage growth attack). The associated permission code is checked on acquire; staff who can't edit a resource can't lock it either.

Mounting on a per-resource detail page. Two pieces wire the resource into the lock layer:

  1. Register the resource (above).
  2. Apply the mutate-guard middleware on PATCH/POST/DELETE routes:
go
r.Route("/treatment-plans/{id}", func(r chi.Router) {
    r.With(
        middleware.RequirePermission(principal.PermTreatmentPlansView),
    ).Get("/", h.HandleGet)

    r.With(
        middleware.RequirePermission(principal.PermTreatmentPlansUpdate),
        locks.RequireLockHeld(s.locksService, "treatment_plan", "id"),
    ).Patch("/", h.HandleUpdate)
})

Reads (GET/HEAD) do not need guarding — the lock is about edit serialization, not visibility.

Frontend integration. useEditLock from @workspace/ui/hooks/use-edit-lock is action-agnostic (no Next.js coupling); each app provides server-action wrappers around the api-client methods (acquireEditLock, heartbeatEditLock, releaseEditLock, getEditLock). The hook surfaces a typed status (acquiring | held_by_self | held_by_other | lost | error) the form switches on. <EditLockBanner /> from @workspace/ui/components/edit-lock-banner is the read-only / lost-recovery affordance.

The CI guard. Skipped at foundation. Rationale: the lock registry is advisory metadata, not a protocol invariant — a missing registration is an unmounted lock, which is observable at integration time. The closest precedent guard (cmd/check-inbound-webhooks) enforces a 5-step protocol invariant where missing steps cause silent dedup bugs; locks have no comparable invariant. Reconsider if a Layer 2+ consumer (metering, observability) depends on the registry being complete.

Cross-references. P10 Audit Logging — LOCK_TAKEOVER rows. P44 Connection Pooling via pgbouncer — same Redis instance hosts the lock store + rate limit + cache-aside. P45 Redis-Backed Query Cache — composes on the same Redis. P47 URL ≡ Scope Guard — every per-org route group wrapping the lock handler enforces this. Reference: Edit Locks Guide — engineer-facing how-to for adopting the primitive in F-tier.

P55: Console Break-Glass Primitive (Per-Org Permission OR Elevation)

The Console acts on tenant data on behalf of platform staff (superadmin, support_engineer). Every Console-side write risks the controller-vs-processor failure mode the break-glass pattern exists to prevent: clinic = controller, platform = processor; cross-tenant writes without elevation amount to joint controllership. Per CLAUDE.md hard rule: identifiable cross-tenant access goes through the break-glass pattern (per-org scope, time-bound, justified, always-on clinic notification, audited).

The same routes serve both callers (clinic admins from the Clinic app, platform staff from Console) — duplicating routes per caller doubles the surface and drifts. P55 lands a single primitive that admits both paths on one route and a Console-side hook that all elevation-aware UI consumes.

The five pieces:

  1. middleware.RequirePerOrgPermissionOrBreakGlass(permission, scope, svc, paramName) — chi middleware that splits by principal type. Tenant principals (PlatformRoles empty) pass via Subject.HasPermission(permission) — owner short-circuit honoured. Platform principals require an active breakglass.Session at the named scope; the middleware binds the audit GUC so every row written downstream carries break_glass_id. Returns 403 break_glass_required when no session exists, 410 break_glass_expired when a session lapsed (so the frontend knows to prompt re-elevation).

  2. breakglass.ScopeOrgManagement — broad scope covering staff, member, role, and org-settings writes. One elevation session covers a related task without re-prompting per click; the audit_log.action row records the specific action so the session's blast radius is reconstructable. Held by support_engineer for routine clinic-recovery work; superadmin holds it unconditionally.

  3. <BreakGlassSessionProvider sessions={...} orgId={id}> + useBreakGlassSession(scope) — Console layout fetches the calling principal's active sessions for the surrounding org once via getActiveBreakGlassSessionsForOrg(orgId) (server-only, React.cache-shared between layout + descendant pages), threads them through a context, and exposes per-scope state to descendant client components. P48-compliant via useServerSyncedState — the prop reseeds when a server action invalidates the layout.

  4. <RequireBreakGlass scope={...} orgName={...}> — client wrapper that conditionally renders children when an active session exists at the named scope, surfaces the elevation prompt otherwise. Defense-in-depth alongside the backend gate; UX win of "don't show the action button when the action will 403."

  5. withBreakGlass(fn) server-action wrapper — surfaces backend break_glass_required / break_glass_expired errors as a typed sentinel { needsElevation: { code } } so server actions return them as part of useActionState for the client UI to react to (open the modal, retry on session) without an unhandled exception.
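
A sketch of piece 1's split; the session-lookup and GUC-binding helpers are assumptions, the branch structure is the contract:

go
func RequirePerOrgPermissionOrBreakGlass(perm principal.Permission, scope breakglass.Scope,
    svc *breakglass.Service, paramName string) func(http.Handler) http.Handler {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            subj := principal.SubjectFromContext(r.Context()) // assumed ctx accessor
            if len(subj.PlatformRoles) == 0 { // tenant principal: plain per-org permission
                if !subj.HasPermission(perm) {
                    http.Error(w, "permission_denied", http.StatusForbidden)
                    return
                }
                next.ServeHTTP(w, r)
                return
            }
            // Platform principal: an active session at the named scope is required.
            sess, err := svc.ActiveSession(r.Context(), subj.ID, chi.URLParam(r, paramName), scope)
            switch {
            case errors.Is(err, breakglass.ErrSessionExpired):
                http.Error(w, "break_glass_expired", http.StatusGone) // prompt re-elevation
                return
            case err != nil || sess == nil:
                http.Error(w, "break_glass_required", http.StatusForbidden)
                return
            }
            // Bind the audit GUC so every downstream row carries break_glass_id.
            next.ServeHTTP(w, r.WithContext(svc.BindAuditSession(r.Context(), sess.ID)))
        })
    }
}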

The flow (locked):

  1. Layout fetch. apps/console/app/(dashboard)/clinics/[id]/layout.tsx calls getActiveBreakGlassSessionsForOrg(id) and renders <BreakGlassSessionProvider sessions={...}>. The fetch is React.cache-shared, so descendant server components can call the same helper without an extra round-trip.
  2. Action UI. A Console component that triggers a write either wraps itself in <RequireBreakGlass scope="org_management" orgName={...}> (fallback shows <ElevationModal>) or calls the server action via useActionState and reads needsElevation from the result to surface the modal.
  3. Server action. Wraps the api-client call in withBreakGlass(() => api.someWrite(...)). The wrapper distinguishes elevation errors from validation errors and returns the typed sentinel; the action's existing error path still handles 4xx for validation/conflicts.
  4. Backend gate. Same route hosts the Clinic-app caller (per-org permission) and the Console caller (elevation). The middleware admits both or 403/410s with a code the Console UI knows.
  5. Audit linkage. Every audit row written inside the elevated handler picks up break_glass_id automatically via the GUC the middleware sets at session match. No call-site changes — the existing audit_log_insert(...) redefinition in migration 000011 reads the GUC.

Where to mount the gate.

go
r.Route("/v1/organizations/{id}", func(r chi.Router) {
    r.Use(middleware.RequireURLOrgMatchesScope("id")) // P47

    // Read — always-on; tenant + platform staff both pass via permission.
    r.With(middleware.RequirePermission(principal.PermOrganizationsManageMembers)).
        Get("/members", h.HandleListMembers)

    // Write — tenant via permission, platform via elevation.
    r.With(middleware.RequirePerOrgPermissionOrBreakGlass(
        principal.PermOrganizationsManageMembers,
        breakglass.ScopeOrgManagement,
        s.breakGlassService, "id",
    )).Post("/members", h.HandleAddMember)
})

What stays always-on. Reads (list members, list staff invitations, view org details). Operational support requires visibility; staff data isn't PHI; the controller-vs-processor risk is on writes. Mirrors the patient surfaces — aggregate view always-on, identifiable lists / writes elevation-gated.

What's gated under org_management. Today: PATCH /organizations/{id}, POST/DELETE /members, POST /staff-invitations. Sibling targets that future commits land: org settings, domains, designations, webhooks, integrations, billing.

Why a broad scope, not per-resource. Per-resource scopes would give finer-grained audit, but the elevation modal would pop every two clicks while a superadmin works through related ops. The audit row already records the specific action — the scope only needs to be coarse enough to bound the session's blast radius.

Why platform principals don't bypass even when they're org members. A superadmin who happens to be a member of OrgA still goes through elevation when calling Console-mounted org_management routes against OrgA. The point is that every cross-tenant write is linked to an open session — silently bypassing on incidental membership defeats the audit trail. Mirrors RequireBreakGlass (no superadmin bypass there either).

Cross-references. P10 Audit Logging — break_glass_id linkage on every row written inside an elevated handler. P42 Server-Side Response Caching — break-glass session lookup is intentionally untagged (timing is load-bearing; expired sessions must not cache as active). P47 URL ≡ Scope Guard — mounted before RequirePerOrgPermissionOrBreakGlass so the URL {paramName} is validated to match the calling principal's scope. P48 Server Data Flow to Client Components — BreakGlassSessionProvider uses useServerSyncedState to honour layout invalidation. Decisions: Why one Console-side break-glass primitive — the ADR.


Media Pipelines

P56: Exercise Video Composition Pipeline

What. A three-stage pipeline that turns raw filming primitives into patient-facing exercise videos:

   ┌──────────────────────┐   ┌──────────────────────┐   ┌──────────────────────┐
   │  AWS S3              │   │  exercise-composer   │   │  Bunny Stream        │
   │  (source of truth)   │   │  (Go service)        │   │  (CDN delivery)      │
   │                      │   │                      │   │                      │
   │  Raw assets per      │──▶│  Download bundle,    │──▶│  Transcode to HLS,   │
   │  exercise: silent    │   │  ffmpeg-bake one     │   │  adaptive bitrates,  │
   │  rep blocks +        │   │  MP4 per prescription│   │  serve via player +  │
   │  per-side VO mp3s +  │   │  + variant pool,     │   │  HLS URL.            │
   │  intro/pauza/outro   │   │  upload to Bunny.    │   │                      │
   └──────────────────────┘   └──────────────────────┘   └──────────────────────┘
        read-only                  stateless                bunny_video_id
        per-env bucket             POST /v1/compose         cached in DB

Source assets are owned by the platform's filming/audio teams; rendered MP4s are owned by Bunny. The composer is stateless — its job is to translate one (exercise, prescription, language) tuple into one MP4 + upload it.

Why this shape — three concerns, three stages.

  1. Source of truth lives in S3, not in the CDN. Raw filming primitives need our control: versioning, encryption, IAM, lifecycle. Bunny Stream is for delivery; it's not where the assets live forever. If Bunny credentials rotate, the library changes, or the CDN provider swaps, the renders re-bake from S3 unchanged.
  2. Composition is heavy CPU + ffmpeg, not API work. Keeping the bake step out of the Core API process avoids blocking request workers; bundling ffmpeg in a separate container avoids bloating the API image. Stateless service is straightforward to scale on queue depth.
  3. Delivery is HLS, not MP4. Bunny does transcoding-to-adaptive-bitrate automatically; we hand off an MP4 and get an HLS ladder for free. Building that ourselves means encoding pipelines + multiple S3 lifecycle policies + HLS playlist generation — none of which is core to the platform.

The compositional model. Each rendered MP4 is one exercise at one prescription:

intro → set 1 → pauza → set 2 → pauza → … → outro

Each set is one side (left | right) and one rep count. Reps are always multiples of 5, capped at 20 (the audio master's count ceiling). Inside a set, the rep video plays N/5 times silently and the side-specific VO is laid over with breathing mixed under at 0.35 gain. Counts reset per set by re-using the baked set clip — same baked set serves every occurrence of (side, reps) in the prescription.
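
Illustratively, a hypothetical expansion of a prescription into the sequence the composer concatenates:

go
type Set struct {
    Side string // "left" | "right"
    Reps int    // multiple of 5, capped at 20
}

// clipSequence expands a prescription into the slot names the ffmpeg concat step bakes.
func clipSequence(sets []Set) []string {
    seq := []string{"intro"}
    for i, s := range sets {
        if i > 0 {
            seq = append(seq, "pauza")
        }
        // One baked clip per (side, reps): the silent 5-rep block loops reps/5 times
        // with the side-specific VO laid over and breathing mixed under.
        seq = append(seq, fmt.Sprintf("set_%s_%d", s.Side, s.Reps))
    }
    return append(seq, "outro")
}

A prescription of two 10-rep sets (left, then right) expands to intro → set_left_10 → pauza → set_right_10 → outro.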

Variant model. Each slot has 3 variants delivered by the filming/audio teams. Each render picks one variant per slot (random or seeded), which keeps different prescriptions of the same exercise from feeling repetitive. The cache key includes the deterministic seed (when applied) so identical inputs re-render identically.

Cache key. (exercise, recipe_hash, language)bunny_video_id. Stored in the Core API DB (exercise_renders table). Asset-version-aware: when the filming team uploads a new variant, exercises.asset_version bumps and stale renders re-fetch.
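
One plausible derivation of recipe_hash (the real composer may canonicalise differently):

go
// recipeHash fingerprints the recipe so identical inputs resolve to the same render.
func recipeHash(recipe any) (string, error) {
    b, err := json.Marshal(recipe) // struct marshalling gives a stable field order
    if err != nil {
        return "", err
    }
    sum := sha256.Sum256(b)
    return hex.EncodeToString(sum[:]), nil
}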

Two exercise kinds. The exercises table has a kind column with two values: reps_based (composer pipeline as described above) and duration_based (a single pre-baked MP4 uploaded directly to Bunny — no primitives, no composer, no recipe). Both flow through the same exercise_renders cache; duration_based exercises just have exactly one row per language with recipe_hash='_imported' and recipe=NULL. See features/exercise-library/composition.md for the full kind comparison.

Bunny collections. One collection per exercise slug, used by both kinds. The composer auto-creates the collection on first render; legacy/duration_based imports upload directly into the same per-slug collection. Bunny library is one per env (dev / staging / production).

Required infrastructure.

  • S3 bucket per env (restartix-exercise-assets-{dev,staging,production}), Terraform-managed via the storage-s3 module, eu-central-1, SSE-S3, public access blocked, versioning on (staging/prod). Composer's Fargate task role is read-only on the bucket by design — outputs go to Bunny, not back to S3.
  • Bunny Stream library per env, manually provisioned in the Bunny dashboard (no Terraform Bunny provider). Credentials live in env vars for v1; production migration target is the Cat A curated providers resolver (capability=video_storage, provider_name=bunny, bootstrap-seeded into platform_service_providers).
  • The composer service at services/exercise-composer/. ECS Fargate task, separate ECR image (Go binary + ffmpeg + ffprobe). HTTP API: POST /v1/compose, GET /healthz.
  • exercise_renders cache table in the Core API DB (deferred; lands with F9.1 integration).

Where it applies.

  • F9.1 Exercise Library — the asset bundle layout and recipe contract.
  • F9.2 Treatment Plans — calls the cached render endpoint per prescribed exercise; doesn't see the composer directly.
  • F9.3 Patient Enrollment + Execution — patient app fetches HLS URLs from the cache, plays via Bunny player.
  • F-tier "guided sessions", any future feature that plays an exercise at a specific dose.

Foundation status. Composer service shipped (sandbox at experiments/exercise-composer/ proves the algorithm; production service at services/exercise-composer/ is the deployable). End-to-end tested against dev S3 + dev Bunny. The Core-API-side exercise_renders cache + queue wrapper is pending F9.1.

Cross-references. Decisions: Why composer is a separate Go service. Decisions: Why Bunny Stream + AWS S3 split. Decisions: Why caching by (exercise, prescription) only. features/exercise-library/composition.md — the compositional spec. reference/exercise-content-pipeline.md — operational details (bucket layout, Bunny setup, upload workflow).


How to Use This Catalog

When designing a new feature:

  1. Read its spec in apps/docs/features/{feature}/
  2. List the patterns it touches (most features touch ≥ 5)
  3. For each pattern: check this doc — does the foundation infrastructure exist?
  4. If any required pattern is missing, the feature is blocked on landing that pattern first
  5. Cross-reference dependency-map.md to see what else is blocked on the same pattern

A feature is ready to build when every pattern it depends on is implemented. No exceptions, no shortcuts — that's the foundation discipline rule from CLAUDE.md.