
Session Management & RLS Setup

Stale since the Layer-1.24 Go cleanup; a holistic rewrite is pending. The type, middleware, and column names below are pre-cleanup: PrincipalContext is now principal.Subject, ClerkAuth is now middleware.Authenticate(verifier, resolver, loader), humanRepo.LoadPermissions is now principal.SubjectLoader.LoadPermissions, and humans.clerk_user_id is now humans.provider_subject_id. The flow shape (verify → resolve → load Subject → attach RLS tx) is unchanged; only names and packages moved. Verify against the code before quoting. Authoritative pointer: decisions.md → Auth chain shape.

Describes the request-scoped session model that's actually shipped today (Layer 0 + Layer 1.1). Identifiers are UUIDv7 throughout — no int64 IDs anywhere in this path. Authorization is per-org permissions, not role-string compares.

Overview

Every authenticated request establishes a session context by:

  1. Verifying the Clerk session token (ClerkAuth middleware).
  2. Loading the internal humans row (joined with the matching principals row) from the database via the AdminPool (bypasses RLS) and JIT-provisioning both rows in a single transaction on first sign-in.
  3. Loading the principal's memberships and platform-role grants — IsSuperadmin, Memberships[] — onto a PrincipalContext stored in the request context.
  4. Resolving the current org for this request (OrganizationContext middleware), validating membership, then loading the per-org permission set from role_permissions for the role the principal holds in that org.
  5. Routing to the correct connection pool: AdminPool (owner role, bypasses RLS) for superadmins; AppPool (restricted role, RLS enforced) for everyone else.
  6. Acquiring a connection, opening an explicit pgx.Tx, and binding transaction-scoped Postgres session variables to it via the set_app_*_context SECURITY DEFINER wrappers so RLS policies can scope queries to the current principal / actor type / org / role.

The same pgx.Tx is reused by every repository call in the request — repositories pull it out via database.TxFromContext(ctx). The transaction commits at the end of the request (or rolls back on a 5xx / panic), at which point Postgres wipes the session variables automatically — there is no clear-state function callable from AppPool, so a returning pooled connection cannot leak authorization state to the next request.

The result: tenant isolation is enforced by Postgres RLS, not application filtering, and authorization decisions are made against a permission code (organizations.update, audit_log.view_org, …) — never a hardcoded role-string comparison.

Session Variables

Four Postgres session variables drive RLS. They're typed via the current_app_*() helpers in migrations/core/000001_init.up.sql.

| Variable | Type (after NULLIF/cast) | Set by | Read by |
| --- | --- | --- | --- |
| app.current_principal_id | UUID | OrganizationContext | current_app_principal_id() — used in self-scoped RLS |
| app.current_actor_type | TEXT | OrganizationContext | current_app_principal_type() — used by policies that branch on 'human' \| 'agent' \| 'service_account' \| 'system' |
| app.current_org_id | UUID | OrganizationContext | current_app_org_id() — used by every tenant-scoped policy |
| app.current_role | TEXT | OrganizationContext | current_app_role() — kept for diagnostic logging; no longer read by any RLS policy (see "Permission-based, never role-string" below) |

They're set inside the request's transaction with set_config(..., true) (the third argument, is_local) so they are transaction-scoped — Postgres clears them at COMMIT/ROLLBACK, so the next caller using this pooled connection cannot inherit a previous request's identity. Combined with the lock-once gate inside each set_app_*_context wrapper (refuses to re-bind principal_id to a different value within the same transaction), this is the load-bearing defense against an SQLi attacker on AppPool spoofing into another tenant's identity.
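
A quick psql illustration of that transaction-local scoping (the UUID is a placeholder; production code never calls set_config directly, it goes through the wrappers):

sql
BEGIN;
-- Third argument (is_local) = true: the value exists only inside this transaction.
SELECT set_config('app.current_org_id', '0190c1a2-7d3e-7000-8000-000000000000', true);
SELECT current_setting('app.current_org_id', true);  -- returns the UUID; the RLS helpers can read it
COMMIT;
-- After COMMIT the setting reverts to its pre-transaction state (unset here), which the
-- NULLIF(..., '') in the current_app_*() helpers maps to SQL NULL.
SELECT current_setting('app.current_org_id', true);  -- empty / NULL: nothing leaks to the next caller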

uuid.Nil is rendered as the empty string when set; the helpers use NULLIF so the SQL side sees NULL. Public /v1/public/* endpoints (e.g. GET /v1/public/organizations/resolve) run on AdminPool — owner role, RLS bypassed — with the handler doing a single-row equality match on slug or verified domain. AppPool with no session vars intentionally sees zero rows on every authenticated table; there are no public RLS carve-outs anymore.

Middleware Stack

The pipeline is wired in services/api/internal/core/server/routes.go for the /v1 group: ClerkAuth → OrganizationContext → ActivityTracker → audit.Middleware.
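
A hedged sketch of that wiring, assuming a chi-style router. Only the middleware order and the OrganizationContext signature come from this page; the constructor arguments for the other middlewares are illustrative:

go
// routes.go shape (illustrative, not verbatim)
r := chi.NewRouter()

r.Route("/v1", func(v1 chi.Router) {
	v1.Use(middleware.ClerkAuth(clerkClient, adminPool))                  // verify JWT, JIT-provision, build PrincipalContext
	v1.Use(middleware.OrganizationContext(adminPool, appPool, humanRepo)) // resolve org, load permissions, open RLS tx
	v1.Use(middleware.ActivityTracker())                                  // throttled last_used_at bump
	v1.Use(audit.Middleware(recorder))                                    // failure-path audit rows

	// ... route registrations
})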

1. ClerkAuth (internal/core/middleware/auth.go)

Verifies the bearer JWT, JIT-provisions the principals + humans rows if this is the first sign-in, and assembles a PrincipalContext.

go
// internal/core/auth/context.go (current shape, simplified)

type Membership struct {
    OrganizationID uuid.UUID
    RoleID         uuid.UUID
    RoleCode       string
}

type PrincipalContext struct {
    PrincipalID           uuid.UUID            // FK to principals.id
    ActorType             string               // 'human' | 'agent' | 'service_account' | 'system'
    ClerkUserID           string               // populated when ActorType = 'human'
    Email                 string
    IsSuperadmin          bool                 // derived from platform_roles (humans only)
    PlatformRoles         []string             // ["superadmin"] when present
    Memberships           []Membership         // every (org, role) pair
    CurrentOrganizationID uuid.UUID            // populated by OrganizationContext
    CurrentRoleCode       string               // role held in CurrentOrganizationID
    Permissions           map[string]struct{}  // codes for that role; nil for superadmin
}

func (p *PrincipalContext) HasPermission(code string) bool {
    if p.IsSuperadmin {
        return true
    }
    _, ok := p.Permissions[code]
    return ok
}

What it does end-to-end:

  1. Extract the Authorization: Bearer <jwt> header. Missing / malformed → 401.
  2. Verify the token via the Clerk SDK. Bad signature, expired, blocked session → 401.
  3. Look up humans.clerk_user_id. If absent, run provisionHuman (insert a principals row + matching humans row in one transaction). Patient onboarding (which provisions the patient_profiles + patients rows for the org indicated by X-Organization-ID) runs separately via the /v1/portal/onboard flow — see patients/onboarding.md. Patients are not memberships post-1.26.
  4. If humans.blocked = TRUE, return 403 — and audit-log the failed request.
  5. Load the principal's platform_roles and organization_memberships rows.
  6. Store the assembled PrincipalContext on r.Context() via auth.ContextWithPrincipal. Retrieve it downstream with auth.PrincipalFromContext.

The lookup uses the AdminPool (bypasses RLS) because the humans / principals rows may not yet have an organization scope when the request arrives. Permissions are not loaded here — they depend on which org the request is scoped to, and that's resolved in the next middleware.
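
A hedged sketch of the provisionHuman step referenced in step 3 above. The column lists, the database-generated UUIDv7 defaults, and the exact signature are assumptions; only "one transaction, a principals row plus a matching humans row" comes from this page:

go
func provisionHuman(ctx context.Context, adminPool *pgxpool.Pool, clerkUserID, email string) (uuid.UUID, error) {
	tx, err := adminPool.Begin(ctx)
	if err != nil {
		return uuid.Nil, err
	}
	defer tx.Rollback(ctx) // returns ErrTxClosed after a successful Commit; discarded

	// principals row first so the humans row can reference it.
	var principalID uuid.UUID
	if err := tx.QueryRow(ctx,
		`INSERT INTO principals (actor_type) VALUES ('human') RETURNING id`,
	).Scan(&principalID); err != nil {
		return uuid.Nil, err
	}

	if _, err := tx.Exec(ctx,
		`INSERT INTO humans (principal_id, clerk_user_id, email) VALUES ($1, $2, $3)`,
		principalID, clerkUserID, email,
	); err != nil {
		return uuid.Nil, err
	}

	return principalID, tx.Commit(ctx)
}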

2. OrganizationContext (internal/core/middleware/organization.go)

Resolves the current org, validates membership, loads permissions for that org, picks the pool, acquires a connection, and sets the RLS session variables.

go
// services/api/internal/core/middleware/organization.go (signature)

func OrganizationContext(
    adminPool, appPool *pgxpool.Pool,
    humanRepo *human.Repository,    // needed for LoadPermissions(principalID, orgID)
) func(http.Handler) http.Handler

The third argument matters: permissions are loaded per request, scoped to the current org, because the same principal can hold different roles in different orgs (admin at clinic A, customer_support at clinic B). They cannot be cached on PrincipalContext ahead of time without binding the cache to (principal, org), which we don't bother with — Postgres serves the lookup quickly enough.

Order of operations for each request (a condensed sketch of steps 5-8 follows the list):

  1. Pull the PrincipalContext from the request context. Missing → 401 (ClerkAuth should have run first).
  2. Resolve the org ID, in this order: X-Organization-ID header (set by the frontend proxy.ts from the resolved subdomain or custom domain) → humans.current_organization_id if the principal is still a member of it → first membership → uuid.Nil (safe default).
  3. Validate membership. A non-superadmin who passes a real org ID they aren't a member of gets 403. uuid.Nil is a valid pass-through (used by patients on first sign-up before auto-link runs, or by superadmins operating without org context).
  4. Load permissions for (principalID, orgID) via humanRepo.LoadPermissions, which joins organization_memberships → roles → role_permissions → permissions. The result populates PrincipalContext.CurrentRoleCode and PrincipalContext.Permissions. Skipped for superadmins (HasPermission short-circuits to true for them).
  5. Pick the pool: AdminPool for IsSuperadmin, AppPool otherwise.
  6. Acquire a connection, BeginTx, defer the cleanup that decides commit-vs-rollback from the response status (commit on < 500, rollback on 5xx or panic).
  7. Set the RLS session variables by calling one of the SECURITY DEFINER wrappers (set_app_staff_context, set_app_patient_context, or set_app_principal) inside the transaction. The wrapper validates the (principal, org, role) tuple against real organization_memberships / patients rows, normalizes actor_type from the principal record, refuses to attach a session for a blocked / soft-deleted / revoked principal, and binds the four GUCs via set_config(..., true).
  8. Stash the transaction on the context via database.ContextWithTx. Repositories use this same pgx.Tx for the rest of the request via database.TxFromContext(ctx).
  9. Tell the audit middleware which org this request is scoped to (audit.SetCurrentOrg(ctx, orgID)) so 403/500 failure-path rows downstream are attributed correctly.
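
A condensed, hedged sketch of steps 5-8 inside the middleware handler. ww is assumed to be a status-capturing http.ResponseWriter wrapper; error handling and variable names beyond the documented calls are illustrative:

go
pool := appPool
if pc.IsSuperadmin {
	pool = adminPool // owner role, RLS bypassed
}

conn, err := pool.Acquire(r.Context())
if err != nil {
	http.Error(w, "internal error", http.StatusInternalServerError)
	return
}
defer conn.Release()

tx, err := conn.Begin(r.Context())
if err != nil {
	http.Error(w, "internal error", http.StatusInternalServerError)
	return
}

// Step 7: bind the four GUCs inside the transaction via the SECURITY DEFINER
// wrapper (staff variant shown; patient/principal variants exist too).
if _, err := tx.Exec(r.Context(),
	`SELECT set_app_staff_context($1, $2, $3)`,
	pc.PrincipalID, pc.CurrentOrganizationID, pc.CurrentRoleCode,
); err != nil {
	_ = tx.Rollback(r.Context())
	http.Error(w, "internal error", http.StatusInternalServerError)
	return
}

ctx := database.ContextWithTx(r.Context(), tx)

defer func() {
	// Step 6's cleanup: commit on < 500, roll back on 5xx or panic. Either way
	// Postgres drops the transaction-scoped GUCs at COMMIT/ROLLBACK.
	if rec := recover(); rec != nil || ww.Status() >= 500 {
		_ = tx.Rollback(ctx)
		if rec != nil {
			panic(rec)
		}
		return
	}
	_ = tx.Commit(ctx)
}()

next.ServeHTTP(ww, r.WithContext(ctx))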

3. ActivityTracker (internal/core/middleware/activity.go)

Bumps organization_memberships.last_used_at for the (principal, org) of the current request. Throttled by an in-process 60-second cache so most requests are zero-DB-cost. See reference/activity-tracking.md.
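
A minimal sketch of the throttle shape, assuming an in-memory map keyed by (principal, org); the real implementation and its names live in activity.go and reference/activity-tracking.md:

go
type activityThrottle struct {
	mu   sync.Mutex
	seen map[[2]uuid.UUID]time.Time // key: (principal ID, org ID)
}

// shouldBump reports whether this (principal, org) pair has gone a full minute
// without a last_used_at write; only then does the middleware touch the DB.
func (t *activityThrottle) shouldBump(principalID, orgID uuid.UUID, now time.Time) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	key := [2]uuid.UUID{principalID, orgID}
	if last, ok := t.seen[key]; ok && now.Sub(last) < time.Minute {
		return false
	}
	t.seen[key] = now
	return true
}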

4. audit.Middleware (internal/core/audit/middleware.go)

Captures request metadata, lets the handler run, then writes one audit row for failure paths it owns: every 5xx, every 403, and 401s only when the request carried a Bearer token (so unauthenticated polling doesn't pollute the log). Successful mutations are audited from inside the handler — the recorder writes a single row per logical event, with the diff in changes and sensitive keys redacted (P11).
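
A hedged sketch of that decision rule. The status-capturing wrapper and function names are assumptions; the rule itself (every 5xx, every 403, 401 only with a Bearer token) is from this page:

go
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (w *statusRecorder) WriteHeader(code int) {
	w.status = code
	w.ResponseWriter.WriteHeader(code)
}

// shouldAudit implements the failure-path ownership described above.
func shouldAudit(status int, hadBearerToken bool) bool {
	switch {
	case status >= 500:
		return true
	case status == http.StatusForbidden:
		return true
	case status == http.StatusUnauthorized:
		return hadBearerToken // unauthenticated polling stays out of the log
	default:
		return false // successful mutations are audited inside the handler instead
	}
}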

Repository Pattern

Repositories don't reach for the pool — they pull the request-scoped pgx.Tx from the context so RLS session variables are in scope:

go
// internal/core/database/postgres.go

func TxFromContext(ctx context.Context) pgx.Tx { /* ... */ }
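
ContextWithTx / TxFromContext follow the standard context-key pattern; a minimal sketch, with the unexported key type assumed:

go
type txKey struct{}

func ContextWithTx(ctx context.Context, tx pgx.Tx) context.Context {
	return context.WithValue(ctx, txKey{}, tx)
}

func TxFromContext(ctx context.Context) pgx.Tx {
	tx, _ := ctx.Value(txKey{}).(pgx.Tx) // nil when no request transaction is attached
	return tx
}
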
go
// Pattern used by every repo (e.g. services/api/internal/core/domain/organization/repository.go)

func (r *Repository) FindByID(ctx context.Context, id uuid.UUID) (*Organization, error) {
    tx := database.TxFromContext(ctx)
    if tx == nil {
        // Background path — no request context. Caller is responsible for
        // beginning a transaction and setting session variables.
        return nil, errors.New("no transaction in context")
    }

    var org Organization
    err := tx.QueryRow(ctx, `
        SELECT id, name, slug, /* ... */
        FROM organizations
        WHERE id = $1
    `, id).Scan(&org.ID, &org.Name, &org.Slug /* ... */)
    return &org, err
}

RLS does the org filtering automatically — even if a future bug ever hands a different org's ID to the query, the policy id = current_app_org_id() means a non-superadmin transaction sees zero rows back.
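
For illustration, the shape such a policy takes (the real definitions live in the migrations; the policy name here is an example):

sql
ALTER TABLE organizations ENABLE ROW LEVEL SECURITY;

-- organizations is keyed by the org ID itself; tenant-scoped child tables use
-- organization_id = current_app_org_id() instead.
CREATE POLICY org_isolation ON organizations
    USING (id = current_app_org_id());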

Atomicity is a free side benefit: every mutation inside one request runs in the same transaction, so a handler that does INSERT A → UPDATE B → INSERT C and errors on the third leaves no partial state behind. The audit log is written through Recorder on AdminPool and is therefore independent of this transaction — failed requests still leave their 401/403/500 audit row even when the request transaction rolls back.

RLS Helper Functions

Defined in migrations/core/000001_init.up.sql. They return UUID / TEXT typed values, with NULLIF so uuid.Nil reaches Postgres as SQL NULL:

sql
CREATE OR REPLACE FUNCTION current_app_principal_id() RETURNS UUID AS $$
    SELECT NULLIF(current_setting('app.current_principal_id', true), '')::UUID;
$$ LANGUAGE SQL STABLE;

CREATE OR REPLACE FUNCTION current_app_principal_type() RETURNS TEXT AS $$
    SELECT NULLIF(current_setting('app.current_actor_type', true), '');
$$ LANGUAGE SQL STABLE;

CREATE OR REPLACE FUNCTION current_app_org_id() RETURNS UUID AS $$
    SELECT NULLIF(current_setting('app.current_org_id', true), '')::UUID;
$$ LANGUAGE SQL STABLE;

CREATE OR REPLACE FUNCTION current_app_role() RETURNS TEXT AS $$
    SELECT NULLIF(current_setting('app.current_role', true), '');
$$ LANGUAGE SQL STABLE;

current_app_role() is no longer read by any RLS policy — the previous bare-staff-signal current_app_role() <> '' qualifiers were replaced by permission checks (current_app_has_permission('organizations', 'view_directory') for staff-directory tables, current_app_has_permission('patients', 'view') for the patient roster) so a custom no-permission per-org role doesn't over-grant visibility. app.current_role is still set by the wrappers for diagnostic logging. Tenant policies branch on current_app_has_permission(resource, action) (defined in 000002_tenancy_rbac.up.sql), which joins through organization_memberships → role_permissions → permissions to answer "does the current (principal, org) hold this capability?".
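
A hedged sketch of the shape current_app_has_permission takes. The real definition is in 000002_tenancy_rbac.up.sql; the dotted resource.action concatenation and the column names (which follow the Permission Loading query below) are assumptions:

sql
CREATE OR REPLACE FUNCTION current_app_has_permission(resource TEXT, action TEXT)
RETURNS BOOLEAN AS $$
    SELECT EXISTS (
        SELECT 1
        FROM organization_memberships om
        JOIN role_permissions rp ON rp.role_id = om.role_id
        JOIN permissions p       ON p.code    = rp.permission_code
        WHERE om.principal_id    = current_app_principal_id()
          AND om.organization_id = current_app_org_id()
          AND p.code             = resource || '.' || action
    );
$$ LANGUAGE SQL STABLE;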

There is intentionally no is_superadmin() helper. Superadmins bypass RLS via the AdminPool's owner role, not via a policy clause.

Permission Loading

humanRepo.LoadPermissions(ctx, principalID, orgID) runs:

sql
SELECT p.code
FROM organization_memberships po
JOIN role_permissions rp ON rp.role_id = po.role_id
JOIN permissions p       ON p.code     = rp.permission_code
WHERE po.principal_id    = $1
  AND po.organization_id = $2;

The result becomes PrincipalContext.Permissions. RequirePermission(code) middleware in internal/core/middleware/rbac.go then checks membership in this set and returns 403 if absent. RequireSuperadmin() checks the platform-level IsSuperadmin flag.

Don't compare PrincipalContext.CurrentRoleCode against a literal — it's exposed for diagnostics, not authorization. Every authorization decision goes through HasPermission / RequirePermission. See reference/rbac-permissions.md for the catalog.
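
A hedged sketch of the RequirePermission gate; whether PrincipalFromContext returns a (value, ok) pair and the exact error envelope are assumptions:

go
func RequirePermission(code string) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			pc, ok := auth.PrincipalFromContext(r.Context())
			if !ok || pc == nil {
				http.Error(w, "unauthorized", http.StatusUnauthorized)
				return
			}
			if !pc.HasPermission(code) { // superadmin short-circuits to true
				http.Error(w, "forbidden", http.StatusForbidden)
				return
			}
			next.ServeHTTP(w, r)
		})
	}
}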

Connection Pool Management

Each request holds its connection (and its single pgx.Tx) for its full duration so the RLS GUCs stay in scope. Pool sizing must therefore match peak concurrency:

bash
DB_POOL_MIN=5
DB_POOL_MAX=25     # raise if you see waits at the pool

Critical metric: the pool's acquire-wait counter, pgxpool.Stat().EmptyAcquireCount() (acquires that had to wait for a connection to free up or be constructed). If it ever leaves zero under sustained load, raise DB_POOL_MAX.

Scaling note: because the session GUCs are bound to the request's transaction (not the pooled connection), pgbouncer in transaction-pooling mode is a drop-in fit when concurrent connections push past ~1k. Each client request maps to one pgbouncer transaction lease — the unit our authorization context lives in. See Scaling Architecture → PgBouncer.

Security Guarantees

  • Tenant isolation is enforced by Postgres, not by handler code. RLS rewrites the query before execution; even if a handler forgets a WHERE org_id = ?, the policy adds organization_id = current_app_org_id().
  • Superadmin bypass is at the connection layer, not in policies. A superadmin request runs on the owner pool, where Postgres does not enforce RLS at all. No policy uses an is_superadmin() clause.
  • Session variables are transaction-scoped. Postgres clears them at COMMIT/ROLLBACK, so a connection returned to the pool cannot leak state to the next caller. There is no clear_app_context() callable from AppPool — an SQLi attacker cannot reset the lock-once gate to spoof a different principal because the function doesn't exist; within the same transaction, the lock-once check on the wrappers blocks rebind.
  • Public endpoints run on AdminPool, not AppPool. The hostname-resolve endpoint (GET /v1/public/organizations/resolve) runs as the owner role with a single-row equality match on slug or verified domain — RLS is bypassed for that one path. AppPool with no session variables intentionally sees zero rows on every authenticated table.
  • Authorization is permission-based, not role-string-based. Custom per-org roles can hold a subset of permissions and the same code paths gate them — no application changes needed.
  • Actor-type-aware policies can read current_app_principal_type() when a policy needs to differentiate humans from agents / service accounts (e.g., AI-driven writes that should record provenance).

Fail Modes

| Failure | Status | Where it's detected |
| --- | --- | --- |
| Missing Authorization header | 401 | ClerkAuth |
| Invalid / expired Clerk JWT | 401 | ClerkAuth |
| humans.blocked = TRUE | 403 (audited via audit.Middleware) | ClerkAuth |
| X-Organization-ID for an org the principal is not a member of | 403 | OrganizationContext |
| humanRepo.LoadPermissions fails | 500 (generic envelope, original error logged) | OrganizationContext |
| Pool acquire fails | 500 | OrganizationContext |
| set_config(...) fails | 500 | OrganizationContext |
| Caller lacks the gated permission | 403 | RequirePermission |

Testing RLS

The Layer 1.2 harness at services/api/internal/test/rlstest/ runs against a real Postgres in testcontainers, seeds two orgs and three humans (admin of A, patient of B, superadmin), and exercises the policies directly. Run it with:

bash
make test-integration

Manual psql verification still works — wrap the work in an explicit transaction and call the production wrapper, exactly like middleware does:

sql
-- Run as the restricted role to actually observe RLS.
SET ROLE restartix_app;

BEGIN;
  SELECT set_app_staff_context('<principal-uuid>', '<org-uuid>', 'admin');

  SELECT count(*) FROM organizations;                   -- 1 (the current org only)
  SELECT count(*) FROM organizations WHERE id <> '<org-uuid>'; -- 0
COMMIT;  -- Postgres wipes the GUCs here; a subsequent BEGIN starts clean.

RESET ROLE;

Calling set_config directly at the psql prompt also works for ad-hoc probing, but the wrapper exercises the real validation path (membership existence, principal liveness, lock-once) and is what production hits — preferring it keeps the manual probe and the production codepath in sync.