
AI Agent Runtime & Cost Economics

Implementation guidance for when AI agent features ship. Captures the runtime architecture, retrieval pattern, and cost economics decided during foundation design — pre-implementation, but committed enough to constrain the first feature that builds an agent.

This doc complements product/ai-agents.md (the platform-level commitment to AI agents as first-class actors) and decisions.md → Why principals as the root identity (the actor-model design). Foundation has shipped the substrate (principals + agents sibling table, audit_ai_provenance, RLS, pgvector); features that build agents follow the patterns here.


Status

Pre-implementation. No agent runtime exists yet. The first agent feature (likely patient-progress monitoring) is Layer 5+ — well after Foundation closes and after the clinical features (treatment plans, exercises, telerehab, telemetry rollups) it depends on. This doc locks in the design choices in advance so the first feature doesn't have to reinvent them.

Foundation pieces already shipped:

  • Principal model with agents as a sibling table (1B.1)
  • audit_ai_provenance with model_version, inputs_hash, confidence (1A.1, 1A.15 — partitioned)
  • RLS infrastructure (every actor type gets scoped the same way; agents are not privileged)
  • pgvector (1A.16 — embedding columns ready when needed)
  • Internal event bus (1A.9 — agent runs are event-driven, not in HTTP handlers)

Cost economics

The wrong framing: LLM-everything per patient

A naïve agent design runs an LLM call for every active patient every day. At ~$0.01–$0.10 per call (depending on model and prompt size), 5k active patients × 30 days = $1.5k–$15k/month. This is the cost ceiling. It's also the wrong way to architect the feature.

The right framing: 95% deterministic, 5% LLM

Most "is this patient on track?" questions are SQL queries:

  • Did the patient complete <50% of assigned exercises this week? → SQL, $0.001
  • Has the patient missed 2+ sessions in a row? → SQL, $0.001
  • Is pose accuracy below threshold X this week? → SQL, $0.001
  • Is the patient on a treatment plan that's behind schedule? → SQL, $0.001

These are the routine triage signals. They run continuously, deterministically, cheaply, with full RLS scoping. An LLM adds nothing here — every clinic wants the same thresholds expressed the same way.
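
As a concreteness check, the first two signals above collapse into a single scan of the weekly rollup table defined under "The rollup pattern" below. A minimal sketch, with illustrative thresholds rather than decided ones:

```sql
-- Sketch only: thresholds are illustrative, and week_start is assumed to
-- align with date_trunc('week', ...). RLS scopes the result to the caller's
-- organization exactly as it does for any other query.
SELECT patient_id
FROM   patient_engagement_weekly
WHERE  week_start = date_trunc('week', NOW())::date
  AND (exercise_completion_rate < 0.5   -- completed <50% of assigned exercises
       OR sessions_skipped >= 2);       -- approximates "missed 2+ sessions in a row"
```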

LLM reasoning is reserved for things rules can't do:

  • Synthesis. "Summarise the last 4 weeks of notes + telemetry into a 1-paragraph briefing the specialist reads before the next appointment."
  • Comparative reasoning. "This patient's adherence dropped — explain what changed in clinical context and suggest 2–3 interventions for the specialist to consider."
  • Personalised generation. Patient-facing messages that read like a human wrote them, scoped to that patient's specific progress.
  • Anomaly explanation. "Multiple metrics shifted at once — what story do they tell?"

Operating cost

If 10–20% of active patients trigger LLM attention per week (the rest are handled by deterministic rules), real spend lands at $0.30–$0.60/patient/month, not $3.

Reference points to judge against

| Alternative | Per-patient cost | What you get |
|---|---|---|
| LLM agent (smart triggering) | $0.30–$0.60/mo | Synthesis, generation, comparative reasoning on flagged cases |
| LLM agent (run-on-everything) | $3/mo | Same, applied indiscriminately — wasteful |
| Junior triage specialist hand-reviewing 500 patients | ~€3/mo (salary only) | Doesn't scale past one specialist; can't read deeply |
| Rule-only system | $0/mo to run | Cheap to run, expensive to develop, brittle (every new pattern needs a rule) |
| Specialist time saved (5 min/patient/week @ €15–€30/h) | €1.25–€2.50/mo equivalent | Break-even argument for the agent feature |

The platform's per-patient revenue from clinics is enough to absorb $0.30–$0.60/mo without a hit to margin. $3/mo is marginal. The architecture decision is "trigger selectively, not constantly."

Cost levers (apply all five)

These compound — combining them takes the $15k/month upper bound to $1k–$3k/month for the same coverage:

  1. Anthropic prompt caching. Cache the clinic-level system prompt + clinical protocols + agent's identity frame. ~90% cheaper on cached input tokens, 5-min TTL. Biggest single lever — input tokens are ~85% of LLM cost.
  2. Selective triggering. Run on flagged patients only. 5–10× reduction.
  3. Smaller models for routine work. Haiku for "summarise this week's activity"; Opus only for "analyse concerning pattern". 10–20× cost difference per call.
  4. Batch API for scheduled summaries. Anthropic batch API is 50% off; latency doesn't matter for nightly briefings.
  5. Compress patient history. Feed weekly rollups, not raw events. Trade a small loss of specificity for a large drop in input tokens.

Retrieval architecture

Agent reads Postgres aggregates, not raw ingest

Patient exercise-engagement data and pose-tracking landmarks ingest through the Layer 2 Telemetry pipeline. Per-rep clinical aggregates land in Postgres (pose_session_metrics, pose_rep_metrics, media_session_metrics, media_buffering_events — see /telemetry/index.md and patterns.md P32). Per-session replay landmarks go to S3 as gzipped binary blobs.

For an AI agent, the clinical aggregates in Postgres are the right surface for two reasons:

  1. Right query shape. The agent's natural query is "give me this patient's last 4 weeks of session summaries" — a point lookup with a small range scan, Postgres-shaped. The aggregates table is already that shape.
  2. Right privacy boundary. PG aggregates are RLS-scoped, audited, and classified — the same privacy envelope clinical data already lives in. The agent doesn't introduce new exposure.

S3 replay blobs are NOT for agent consumption — they exist for visual replay in the Clinic app. An agent that needs frame-level inference would re-process the landmarks through a model and write new structured features back to Postgres, rather than streaming from S3 on a per-conversation basis.

The rollup pattern

For surfaces an agent reads regularly, the platform rolls up the per-rep aggregates into per-patient weekly summaries written into Postgres:

```sql
CREATE TABLE patient_engagement_weekly (
    patient_id              UUID NOT NULL REFERENCES patients(id),
    organization_id         UUID NOT NULL REFERENCES organizations(id),
    week_start              DATE NOT NULL,

    sessions_completed      INT NOT NULL DEFAULT 0,
    sessions_skipped        INT NOT NULL DEFAULT 0,
    exercise_completion_rate NUMERIC(4,3),
    avg_session_duration_seconds INT,
    pose_accuracy_avg       NUMERIC(4,3),
    pose_accuracy_trend     TEXT, -- 'improving' | 'flat' | 'declining'
    bandwidth_quality_avg   NUMERIC(4,3),
    adherence_score         NUMERIC(4,3),

    created_at              TIMESTAMPTZ NOT NULL DEFAULT NOW(),

    PRIMARY KEY (patient_id, week_start)
);
```

This table:

  • Is the agent's input. Every retrieval is a point lookup by patient_id + a week_start >= NOW() - interval '4 weeks' predicate (see the sketch below).
  • Is RLS-scoped. Standard organization_id = current_app_org_id() policy. The agent's principal sees exactly its assigned scope.
  • Is re-identified because the rollup writer is the only path that knows both the hashed and unhashed sides — the de-identification boundary stays at ClickHouse, the agent stays on the relational side.
  • Is an event table per P41. Append-only, time-ordered, multi-year retention — partition monthly on week_start when the table lands.
  • Is joinable to treatment plans, specialist notes, forms, anything else the agent needs in one query.

ClickHouse is for human-facing dashboards and ad-hoc analytics, not agent input.
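
Concretely, the retrieval in the first bullet is nothing more exotic than the following sketch, using the columns from the DDL above; RLS filters the rows before anything reaches the agent's context window:

```sql
-- The agent's input query: point lookup on patient_id plus a small range scan.
SELECT week_start,
       sessions_completed,
       sessions_skipped,
       exercise_completion_rate,
       pose_accuracy_avg,
       pose_accuracy_trend,
       adherence_score
FROM   patient_engagement_weekly
WHERE  patient_id = $1
  AND  week_start >= (NOW() - INTERVAL '4 weeks')::date
ORDER  BY week_start;
```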

Why pgvector, not Qdrant

The reflexive instinct when adding AI features is "we need a vector database." For this platform, that instinct is wrong. Vector similarity search is not where the cost is — it's nearly free at any reasonable scale (pgvector with HNSW handles 10M+ vectors with sub-100ms queries). The cost lives in the LLM API call, not the retrieval step.

| | pgvector | Qdrant |
|---|---|---|
| LLM API cost | identical | identical |
| Vector search latency at <10M vectors | sub-100ms | sub-50ms |
| Vector search latency at 100M+ vectors | starts to slow | still fast |
| Operational overhead | none — same Postgres | another DB to operate, monitor, back up |
| RLS | enforced natively | re-implemented in app code |
| Round trips | 1 (Postgres) | 2 (Qdrant → IDs → Postgres) |
| Joinable to relational data | yes, in one query | no, app-side join |
| Cost | free (existing Postgres) | ~$50–100/mo Qdrant Cloud minimum, or self-host ops |

For this platform's stated scale (small-to-medium clinics, single-digit locations, growing 5–10× from launch), pgvector is the correct choice. When to revisit: vector count > ~50M, or sub-millisecond p99 retrieval becomes a hard requirement — neither is on the platform's roadmap.
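
For illustration, this is what "one round trip, joinable in one query" means in the comparison above. The table and column names (specialist_notes, embedding, display_name) are hypothetical, since no embedding-bearing table is committed yet; <=> is pgvector's cosine-distance operator:

```sql
-- Semantic retrieval plus a relational join in a single RLS-scoped statement.
-- $1 = patient id, $2 = query embedding (same dimension as the stored column).
SELECT n.id,
       n.created_at,
       n.body,
       p.display_name           -- hypothetical column, shown to make the join point
FROM   specialist_notes n
JOIN   patients p ON p.id = n.patient_id
WHERE  n.patient_id = $1
ORDER  BY n.embedding <=> $2    -- pgvector cosine distance
LIMIT  5;
```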


Runtime pattern

Never synchronous in HTTP handlers

Agent runs go through the internal event bus (architecture/events.md) into worker queues, never blocking a user request. Two trigger modes:

  1. Scheduled (the common case). Nightly worker picks up active patients matching trigger criteria, computes a briefing, writes it to a patient_progress_summaries table + an audit_ai_provenance row. The clinic's morning view shows yesterday's briefings.
  2. Event-triggered (the exception path). Telemetry detects a sudden adherence drop or missed-session-streak; an event fires; the agent runs against just that patient and writes a flagged-attention row. The on-call specialist sees the alert.

Both paths are the same code — the trigger differs, the runtime is identical.
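
patient_progress_summaries is named above but not yet designed. One possible minimal shape, sketched only to make the write path concrete (every column beyond the identifiers is an assumption for the first agent feature to settle):

```sql
-- Sketch: the scheduled and event-triggered paths both end by writing one row
-- here plus the audit rows described under "Provenance and compliance".
CREATE TABLE patient_progress_summaries (
    id                  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    patient_id          UUID NOT NULL REFERENCES patients(id),
    organization_id     UUID NOT NULL REFERENCES organizations(id),
    agent_principal_id  UUID NOT NULL REFERENCES principals(id),
    trigger_kind        TEXT NOT NULL,        -- 'scheduled' | 'event'
    briefing            TEXT NOT NULL,        -- the 1-paragraph briefing text
    status              TEXT NOT NULL DEFAULT 'pending_review',
    created_at          TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```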

Tool calling with own principal scope

Each agent has its own row in principals (with the agent profile) and its own role grants in the org. RLS scopes the agent exactly like a human — the agent can read what its assigned scope permits and nothing else. No backdoors, no SECURITY DEFINER shortcuts, no agent-specific bypass policies.
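
For reference, the scoping policy on a table the agent reads is the same one any human actor gets; the statement below is an illustrative restatement of the standard Foundation policy, not a new agent-specific one:

```sql
-- No actor-type branch, no agent carve-out: one org-scoping policy for everyone.
ALTER TABLE patient_engagement_weekly ENABLE ROW LEVEL SECURITY;

CREATE POLICY org_scope ON patient_engagement_weekly
    USING (organization_id = current_app_org_id());
```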

Tools the agent can call are API endpoints, not direct DB access. The agent gets an X-Organization-ID header and an auth token bound to its principal; it makes HTTP calls to the same endpoints staff use. This means:

  • Same RLS, same audit, same rate limits, same validation
  • Adding a tool = exposing an existing endpoint to the agent's role grant
  • No "agent-only" code path that drifts from the human path

Approval workflow — never auto-act

Agent suggestions land in a specialist queue. The specialist approves, modifies, or rejects. Rejection becomes a feedback signal captured in audit (the agent's suggestion + the specialist's reason) — useful for both regulatory review and future model evaluation. The agent never directly modifies clinical data without specialist sign-off.
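
A sketch of how a rejection could be captured as a feedback signal, assuming the patient_progress_summaries shape sketched earlier; the audit_log columns beyond actor_id and actor_type are assumptions:

```sql
-- Specialist rejects a suggestion; the decision and the stated reason both land
-- in audit so regulatory review and model evaluation can read them later.
BEGIN;

UPDATE patient_progress_summaries
SET    status = 'rejected'
WHERE  id = $1;

INSERT INTO audit_log (actor_id, actor_type, action, target_id, detail)
VALUES ($2, 'staff', 'agent_suggestion_rejected', $1,
        jsonb_build_object('reason', $3));   -- specialist's stated reason

COMMIT;
```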

Exception: low-stakes patient-facing actions (sending a "great work this week" message) can run autonomously if explicitly scoped that way per agent definition — but that's a per-feature decision with its own audit trail, not the default.

Provenance and compliance

Every agent action writes:

  • audit_log row with actor_id = agent's principal, actor_type = 'agent'
  • audit_ai_provenance sibling row with model_version, inputs_hash (SHA-256 of the prompt + retrieved context), confidence if the model exposes one
  • Both partitioned monthly per P41; same retention as audit_log (≥ 6 years)

inputs_hash lets a regulator (or an internal audit) replay the exact decision: reconstruct the prompt and retrieved context, confirm they hash to the stored value, re-run them through the recorded model_version, and check whether the agent's suggestion was reproducible. This is the load-bearing reason audit_ai_provenance exists — it's not optional.
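
A sketch of the two writes per agent run, assuming an audit_log_id linking column and standard audit_log columns beyond those named in this doc; inputs_hash is computed application-side (SHA-256 over the prompt plus retrieved context) before the insert:

```sql
-- One audit_log row with the agent as the actor, plus its provenance sibling.
INSERT INTO audit_log (id, actor_id, actor_type, action, occurred_at)
VALUES ($1, $2 /* agent's principal */, 'agent', 'progress_briefing_generated', NOW());

INSERT INTO audit_ai_provenance (audit_log_id, model_version, inputs_hash, confidence)
VALUES ($1, $3 /* pinned model identifier */, $4 /* sha256 hex */, $5);
```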


What we deliberately did NOT decide

  • Specific feature shapes. Patient progress monitoring, treatment plan drafting, intake triage, transcription — those land per-feature with their own scope, their own regulatory analysis (EU MDR / GDPR Art. 22 risk decided per feature, not platform-wide), their own UI. This doc commits the runtime, not the features.
  • Trigger criteria for "flag this case." Each feature defines its own thresholds. The runtime supports both rule-driven and ML-driven flagging; choosing between them is a per-feature decision.
  • Exact prompt templates. Captured per feature, in the feature's own spec. The runtime just executes them.
  • Which models for which task. Model selection per feature, with a default to "smaller model for routine, larger for flagged" — but the specifics live in feature specs.
  • Whether to fine-tune. Default is no — prompt engineering, retrieval, and selective model sizing are cheap and capable enough that fine-tuning rarely earns its operational cost. Revisit if a specific feature has a sustained accuracy gap that retrieval can't close.

When to revisit this doc

| Trigger | What changes |
|---|---|
| Total embedding vector count > ~50M across all tables | Re-evaluate pgvector vs dedicated vector DB |
| LLM API spend sustained > $10k/month after applying all cost levers | Re-examine triggering logic; possibly fine-tune to compress prompts |
| Regulator question about a specific agent decision | The audit_ai_provenance replay capability needs to actually work end-to-end — exercise it before the question lands |
| First production agent feature scoped | Re-read this doc; update with anything the feature surfaced that contradicts it |
| EU AI Act high-risk classification applies to a feature | Add a §"AI Act compliance" section per the feature; doesn't change runtime, changes documentation + transparency requirements |

See also