Telemetry — Patient Engagement & Pose Data Pipeline
Layer 2 feature, post-foundation. Decisions in decisions.md → Why telemetry is PG + S3, not ClickHouse. Foundation primitives this stack relies on (per-purpose consent,
events.Bus, Cat F service-accounts, data classification) are already in place — nothing more ships from foundation. Until Layer 2 begins, no Telemetry API exists; this doc describes the shape it will take when it does.
Scope
Telemetry exists for two concrete product needs:
- Exercise video engagement — what patients watch, when they pause/seek, completion rates, buffering experience.
- Pose-detection data — MediaPipe landmark frames captured during pose-tracked exercises, server-side computed form scores, full-session replay for specialist review.
Everything else that earlier specs called "telemetry" is out of scope because foundation primitives already cover it:
| Concern | Lives in |
|---|---|
| Compliance audit (who did what, GDPR/MDR forensic trail) | audit_log (P10, partitioned monthly per P41) — not telemetry |
| Usage-based billing data | usage_records / usage_quotas / usage_summaries (1C.7) |
| AI provenance | audit_ai_provenance sibling to audit_log |
| Security signals | mostly audit_log; SIEM-shaped signals out of scope |
| App observability (latency, traces) | OTel → Datadog/Grafana, not bespoke |
| Cross-tenant analytics | none — readers are all clinic-scoped (specialist, patient, clinic admin) |
The platform's only telemetry-shaped workload is high-volume time-series ingest of pose landmarks and video-engagement events, scoped to a single clinic per session, read by the same clinic's specialists or the patient themselves.
Architecture
Patient Portal ─── signed-session-token ───► Telemetry API ──┐
(browser) binary float32, gzip │
├─► S3 per-batch PUT (in-flight buffer)
│ {org_id}/{session_id}/inflight/{batch_seq}.bin.gz
│ (durable from first batch arrival)
│
├─► Server-side aggregation
│ (form_score, ROM, rep_count from
│ landmarks at session_end)
│
└─► events.Bus event ─► Core API subscriber
│
└─► Postgres aggregates
(RLS, audit, classification)
At session_end:
─► S3: finalized replay blob
s3://restartix-telemetry/{org_id}/{session_id}.bin.gz
(concatenated from in-flight parts;
in-flight prefix deleted)
Specialist ─┐
Patient (own) ├─► Core API ──► Postgres aggregates (dashboards)
Clinic admin ─┘ S3 replay blob fetch (replay viewer)
Console superadmin ─► Core API ──► Postgres aggregates (anonymised cross-tenant counters only)

Key design choices (rationale in decisions.md):
- Separate Go service (`services/telemetry/`) — hard ingest isolation from Core API's transactional pool. Cat F service-account principal when calling back to Core API.
- Postgres + S3, not ClickHouse — at the workload's actual shape (per-rep aggregates queryable by clinic-scoped readers + per-session replay blobs), PG carries to 50k+ peak concurrent users with monthly partitioning + materialized views + a read replica. ClickHouse is the Tier 3 escape hatch for cross-tenant analytical workloads we don't have today.
- Server-side aggregation — Telemetry API computes form_score / ROM / rep_count from landmarks at session_end and publishes via `events.Bus`. Client-side aggregation rejected because the patient controls the value.
- Signed session token on the hot path — issued by Core API at exercise-session start (claims: `principal_id`, `org_id`, `exercise_session_id`, `exp`). Verified by signature only; no Clerk JWT verify per pose-frame batch.
- No pseudonymization — readers are all clinic-scoped, so `principal_id` + `org_id` are stored plain. Pseudonymization existed to make cross-tenant aggregates safe; we don't have those readers.
Endpoints
Three typed endpoints, narrow scope:
| Endpoint | Purpose | Auth |
|---|---|---|
| `POST /v1/pose/frames` | Pose batch ingest (1-sec batches of MediaPipe landmarks) | Signed session token |
| `POST /v1/media/events` | Video lifecycle (play/pause/seek/heartbeat/buffering/milestone/end) | Signed session token |
| `POST /v1/sessions/{id}/end` | Session finalizer; triggers server-side aggregation + S3 blob finalize | Signed session token |
No generic /analytics/track or /errors/report. App-internal analytics (automation execution counts, etc.) are not telemetry — they're domain events on events.Bus or rows in domain tables. Errors → off-the-shelf (Sentry-style) when/if needed.
Full request/response shapes in api.md. Event schemas for video lifecycle in media-events.md.
Storage
Postgres (clinical aggregates)
Lives in the main RDS cluster, RLS-scoped, audit-logged, classified. Written by the Core API events.Bus subscriber, never directly by Telemetry API.
| Table | Cardinality | Partition |
|---|---|---|
| `pose_session_metrics` | 1/session | None (state-shaped) |
| `pose_rep_metrics` | ~100/session | Range-partitioned monthly (P41 — event-shaped) |
| `media_session_metrics` | 1/session | None (state-shaped) |
| `media_buffering_events` | ~5/session | Range-partitioned monthly (P41) |
Plus updates to existing patient_exercise_logs (video_watch_percentage, pose_accuracy_score, actual_sets, actual_reps).
S3 (replay blobs)
Two S3 paths per session, one canonical:
In-flight (during the session): each 1-second batch lands as a small object under `s3://restartix-telemetry/{org_id}/{session_id}/inflight/{batch_seq}.bin.gz`.
Per-batch PUT (not S3 multipart) — multipart's 5 MB minimum part size is incompatible with ~1–2 KB compressed batches. Each batch is durable as soon as Telemetry API confirms the PUT; this is what the "blast radius of 1 second of latest batch" claim in the resilience section rests on.
Finalized (at session_end): Telemetry API streams the in-flight parts into a single canonical blob and deletes the in-flight prefix: `s3://restartix-telemetry/{org_id}/{session_id}.bin.gz`.
Binary format: [1-byte version][4-byte frame_count][4-byte fps][N × 33 × 4 × float32 landmarks][gzip]. ~3 MB per 30-min session at 10fps.
Lifecycle (canonical blob): standard → IA at 90 days → Glacier at 1 year → expire at retention horizon. In-flight prefix has a 24-hour expiry as a safety net for orphaned sessions where the finalize step never ran.
Replay = canonical-blob fetch via Core API (signed S3 URL). No queryable replay store.
Cost shape at launch: ~1800 batches/session × ~700 sessions/day at first-paying-clinic scale = ~1.3M PUTs/day. At S3 PUT pricing in eu-central-1 (~$0.0054 per 1,000 requests) that's roughly $7/day, ~$200/month — a real number, but small compared to the rest of the stack. The SessionBuffer swap-point interface preserves the option to move the streaming layer to Kinesis at Tier 2 (10k+ peak concurrent) when per-batch PUT volume crosses the per-prefix S3 rate-limit horizon.
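One property of this layout is worth making explicit: because each in-flight part is an independent gzip member, the finalize step can byte-concatenate the parts without recompressing, since a concatenation of gzip streams is itself a valid gzip stream (RFC 1952 multistream, which Go's `gzip.Reader` decodes by default). A sketch with an in-memory stand-in for S3; `memBuffer` and the method shapes are hypothetical, only the `SessionBuffer` seam name comes from the spec.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"io"
	"sort"
)

// SessionBuffer is the swap-point seam from the spec; the real implementation
// PUTs each part under {org_id}/{session_id}/inflight/{batch_seq}.bin.gz.
type SessionBuffer interface {
	AppendBatch(orgID, sessionID string, batchSeq int, gzBatch []byte) error
	Finalize(orgID, sessionID string) ([]byte, error) // canonical blob bytes
}

// memBuffer is an in-memory stand-in for illustration and testing.
type memBuffer struct {
	parts map[string]map[int][]byte
}

func newMemBuffer() *memBuffer { return &memBuffer{parts: map[string]map[int][]byte{}} }

func bufKey(orgID, sessionID string) string { return orgID + "/" + sessionID }

func (b *memBuffer) AppendBatch(orgID, sessionID string, seq int, gz []byte) error {
	k := bufKey(orgID, sessionID)
	if b.parts[k] == nil {
		b.parts[k] = map[int][]byte{}
	}
	b.parts[k][seq] = gz // durable as soon as the PUT (here: map write) returns
	return nil
}

// Finalize concatenates the gzip members in batch order — no recompression —
// then drops the in-flight parts, mirroring the S3 concatenate-and-finalize.
func (b *memBuffer) Finalize(orgID, sessionID string) ([]byte, error) {
	k := bufKey(orgID, sessionID)
	seqs := make([]int, 0, len(b.parts[k]))
	for s := range b.parts[k] {
		seqs = append(seqs, s)
	}
	sort.Ints(seqs)
	var blob bytes.Buffer
	for _, s := range seqs {
		blob.Write(b.parts[k][s])
	}
	delete(b.parts, k) // "delete the in-flight prefix"
	return blob.Bytes(), nil
}

func gzipBytes(p []byte) []byte {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Write(p)
	w.Close()
	return buf.Bytes()
}

func gunzipAll(gz []byte) ([]byte, error) {
	r, err := gzip.NewReader(bytes.NewReader(gz))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r) // reads across all concatenated gzip members
}
```

A Kinesis-backed implementation at Tier 2 would satisfy the same two methods, which is what makes the swap bounded.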
Egress bytes
Stay in usage_records (1C.7). Do not duplicate in telemetry.
Pose payload encoding
MediaPipe runs entirely client-side (WebAssembly + WebGL/WebGPU) and outputs 33 landmarks (x, y, z, visibility) per frame. We transmit the landmarks, not the video — at 10fps a 30-min session is ~3 MB encoded, vs. tens of GB of video.
Wire format: binary float32 (33 × 4 × 4 = 528 bytes/frame), 1-second batches, gzipped per batch. The repo's exercise_recording_*.json test sample shows raw MediaPipe output at ~5910 bytes/frame as JSON — ~11× wasteful. JSON is forbidden on the wire; binary float32 is the canonical codec, behind the LandmarkCodec swap-point interface.
Precision: keypoints are normalized 0–1 floats; float32 gives 0.13-pixel resolution on 1280×720 video — well below clinical signal threshold.
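A minimal codec for the layout above might look like the following. Endianness is not pinned by the spec, so little-endian here is an assumption, and the `Encode`/`Decode` function shapes are illustrative (the spec commits only to the `LandmarkCodec.Encode / Decode` seam).

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/binary"
	"errors"
	"io"
	"math"
)

const (
	landmarksPerFrame = 33 // MediaPipe pose landmarks
	valuesPerLandmark = 4  // x, y, z, visibility
)

// Frame is one pose frame: 33 landmarks × (x, y, z, visibility) = 528 bytes.
type Frame [landmarksPerFrame * valuesPerLandmark]float32

// Encode packs frames into the spec's layout
// [1-byte version][4-byte frame_count][4-byte fps][frames…] and gzips it.
func Encode(version byte, fps uint32, frames []Frame) ([]byte, error) {
	var raw bytes.Buffer
	raw.WriteByte(version)
	binary.Write(&raw, binary.LittleEndian, uint32(len(frames)))
	binary.Write(&raw, binary.LittleEndian, fps)
	for _, f := range frames {
		binary.Write(&raw, binary.LittleEndian, f[:])
	}
	var gz bytes.Buffer
	w := gzip.NewWriter(&gz)
	if _, err := w.Write(raw.Bytes()); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return gz.Bytes(), nil
}

// Decode reverses Encode, validating the header against the payload size.
func Decode(blob []byte) (version byte, fps uint32, frames []Frame, err error) {
	r, err := gzip.NewReader(bytes.NewReader(blob))
	if err != nil {
		return 0, 0, nil, err
	}
	raw, err := io.ReadAll(r)
	if err != nil {
		return 0, 0, nil, err
	}
	if len(raw) < 9 {
		return 0, 0, nil, errors.New("short header")
	}
	version = raw[0]
	count := binary.LittleEndian.Uint32(raw[1:5])
	fps = binary.LittleEndian.Uint32(raw[5:9])
	body := raw[9:]
	if uint32(len(body)) != count*landmarksPerFrame*valuesPerLandmark*4 {
		return 0, 0, nil, errors.New("frame payload size mismatch")
	}
	frames = make([]Frame, count)
	for i := range frames {
		for j := 0; j < landmarksPerFrame*valuesPerLandmark; j++ {
			off := (i*landmarksPerFrame*valuesPerLandmark + j) * 4
			frames[i][j] = math.Float32frombits(binary.LittleEndian.Uint32(body[off : off+4]))
		}
	}
	return version, fps, frames, nil
}
```

The same pair would sit behind the `LandmarkCodec` interface, so a later Protobuf or custom-binary swap changes only this file.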
Trust model
The patient device controls the data. Mitigations layered:
- Server computes form_score / ROM / rep_count from landmarks. Client cannot lie about the score, only about input quality.
- `pose_confidence` checks flag implausible inputs (camera at a wall, MediaPipe paused).
- Cadence checks flag missing batches (gaps > expected fps).
- Sessions where any check fails are flagged unverified; the specialist sees the flag.
Server-side rerun from uploaded video would close the remaining gap but adds biometric-video archive (compliance-heavy) — out of scope today, can ship later for clinical-grade verification on flagged sessions.
Aggregation engine
The trust-model section above commits to server-side computation of session metrics from the landmark stream at session_end. The engine that does this is scoped to Class I MDR posture — informational rep count + range-of-motion (ROM) + session-completion signals, not clinical-grade form scoring used to drive treatment decisions. See CLAUDE.md → Medical Device Readiness for the regulatory framing.
What the engine actually has to do at Class I scope:
| Signal | Implementation shape | Notes |
|---|---|---|
| Session completion | Boolean: pose detection ran, landmark stream is non-empty for the expected duration | Trivial — landmark presence + duration check |
| Rep count (estimated) | Per-exercise heuristic: peak detection on the relevant keypoint axis (e.g., hip y-coordinate for squats, wrist y-coordinate for arm raises) | Each exercise tunes which keypoint + axis is the rep signal. Static-hold exercises (planks) report duration-held instead of rep count. |
| Range of motion (informational) | Min/max joint angle measured during the session, computed from MediaPipe 3D keypoints | Pure geometry. Displayed as informational data, not a scored output. |
| Pose confidence | Surface MediaPipe's own confidence values + cadence checks for missing batches | Already in the Trust model section; this just exposes the signal to the specialist. |
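A hysteresis peak detector is one plausible shape for the rep-count heuristic in the table above: count a rep when the tracked axis rises by at least a threshold above the last trough and then falls back by the same amount. `CountReps` and `minDelta` are hypothetical names; the actual per-exercise keypoint/axis/threshold choices are F9 tuning decisions.

```go
package main

// CountReps counts repetitions by peak detection on a single keypoint-axis
// series (e.g. hip y-coordinate for squats, wrist y for arm raises).
// minDelta is the per-exercise excursion threshold; movements smaller than
// it (noise, fidgeting) never register as a rep.
func CountReps(series []float32, minDelta float32) int {
	if len(series) == 0 {
		return 0
	}
	reps := 0
	trough := series[0]
	peak := series[0]
	rising := true
	for _, v := range series[1:] {
		if rising {
			if v > peak {
				peak = v
			} else if peak-v >= minDelta && peak-trough >= minDelta {
				reps++ // completed up-then-down excursion
				rising = false
				trough = v
			}
		} else {
			if v < trough {
				trough = v
			} else if v-trough >= minDelta {
				rising = true // next excursion has started
				peak = v
			}
		}
	}
	return reps
}
```

The hysteresis is what makes the heuristic robust to landmark jitter: sub-threshold wiggles never flip the rising/falling state, so only full excursions count.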
What the engine deliberately does NOT do at Class I:
- Form scoring — calling a number "form_score" implies a clinical judgment about quality of movement. At Class I we don't compute or display this. If/when the platform upgrades to Class IIa with clinical validation, the swap-point interfaces support adding a scoring layer.
- Treatment-decision support — the engine doesn't suggest "increase the resistance" or "reduce reps." Specialists make those calls from the informational data.
- Diagnosis-adjacent measurements — no "asymmetry index," "compensatory pattern detection," or other diagnostic signals.
Why this scope is tractable in-house:
A Go engineer with MediaPipe documentation, the per-exercise reference videos, and a few weekends gets to "rep count + ROM + session-completed" with heuristics. No biomechanics PhD required. The engineering shape is: small per-exercise tuning files (which keypoint, which axis, which threshold), shared geometry helpers (joint angles from 3D keypoints), and a stateless aggregator that runs at session_end. The architecture (signed-session-token, PG aggregates, S3 replay, swap-point interfaces) is designed for the eventual Class IIa upgrade without rewrite — the swap point is the engine itself, not the surrounding pipeline.
What still gets decided at F9 implementation time (deferred, not committed today):
- Per-exercise reference model storage shape (`exercises` table JSON column vs. dedicated `exercise_pose_models` table). The engineer building F9 picks this with real exercise data in front of them.
- Algorithm versioning policy in the data model. The `algorithm_version` column on `pose_session_metrics` is the single durable commitment: once recorded, scores never get retroactively rewritten; new algorithm versions produce new rows or new columns, never overwrite old ones.
- Replay-blob retention durations (lifecycle policy on the S3 bucket; one Terraform change at provisioning time).
- events.Bus failure path mechanics (Telemetry API → Core API HTTP with Cat F service-account; outbox-with-retry pattern matching the foundation 1A.18 notification dispatcher shape).
The AggregateStore swap-point interface guarantees the storage side is bounded; the engine itself is a small heuristic Go package owned by F9, not a scoping gap that blocks design.
Resilience
- Client buffers 1-second batches; falls back to IndexedDB on disconnect, retries on reconnect.
- `POST /v1/sessions/{id}/end` finalizes the session: triggers aggregation, writes the S3 blob, publishes the events.Bus event.
- Server-side timeout (10 min silence) finalizes orphaned sessions (closed browser, dead device) as `incomplete` — partial data preserved.
- Blast radius of any failure: at most ~1 second of the latest batch.
Consent
Two named per-purpose flags using the foundation per-purpose consent flow (1B.9):
| Purpose code | Gates |
|---|---|
| `analytics` | Media events (video lifecycle) |
| `biometric` | Pose ingest |
The previous spec's 0–3 consent ladder is rejected — it does not match the platform's actual per-purpose consent model. Telemetry API rejects ingest if the matching consent flag is absent for the patient. Withdrawal takes effect immediately.
Reads
All reads flow through Core API. No browser direct telemetry-DB or telemetry-S3 access; replay viewer in the Clinic app fetches a signed S3 URL from Core API.
| Reader | Data | Access shape |
|---|---|---|
| Patient (Portal) | Own progress, own session history | RLS by principal_id |
| Specialist (Clinic app) | Their org's patients' sessions, replays | RLS by org + permission |
| Clinic admin (Clinic app) | Cohort dashboards across their org | RLS by org + permission |
| Console superadmin | Anonymised cross-tenant aggregate counters only (processor rule) | AdminPool, classification-filtered |
Swap-point interfaces
Mandatory from day one — these are the seams that make tier 1/2/3 scaling bounded:
| Interface | Today | Possible later swap |
|---|---|---|
| `AggregateStore.WriteRepMetric` / `WriteSessionMetric` | PG INSERT | Dual-write to PG + CH at Tier 3 |
| `AggregateQuery.GetSessionMetrics` / `GetCohortAggregate` | PG read | Per-dashboard CH read at Tier 3 |
| `SessionBuffer.AppendBatch` / `Finalize` | Per-batch S3 PUT under `{session_id}/inflight/` + concatenate-and-finalize at session_end | Kinesis if 10k+ peak concurrent or per-prefix S3 PUT rate forces it |
| `ReplayBlobStore.Get` / `Put` | S3 | Pluggable; unlikely to swap |
| `LandmarkCodec.Encode` / `Decode` | binary float32 + gzip | Protobuf or custom binary |
| `SignedSessionToken.Issue` / `Verify` | HS256 with rotating secret | Ed25519 if needed |
Repository pattern enforces this — handlers and aggregator never touch PG/S3 directly. CI guard (cmd/check-telemetry-bounds) rejects direct imports if it ever drifts.
Scaling roadmap
| Tier | Peak concurrent | What's added | Trigger to advance |
|---|---|---|---|
| 0 — launch | up to ~1 000 | Single Telemetry API instance, PG primary, S3 | — |
| 1 | 1 000 – 10 000 | Telemetry API horizontal (2–4 instances), PG read replica, materialized views per dashboard | Dashboard p95 > 500ms after index tuning |
| 2 | 10 000 – 50 000 | Monthly partitioning on rep_metrics, Kinesis between Telemetry API and S3, Athena/Glue jobs for ad-hoc analytics over S3 blobs | Replica lag, per-prefix S3 PUT rate limits, or batch jobs becoming a daily need |
| 3 | 50 000+ | ClickHouse for cross-tenant analytics surfaces only (clinical aggregates stay in PG) | Cross-tenant dashboard query consistently > 1s after materialized view tuning |
Tiers 1 and 2 are pure ops/config — no code rewrite if the swap-point interfaces are honored from day one.
For the legacy product's growth trajectory (20k+ users today, scaling at typical SaaS rates), Tier 3 is many years out — possibly never.
Daily.co and other media
Daily.co data stays in Daily.co. Media events here cover exercise video playback (Bunny Stream / S3 origin), not appointment video calls. 1:1 live video exercises are not in scope today.
Foundation status
Nothing to build for telemetry in foundation. The four primitives Layer 2 will rely on are in place; one small helper ships alongside 1C:
- Cat F service-accounts (Layer 1.24) — for Telemetry API → Core API callbacks
- `events.Bus` (1C.3) — for the aggregation event channel
- Per-purpose consent ledger (1B.9) — for `analytics` + `biometric` gating
- Data classification framework (1A, P39) — extends to PG aggregate columns + S3 blob class
- Signed-token pattern (HS256 helper, small new addition when 1C ships)
Nothing in services/telemetry/ exists today. When Layer 2 telemetry work begins, this doc + api.md + media-events.md are the locked design.