Telemetry — Patient Engagement & Pose Data Pipeline
Layer 2 feature, post-foundation. Decisions in decisions.md → Why telemetry is PG + S3, not ClickHouse. Foundation primitives this stack relies on (per-purpose consent,
events.Bus, Cat F service-accounts, data classification) are already in place — nothing more ships from foundation. Until Layer 2 begins, no Telemetry API exists; this doc describes the shape it will take when it does.
Scope
Telemetry exists for two concrete product needs:
- Exercise video engagement — what patients watch, when they pause/seek, completion rates, buffering experience.
- Pose-detection data — MediaPipe landmark frames captured during pose-tracked exercises, server-side computed form scores, full-session replay for specialist review.
Everything else that earlier specs called "telemetry" is out of scope because foundation primitives already cover it:
| Concern | Lives in |
|---|---|
| Compliance audit (who did what, GDPR/MDR forensic trail) | audit_log (P10, partitioned monthly per P41) — not telemetry |
| Usage-based billing data | usage_records / usage_quotas / usage_summaries (1C.7) |
| AI provenance | audit_ai_provenance sibling to audit_log |
| Security signals | mostly audit_log; SIEM-shaped signals out of scope |
| App observability (latency, traces) | OTel → Datadog/Grafana, not bespoke |
| Cross-tenant analytics | none — readers are all clinic-scoped (specialist, patient, clinic admin) |
The platform's only telemetry-shaped workload is high-volume time-series ingest of pose landmarks and video-engagement events, scoped to a single clinic per session, read by the same clinic's specialists or the patient themselves.
Architecture
Patient Portal ─── signed-session-token ───► Telemetry API ──┐
(browser) binary float32, gzip │
├─► S3 per-batch PUT (in-flight buffer)
│ {org_id}/{session_id}/inflight/{batch_seq}.bin.gz
│ (durable from first batch arrival)
│
├─► Server-side aggregation
│ (form_score, ROM, rep_count from
│ landmarks at session_end)
│
└─► events.Bus event ─► Core API subscriber
│
└─► Postgres aggregates
(RLS, audit, classification)
At session_end:
─► S3: finalized replay blob
s3://restartix-telemetry/{org_id}/{session_id}.bin.gz
(concatenated from in-flight parts;
in-flight prefix deleted)
Specialist ─┐
Patient (own) ├─► Core API ──► Postgres aggregates (dashboards)
Clinic admin ─┘ S3 replay blob fetch (replay viewer)
Console superadmin ─► Core API ──► Postgres aggregates (anonymised cross-tenant counters only)

Key design choices (rationale in decisions.md):
- Separate Go service (`services/telemetry/`) — hard ingest isolation from Core API's transactional pool. Cat F service-account principal when calling back to Core API.
- Postgres + S3, not ClickHouse — at the workload's actual shape (per-rep aggregates queryable by clinic-scoped readers + per-session replay blobs), PG carries to 50k+ peak concurrent users with monthly partitioning + materialized views + a read replica. ClickHouse is the Tier 3 escape hatch for cross-tenant analytical workloads we don't have today.
- Server-side aggregation — Telemetry API computes form_score / ROM / rep_count from landmarks at session_end and publishes via `events.Bus`. Client-side aggregation rejected because the patient controls the value.
- Signed session token on the hot path — issued by Core API at exercise-session start (claims: `principal_id`, `org_id`, `exercise_session_id`, `exp`). Verified by signature only; no Clerk JWT verify per pose-frame batch.
- No pseudonymization — readers are all clinic-scoped, so `principal_id` + `org_id` are stored plain. Pseudonymization existed to make cross-tenant aggregates safe; we don't have those readers.
Endpoints
Three typed endpoints, narrow scope:
| Endpoint | Purpose | Auth |
|---|---|---|
| `POST /v1/pose/frames` | Pose batch ingest (1-sec batches of MediaPipe landmarks) | Signed session token |
| `POST /v1/media/events` | Video lifecycle (play/pause/seek/heartbeat/buffering/milestone/end) | Signed session token |
| `POST /v1/sessions/{id}/end` | Session finalizer; triggers server-side aggregation + S3 blob finalize | Signed session token |
No generic /analytics/track or /errors/report. App-internal analytics (automation execution counts, etc.) are not telemetry — they're domain events on events.Bus or rows in domain tables. Errors → off-the-shelf (Sentry-style) when/if needed.
Full request/response shapes in api.md. Event schemas for video lifecycle in media-events.md.
Storage
Postgres (clinical aggregates)
Lives in the main RDS cluster, RLS-scoped, audit-logged, classified. Written by the Core API events.Bus subscriber, never directly by Telemetry API.
| Table | Cardinality | Partition |
|---|---|---|
| `pose_session_metrics` | 1/session | None (state-shaped) |
| `pose_rep_metrics` | ~100/session | Range-partitioned monthly (P41 — event-shaped) |
| `media_session_metrics` | 1/session | None (state-shaped) |
| `media_buffering_events` | ~5/session | Range-partitioned monthly (P41) |
Plus updates to existing patient_exercise_logs (video_watch_percentage, pose_accuracy_score, actual_sets, actual_reps).
S3 (replay blobs)
Two S3 paths per session, one canonical:
In-flight (during the session): each 1-second batch lands as a small object under `s3://restartix-telemetry/{org_id}/{session_id}/inflight/{batch_seq}.bin.gz`.
Per-batch PUT (not S3 multipart) — multipart's 5 MB minimum part size is incompatible with ~1–2 KB compressed batches. Each batch is durable as soon as Telemetry API confirms the PUT; this is what the "blast radius of 1 second of latest batch" claim in the resilience section rests on.
Finalized (at session_end): Telemetry API streams the in-flight parts into a single canonical blob and deletes the in-flight prefix: `s3://restartix-telemetry/{org_id}/{session_id}.bin.gz`.
Binary format: [1-byte version][4-byte frame_count][4-byte fps][N × 33 × 4 × float32 landmarks][gzip]. ~3 MB per 30-min session at 10fps.
Lifecycle (canonical blob): standard → IA at 90 days → Glacier at 1 year → expire at retention horizon. In-flight prefix has a 24-hour expiry as a safety net for orphaned sessions where the finalize step never ran.
Replay = canonical-blob fetch via Core API (signed S3 URL). No queryable replay store.
Cost shape at launch: ~1800 batches/session × ~700 sessions/day at first-paying-clinic scale = ~1.3M PUTs/day. At S3 PUT pricing in eu-central-1 (~$0.0054 per 1,000 requests) that's roughly $7/day, ~$200/month — a real number, but small compared to the rest of the stack. The SessionBuffer swap-point interface preserves the option to move the streaming layer to Kinesis at Tier 2 (10k+ peak concurrent) when per-batch PUT volume crosses the per-prefix S3 rate-limit horizon.
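One property of this layout is worth making explicit: because each in-flight part is an independent gzip member, the finalize step can byte-concatenate the parts without recompressing, since a concatenation of gzip streams is itself a valid gzip stream (RFC 1952 multistream, which Go's `gzip.Reader` decodes by default). A sketch with an in-memory stand-in for S3; `memBuffer` and the method shapes are hypothetical, only the `SessionBuffer` seam name comes from the spec.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"io"
	"sort"
)

// SessionBuffer is the swap-point seam from the spec; the real implementation
// PUTs each part under {org_id}/{session_id}/inflight/{batch_seq}.bin.gz.
type SessionBuffer interface {
	AppendBatch(orgID, sessionID string, batchSeq int, gzBatch []byte) error
	Finalize(orgID, sessionID string) ([]byte, error) // canonical blob bytes
}

// memBuffer is an in-memory stand-in for illustration and testing.
type memBuffer struct {
	parts map[string]map[int][]byte
}

func newMemBuffer() *memBuffer { return &memBuffer{parts: map[string]map[int][]byte{}} }

func bufKey(orgID, sessionID string) string { return orgID + "/" + sessionID }

func (b *memBuffer) AppendBatch(orgID, sessionID string, seq int, gz []byte) error {
	k := bufKey(orgID, sessionID)
	if b.parts[k] == nil {
		b.parts[k] = map[int][]byte{}
	}
	b.parts[k][seq] = gz // durable as soon as the PUT (here: map write) returns
	return nil
}

// Finalize concatenates the gzip members in batch order — no recompression —
// then drops the in-flight parts, mirroring the S3 concatenate-and-finalize.
func (b *memBuffer) Finalize(orgID, sessionID string) ([]byte, error) {
	k := bufKey(orgID, sessionID)
	seqs := make([]int, 0, len(b.parts[k]))
	for s := range b.parts[k] {
		seqs = append(seqs, s)
	}
	sort.Ints(seqs)
	var blob bytes.Buffer
	for _, s := range seqs {
		blob.Write(b.parts[k][s])
	}
	delete(b.parts, k) // "delete the in-flight prefix"
	return blob.Bytes(), nil
}

func gzipBytes(p []byte) []byte {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Write(p)
	w.Close()
	return buf.Bytes()
}

func gunzipAll(gz []byte) ([]byte, error) {
	r, err := gzip.NewReader(bytes.NewReader(gz))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r) // reads across all concatenated gzip members
}
```

A Kinesis-backed implementation at Tier 2 would satisfy the same two methods, which is what makes the swap bounded.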
Egress bytes
Stay in usage_records (1C.7). Do not duplicate in telemetry.
Pose payload encoding
MediaPipe runs entirely client-side (WebAssembly + WebGL/WebGPU) and outputs 33 landmarks (x, y, z, visibility) per frame. We transmit the landmarks, not the video — at 10fps a 30-min session is ~3 MB encoded, vs. tens of GB of video.
Wire format: binary float32 (33 × 4 × 4 = 528 bytes/frame), 1-second batches, gzipped per batch. The repo's exercise_recording_*.json test sample shows raw MediaPipe output at ~5910 bytes/frame as JSON — ~11× wasteful. JSON is forbidden on the wire; binary float32 is the canonical codec, behind the LandmarkCodec swap-point interface.
Precision: keypoints are normalized 0–1 floats; float32 gives 0.13-pixel resolution on 1280×720 video — well below clinical signal threshold.
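A minimal codec for the layout above might look like the following. Endianness is not pinned by the spec, so little-endian here is an assumption, and the `Encode`/`Decode` function shapes are illustrative (the spec commits only to the `LandmarkCodec.Encode / Decode` seam).

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/binary"
	"errors"
	"io"
	"math"
)

const (
	landmarksPerFrame = 33 // MediaPipe pose landmarks
	valuesPerLandmark = 4  // x, y, z, visibility
)

// Frame is one pose frame: 33 landmarks × (x, y, z, visibility) = 528 bytes.
type Frame [landmarksPerFrame * valuesPerLandmark]float32

// Encode packs frames into the spec's layout
// [1-byte version][4-byte frame_count][4-byte fps][frames…] and gzips it.
func Encode(version byte, fps uint32, frames []Frame) ([]byte, error) {
	var raw bytes.Buffer
	raw.WriteByte(version)
	binary.Write(&raw, binary.LittleEndian, uint32(len(frames)))
	binary.Write(&raw, binary.LittleEndian, fps)
	for _, f := range frames {
		binary.Write(&raw, binary.LittleEndian, f[:])
	}
	var gz bytes.Buffer
	w := gzip.NewWriter(&gz)
	if _, err := w.Write(raw.Bytes()); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return gz.Bytes(), nil
}

// Decode reverses Encode, validating the header against the payload size.
func Decode(blob []byte) (version byte, fps uint32, frames []Frame, err error) {
	r, err := gzip.NewReader(bytes.NewReader(blob))
	if err != nil {
		return 0, 0, nil, err
	}
	raw, err := io.ReadAll(r)
	if err != nil {
		return 0, 0, nil, err
	}
	if len(raw) < 9 {
		return 0, 0, nil, errors.New("short header")
	}
	version = raw[0]
	count := binary.LittleEndian.Uint32(raw[1:5])
	fps = binary.LittleEndian.Uint32(raw[5:9])
	body := raw[9:]
	if uint32(len(body)) != count*landmarksPerFrame*valuesPerLandmark*4 {
		return 0, 0, nil, errors.New("frame payload size mismatch")
	}
	frames = make([]Frame, count)
	for i := range frames {
		for j := 0; j < landmarksPerFrame*valuesPerLandmark; j++ {
			off := (i*landmarksPerFrame*valuesPerLandmark + j) * 4
			frames[i][j] = math.Float32frombits(binary.LittleEndian.Uint32(body[off : off+4]))
		}
	}
	return version, fps, frames, nil
}
```

The same pair would sit behind the `LandmarkCodec` interface, so a later Protobuf or custom-binary swap changes only this file.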
Trust model
The patient device controls the data. Mitigations layered:
- Server computes form_score / ROM / rep_count from landmarks. Client cannot lie about the score, only about input quality.
- `pose_confidence` checks flag implausible inputs (camera at a wall, MediaPipe paused).
- Cadence checks flag missing batches (gaps > expected fps).
- Sessions where any check fails are flagged unverified; the specialist sees the flag.
Server-side rerun from uploaded video would close the remaining gap but adds biometric-video archive (compliance-heavy) — out of scope today, can ship later for clinical-grade verification on flagged sessions.
Aggregation engine
The trust-model section above commits to server-side computation of session metrics from the landmark stream at session_end. The engine that does this is scoped to Class I MDR posture — informational rep count + range-of-motion (ROM) + session-completion signals, not clinical-grade form scoring used to drive treatment decisions. See CLAUDE.md → Medical Device Readiness for the regulatory framing.
What the engine actually has to do at Class I scope:
| Signal | Implementation shape | Notes |
|---|---|---|
| Session completion | Boolean: pose detection ran, landmark stream is non-empty for the expected duration | Trivial — landmark presence + duration check |
| Rep count (estimated) | Per-exercise heuristic: peak detection on the relevant keypoint axis (e.g., hip y-coordinate for squats, wrist y-coordinate for arm raises) | Each exercise tunes which keypoint + axis is the rep signal. Static-hold exercises (planks) report duration-held instead of rep count. |
| Range of motion (informational) | Min/max joint angle measured during the session, computed from MediaPipe 3D keypoints | Pure geometry. Displayed as informational data, not a scored output. |
| Pose confidence | Surface MediaPipe's own confidence values + cadence checks for missing batches | Already in the Trust model section; this just exposes the signal to the specialist. |
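A hysteresis peak detector is one plausible shape for the rep-count heuristic in the table above: count a rep when the tracked axis rises by at least a threshold above the last trough and then falls back by the same amount. `CountReps` and `minDelta` are hypothetical names; the actual per-exercise keypoint/axis/threshold choices are F9 tuning decisions.

```go
package main

// CountReps counts repetitions by peak detection on a single keypoint-axis
// series (e.g. hip y-coordinate for squats, wrist y for arm raises).
// minDelta is the per-exercise excursion threshold; movements smaller than
// it (noise, fidgeting) never register as a rep.
func CountReps(series []float32, minDelta float32) int {
	if len(series) == 0 {
		return 0
	}
	reps := 0
	trough := series[0]
	peak := series[0]
	rising := true
	for _, v := range series[1:] {
		if rising {
			if v > peak {
				peak = v
			} else if peak-v >= minDelta && peak-trough >= minDelta {
				reps++ // completed up-then-down excursion
				rising = false
				trough = v
			}
		} else {
			if v < trough {
				trough = v
			} else if v-trough >= minDelta {
				rising = true // next excursion has started
				peak = v
			}
		}
	}
	return reps
}
```

The hysteresis is what makes the heuristic robust to landmark jitter: sub-threshold wiggles never flip the rising/falling state, so only full excursions count.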
What the engine deliberately does NOT do at Class I:
- Form scoring — calling a number "form_score" implies a clinical judgment about quality of movement. At Class I we don't compute or display this. If/when the platform upgrades to Class IIa with clinical validation, the swap-point interfaces support adding a scoring layer.
- Treatment-decision support — the engine doesn't suggest "increase the resistance" or "reduce reps." Specialists make those calls from the informational data.
- Diagnosis-adjacent measurements — no "asymmetry index," "compensatory pattern detection," or other diagnostic signals.
Why this scope is tractable in-house:
A Go engineer with MediaPipe documentation, the per-exercise reference videos, and a few weekends gets to "rep count + ROM + session-completed" with heuristics. No biomechanics PhD required. The engineering shape is: small per-exercise tuning files (which keypoint, which axis, which threshold), shared geometry helpers (joint angles from 3D keypoints), and a stateless aggregator that runs at session_end. The architecture (signed-session-token, PG aggregates, S3 replay, swap-point interfaces) is designed for the eventual Class IIa upgrade without rewrite — the swap point is the engine itself, not the surrounding pipeline.
What still gets decided at F9 implementation time (deferred, not committed today):
- Per-exercise reference model storage shape (`exercises` table JSON column vs. dedicated `exercise_pose_models` table). The engineer building F9 picks this with real exercise data in front of them.
- Algorithm versioning policy in the data model. The `algorithm_version` column on `pose_session_metrics` is the single durable commitment: once recorded, scores never get retroactively rewritten; new algorithm versions produce new rows or new columns, never overwrite old ones.
- Replay-blob retention durations (lifecycle policy on the S3 bucket; one Terraform change at provisioning time).
- events.Bus failure path mechanics (Telemetry API → Core API HTTP with Cat F service-account; outbox-with-retry pattern matching the foundation 1A.18 notification dispatcher shape).
The AggregateStore swap-point interface guarantees the storage side is bounded; the engine itself is a small heuristic Go package owned by F9, not a scoping gap that blocks design.
Resilience
- Client buffers 1-second batches; falls back to IndexedDB on disconnect, retries on reconnect.
- `POST /v1/sessions/{id}/end` finalizes the session: triggers aggregation, writes the S3 blob, publishes the events.Bus event.
- Server-side timeout (10 min silence) finalizes orphaned sessions (closed browser, dead device) as `incomplete` — partial data preserved.
- Blast radius of any failure: at most ~1 second of the latest batch.
Consent
Two named per-purpose flags using the foundation per-purpose consent flow (1B.9):
| Purpose code | Gates |
|---|---|
| `analytics` | Media events (video lifecycle) |
| `biometric` | Pose ingest |
The previous spec's 0–3 consent ladder is rejected — it does not match the platform's actual per-purpose consent model. Telemetry API rejects ingest if the matching consent flag is absent for the patient. Withdrawal takes effect immediately.
Reads
All reads flow through Core API. No browser direct telemetry-DB or telemetry-S3 access; replay viewer in the Clinic app fetches a signed S3 URL from Core API.
| Reader | Data | Access shape |
|---|---|---|
| Patient (Portal) | Own progress, own session history | RLS by principal_id |
| Specialist (Clinic app) | Their org's patients' sessions, replays | RLS by org + permission |
| Clinic admin (Clinic app) | Cohort dashboards across their org | RLS by org + permission |
| Console superadmin | Anonymised cross-tenant aggregate counters only (processor rule) | AdminPool, classification-filtered |
Swap-point interfaces
Mandatory from day one — these are the seams that make tier 1/2/3 scaling bounded:
| Interface | Today | Possible later swap |
|---|---|---|
| `AggregateStore.WriteRepMetric` / `WriteSessionMetric` | PG INSERT | Dual-write to PG + CH at Tier 3 |
| `AggregateQuery.GetSessionMetrics` / `GetCohortAggregate` | PG read | Per-dashboard CH read at Tier 3 |
| `SessionBuffer.AppendBatch` / `Finalize` | Per-batch S3 PUT under `{session_id}/inflight/` + concatenate-and-finalize at session_end | Kinesis if 10k+ peak concurrent or per-prefix S3 PUT rate forces it |
| `ReplayBlobStore.Get` / `Put` | S3 | Pluggable; unlikely to swap |
| `LandmarkCodec.Encode` / `Decode` | binary float32 + gzip | Protobuf or custom binary |
| `SignedSessionToken.Issue` / `Verify` | HS256 with rotating secret | Ed25519 if needed |
Repository pattern enforces this — handlers and aggregator never touch PG/S3 directly. CI guard (cmd/check-telemetry-bounds) rejects direct imports if it ever drifts.
Scaling roadmap
| Tier | Peak concurrent | What's added | Trigger to advance |
|---|---|---|---|
| 0 — launch | up to ~1 000 | Single Telemetry API instance, PG primary, S3 | — |
| 1 | 1 000 – 10 000 | Telemetry API horizontal (2–4 instances), PG read replica, materialized views per dashboard | Dashboard p95 > 500ms after index tuning |
| 2 | 10 000 – 50 000 | Monthly partitioning on rep_metrics, Kinesis between Telemetry API and S3, Athena/Glue jobs for ad-hoc analytics over S3 blobs | Replica lag, per-prefix S3 PUT rate limits, or batch jobs becoming a daily need |
| 3 | 50 000+ | ClickHouse for cross-tenant analytics surfaces only (clinical aggregates stay in PG) | Cross-tenant dashboard query consistently > 1s after materialized view tuning |
Tiers 1 and 2 are pure ops/config — no code rewrite if the swap-point interfaces are honored from day one.
For the legacy product's growth trajectory (20k+ users today, scaling at typical SaaS rates), Tier 3 is many years out — possibly never.
Daily.co and other media
Daily.co data stays in Daily.co. Media events here cover exercise video playback (Bunny Stream / S3 origin), not appointment video calls. 1:1 live video exercises are not in scope today.
Foundation status
Nothing to build for telemetry in foundation. The four primitives Layer 2 will rely on are in place; one small helper ships alongside 1C:
- Cat F service-accounts (Layer 1.24) — for Telemetry API → Core API callbacks
- `events.Bus` (1C.3) — for the aggregation event channel
- Per-purpose consent ledger (1B.9) — for `analytics` + `biometric` gating
- Data classification framework (1A, P39) — extends to PG aggregate columns + S3 blob class
- Signed-token pattern (HS256 helper, small new addition when 1C ships)
Nothing in services/telemetry/ exists today. When Layer 2 telemetry work begins, this doc + api.md + media-events.md are the locked design.