
Exercise Video Composition

The compositional model for how exercise videos are built from filming primitives and served to patients. Implements P56.

Older spec superseded

This document supersedes video-upload.md's "one video per exercise uploaded via Core API" model. The platform now composes multiple videos per exercise — one MP4 per (exercise, recipe, language) tuple — from raw filming primitives. video-upload.md remains as a historical reference until the F9.1 schema audit is done.

Two kinds of exercise, two flows

Every exercise has a kind that captures its dosing model and how its video gets produced:

  • reps_based — primitives in S3 (intro / pauza / outro / 5-rep blocks per side / per-side VO masters); the composer renders one MP4 per (exercise, recipe, language) tuple. Many cached renders accumulate over time. This document focuses on this flow.
  • duration_based — one pre-baked MP4 sitting on Bunny, used as-is. No primitives, no composer, no recipe. One Bunny video per (exercise, language). Imported from the old platform initially; new native duration_based exercises can be added later by uploading their pre-baked MP4 directly. See reference/exercise-content-pipeline.md for the import workflow.

Both kinds share the same exercises + exercise_renders data model. The difference is whether exercise_renders accumulates many rows for the exercise (reps_based) or has exactly one row per language pointing at the pre-baked MP4 (duration_based).

What "composition" means here (reps_based only)

A patient's daily rehab session is a playlist of per-exercise videos, one MP4 per exercise. For a reps_based exercise prescribed at a specific dose (e.g. "Lumbar Detensioning, 2 sets × 10 reps, alternating sides"), the platform produces one MP4 that contains:

intro → set 1 → pauza → set 2 → outro

That MP4 is not pre-existing — it's composed on demand from a bundle of raw filming primitives the filming/audio team uploads to our S3 bucket. Composition is cached at (exercise, recipe_hash, language) so multiple treatment plans prescribing the same exercise at the same dose share the same render.

The actual composition engine lives in services/exercise-composer/; the algorithm was iterated in experiments/exercise-composer/. This document is the spec the engine implements.

Two views of the catalog

  • Console / Clinic app — sees the raw exercise: the 5-rep video as the main media (reps_based) or the pre-baked MP4 (duration_based), plus intro/pauza/outro assets, language tabs, instructions, and metadata. Used for creating treatment plans / guided sessions / browsing the library. Data: exercises row + S3 asset bundle reference (reps_based) or bunny_video_id (duration_based).
  • Portal (patient) — sees one playable Bunny video per exercise, the canonical "catalog preview". For reps_based: a pre-baked render of intro + 5 reps left + pauza + 5 reps right + outro (or one side if unilateral). For duration_based: the pre-baked MP4 itself. Used when a patient does a random exercise outside a prescribed session. Data: exercises.catalog_render_id → one row in exercise_renders.

This avoids the patient-catalog explosion problem: even if lumbar-detensioning has been rendered at 2×5, 2×10, and 3×5 for various treatment plans, the patient catalog shows one entry with the canonical preview. Other renders only surface when a patient opens a prescribed session that uses them.

The catalog preview recipe for reps_based is derived automatically from manifest.sides:

sides: ["left", "right"]  →  preview = [{left, 5}, {right, 5}]
sides: ["left"]           →  preview = [{left, 5}]

No need to store the preview recipe on the exercise; it's a convention enforced by Core API at publish time.
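
The derivation is mechanical. A minimal sketch (illustrative Python; the real enforcement lives in Core API at publish time, and the function name here is hypothetical):

```python
# Sketch of the publish-time convention: 5 reps per listed side, in
# manifest order. Illustrative only — not the Core API implementation.
def catalog_preview_recipe(sides):
    return [{"side": side, "reps": 5} for side in sides]
```

So `catalog_preview_recipe(["left", "right"])` yields `[{"side": "left", "reps": 5}, {"side": "right", "reps": 5}]`, matching the convention above.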

The asset bundle

Per exercise, the filming/audio team delivers a bundle organised in S3 under s3://restartix-exercise-assets-{env}/{exercise-slug}/:

{exercise-slug}/
├── manifest.json                     # technical composition contract
├── intro-video-{1,2,3}.mp4           # silent/breathing-only framing, 3 variants
├── pauza-video-{1,2,3}.mp4           # inter-set rest, 3 variants
├── outro-video-{1,2,3}.mp4           # wrap-up, 3 variants
├── rep-left-video-{1,2,3}.mp4        # 5-rep block, left side, breathing only
├── rep-right-video-{1,2,3}.mp4       # 5-rep block, right side, breathing only
└── audio/{lang}/
    ├── intro-vo-{1,2,3}.mp3          # coached VO for intro, 3 variants
    ├── pauza-vo-{1,2,3}.mp3          # coached VO for pauza, 3 variants
    ├── outro-vo-{1,2,3}.mp3          # coached VO for outro, 3 variants
    ├── rep-left-vo-{1,2,3}.mp3       # 20-count coached VO matching left rep tempo
    └── rep-right-vo-{1,2,3}.mp3      # 20-count coached VO matching right rep tempo

Total ~30 files per exercise per language. ~280 MB on disk for an exercise like lumbar detensioning.

The manifest

manifest.json is the technical contract the composer reads at job start:

```json
{
  "exercise": "lumbar-detensioning",
  "reps_per_video_block": 5,
  "counts_per_audio_master": 20,
  "sides": ["left", "right"],
  "languages": ["ro"]
}
```
  • reps_per_video_block — how many reps are in each rep video file (always 5).
  • counts_per_audio_master — how many counts the VO master covers (always 20). Sets the maximum reps per set.
  • sides — which sides exist. Usually ["left", "right"]; bilateral exercises that don't switch sides would be a future variant.
  • languages — which languages have audio recorded. Adding a language = adding audio/{new_lang}/ + listing it here.

Variant counts are auto-discovered from the filesystem — the manifest does not need to declare them.
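
Auto-discovery can be as simple as scanning filenames for the trailing variant index. A sketch (the function and regex are hypothetical, not the composer's actual code):

```python
import re
from collections import defaultdict

# Illustrative variant discovery: "intro-video-2.mp4" → slot
# "intro-video", variant 2; the highest index seen per slot is its count.
VARIANT_RE = re.compile(r"^(?P<slot>[a-z-]+)-(?P<n>\d+)\.(?:mp4|mp3)$")

def discover_variants(filenames):
    counts = defaultdict(int)
    for name in filenames:
        m = VARIANT_RE.match(name)
        if m:
            counts[m["slot"]] = max(counts[m["slot"]], int(m["n"]))
    return dict(counts)
```

Files that don't fit the slot naming (e.g. manifest.json) are simply ignored, so the bundle needs no variant bookkeeping of its own.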

Tier-1 only. This manifest carries technical composition fields. Clinical/UI metadata (display name, categories, body regions, contraindications, difficulty, equipment) lives in the platform DB (the exercises table in F9.1), not in the manifest. The manifest is the contract between filming and composer; the DB is the contract between platform and clinicians/patients.

The production constraints

Three constraints on the filming/audio team that the entire model rests on:

1. Intro / pauza / outro videos have no lip-synced dialogue

The therapist on camera in intro/pauza/outro segments does not speak to camera with scripted dialogue. They gesture, demonstrate setup positions, transition into the rep starting position. Mouth movements are not tied to specific words. Reason: the VO is recorded separately per language and laid over at render time. Lip-sync mismatch (Romanian mouth shapes + English audio, or vice versa) would be jarring.

The therapist can be conversational and warm; they just can't be reading a specific script that locks the audio to their mouth.

2. Rep videos are silent (breathing only)

The therapist on camera demonstrates reps without speaking. Natural breathing sounds are fine and clinically useful (they're mixed under the VO in the final). The voice the patient hears during reps is the booth-recorded VO for that side.

3. Rep tempo is locked per exercise per side; VO tempo matches

The on-camera therapist reps at a consistent tempo across all 3 video variants of a given side, and the audio booth records the side's VO at the same tempo. The composer trusts this alignment — it doesn't time-stretch or align counts to rep boundaries dynamically. If the filming team's tempo drifts, the count word will land slightly off the rep boundary in the final.

In practice this works to within ~0.5% drift over 20 reps (~0.6s total), which is imperceptible.
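
The arithmetic behind that budget, with an assumed (not spec-measured) rep duration:

```python
# Back-of-envelope drift budget. ~6 s per rep is an illustrative
# assumption; only the ~0.5% figure comes from the spec.
seconds_per_rep = 6.0
reps = 20
drift_fraction = 0.005                                # ~0.5% camera/booth tempo mismatch
counting_window = seconds_per_rep * reps              # ~120 s of continuous counting
accumulated_drift = counting_window * drift_fraction  # ~0.6 s at the final count
```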

The recipe (one render request)

The composer's input contract:

```json
{
  "exercise": "lumbar-detensioning",
  "language": "ro",
  "sets": [
    { "side": "left", "reps": 10 },
    { "side": "right", "reps": 10 }
  ],
  "seed": 42
}
```
  • exercise — slug matching an S3 directory under the assets bucket.
  • language — which audio/{lang}/ directory to read VO from. Defaults to manifest.languages[0].
  • sets — ordered list. Each set is one side, one rep count. Reps must be a multiple of reps_per_video_block (5), capped at counts_per_audio_master (20). Valid per-set reps: {5, 10, 15, 20}.
  • seed — optional. When set, variant picks are deterministic — same seed produces identical renders. Used in production for cache-key stability and in testing for reproducibility.
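
These constraints translate directly into validation. An illustrative Python sketch (the real checks live in the Go composer; field names follow the manifest and recipe contracts above):

```python
# Sketch of recipe validation against the manifest. Not the composer's
# actual code — an illustration of the rules stated above.
def validate_recipe(recipe, manifest):
    block = manifest["reps_per_video_block"]    # 5
    cap = manifest["counts_per_audio_master"]   # 20
    if recipe["language"] not in manifest["languages"]:
        raise ValueError(f"no audio recorded for {recipe['language']!r}")
    if not recipe["sets"]:
        raise ValueError("recipe has no sets")
    for s in recipe["sets"]:
        if s["side"] not in manifest["sides"]:
            raise ValueError(f"unknown side {s['side']!r}")
        if s["reps"] % block or not block <= s["reps"] <= cap:
            raise ValueError(f"reps must be a multiple of {block} in [{block}, {cap}]")
```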

The composer returns:

```json
{
  "bunny_video_id": "abc123-...",
  "playback_hls_url": "https://vz-{token}.b-cdn.net/abc123-.../playlist.m3u8",
  "duration_seconds": 191.34,
  "picks": {
    "intro_video": 2, "pauza_video": 1, "outro_video": 3,
    "intro_vo": 1, "pauza_vo": 2, "outro_vo": 1,
    "sets": [
      { "video_variant": 1, "vo_variant": 3 },
      { "video_variant": 2, "vo_variant": 2 }
    ]
  }
}
```

bunny_video_id is the persistable canonical value (store this). playback_hls_url is convenience — reconstruct at serve-time from current CDN hostname + video id, so future hostname changes don't invalidate stored URLs.
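
Serve-time reconstruction is then a one-liner (sketch; the function name is illustrative, and the path shape follows the example response above):

```python
# Rebuild the HLS URL from the current pull-zone hostname + stored id.
# Only bunny_video_id is persisted — the hostname can change freely.
def playback_url(cdn_host, bunny_video_id):
    return f"https://{cdn_host}/{bunny_video_id}/playlist.m3u8"
```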

The variant model

Every slot has 3 variants delivered by filming/audio teams. Each render picks one variant per slot:

  • intro (video + VO) — one pair picked per render; used once per render.
  • pauza (video + VO) — one pair picked per render; reused across every pauza in the same render.
  • outro (video + VO) — one pair picked per render; used once per render.
  • rep-{side} (video + VO) — one pair picked per set; each set picks independently.

With ~9 slot picks at 3 options each, the combinatorial space is on the order of 3^9 ≈ 20k unique renders per exercise. We bake one per (exercise, recipe, language) — most of the space is unused, but different recipes of the same exercise see different combinations naturally.

Picks are random by default, deterministic with a seed. Seeded picks are used in production so the cache key — seed = hash(exercise, recipe_hash, language) — produces the same render every time it's referenced.
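
A sketch of the seed derivation and deterministic picking (the hash construction and function names are illustrative, not the production code):

```python
import hashlib
import random

# Derive a stable seed from the cache key, then let a seeded RNG make
# every slot pick. Same cache key → same seed → identical render.
def render_seed(exercise, recipe_hash, language):
    digest = hashlib.sha256(f"{exercise}:{recipe_hash}:{language}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def pick_variants(seed, n_sets, n_variants=3):
    rng = random.Random(seed)   # deterministic sequence for a given seed
    picks = {slot: rng.randint(1, n_variants)
             for slot in ("intro_video", "pauza_video", "outro_video",
                          "intro_vo", "pauza_vo", "outro_vo")}
    picks["sets"] = [{"video_variant": rng.randint(1, n_variants),
                      "vo_variant": rng.randint(1, n_variants)}
                     for _ in range(n_sets)]
    return picks
```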

The bake pipeline

When the composer receives a recipe, it does this:

  1. Download the asset bundle from S3 to a per-job working directory (cleaned afterwards).
  2. Load manifest + scan variants on disk.
  3. Validate the prescription — reps must be multiples of 5 in [5, 20], sides must exist in the manifest, language must be declared.
  4. Pick variants for every slot (seeded or random).
  5. Bake each unique set as a normalized MP4 segment: stream-loop the rep video N/5 times silently, mix the picked side-VO over the top with breathing at 0.35 gain, VO at 1.0 gain. De-dupe by (side, reps, video_variant, vo_variant) — identical sets in the same prescription bake once and reuse.
  6. Bake intro / pauza / outro the same way (video + VO mixed, video duration as master clock).
  7. Concat all segments via ffmpeg's concat demuxer (-c copy, no re-encode — all segments share codec params by construction).
  8. Upload the final MP4 to Bunny Stream.
  9. Return the bunny_video_id + computed playback URL.

All segments are normalized to 1920×1080, 30fps, H.264 yuv420p, AAC 48kHz stereo. Bunny then transcodes the uploaded MP4 to its adaptive bitrate ladder.
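
Step 5 can be pictured as one ffmpeg invocation per unique set. A hedged sketch that builds the command line (the flags are standard ffmpeg, but the composer's actual filter graph and encoder settings may differ):

```python
# Sketch of the per-set bake implied by step 5: -stream_loop N replays
# the 5-rep block N extra times; amix layers the VO over the breathing
# track at the gains from the spec. Illustrative, not the composer code.
def bake_set_cmd(rep_video, vo_audio, reps, out_path,
                 breathing_gain=0.35, vo_gain=1.0, block=5):
    extra_loops = reps // block - 1
    afilter = (f"[0:a]volume={breathing_gain}[breath];"
               f"[1:a]volume={vo_gain}[vo];"
               "[breath][vo]amix=inputs=2:duration=first[aout]")
    return ["ffmpeg", "-y",
            "-stream_loop", str(extra_loops), "-i", rep_video,
            "-i", vo_audio,
            "-filter_complex", afilter,
            "-map", "0:v", "-map", "[aout]",
            "-c:v", "libx264", "-pix_fmt", "yuv420p", "-r", "30",
            "-c:a", "aac", "-ar", "48000", "-ac", "2",
            out_path]
```

A 10-rep set loops the 5-rep block once more (`-stream_loop 1`); the normalized codec parameters are what make the later `-c copy` concat possible.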

Counts reset per set, by construction

The fresh-per-set counting model (each set hears "1, 2, …, N" regardless of its position in the session) falls out of the de-dupe behavior. A "3 sets × 5 reps" prescription bakes one 5-rep set clip — which contains the first 5 counts of the side's 20-count VO master — and concatenates it in three times. A clip reused across set positions can only ever carry the same counts, so every set starts from 1. See Decisions: Why fresh-per-set counting for the clinical and production reasoning.
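
The de-dupe itself is a small piece of bookkeeping. A sketch (names are illustrative):

```python
# Sketch of step 5's de-dupe: identical (side, reps, video_variant,
# vo_variant) sets bake once; the concat order just references the same
# baked clip again. Illustrative, not the composer's actual code.
def plan_set_segments(sets, set_picks):
    baked = {}   # de-dupe key → baked segment id
    order = []   # concat order over baked segment ids
    for s, p in zip(sets, set_picks):
        key = (s["side"], s["reps"], p["video_variant"], p["vo_variant"])
        baked.setdefault(key, f"set-{len(baked)}")
        order.append(baked[key])
    return baked, order
```

A "3 sets × 5 reps, left" prescription with identical picks yields one baked clip referenced three times, and each playback of that clip carries counts 1 through 5.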

The audio mix

Inside each baked set:

  • Rep video's wild track (breathing, ambient room) → volume 0.35
  • Side's VO master (counting + coaching narrative) → volume 1.0

The VO dominates; breathing sits underneath as ambient texture. For lumbar detensioning specifically (a calm floor exercise where breathing pattern is part of the form), keeping the wild track audible is clinically useful — the patient hears the breathing rhythm and matches it.

The gain constants are not in the recipe — they're hard-coded in the composer. Future: expose as per-exercise manifest fields if a strenuous exercise needs different mixing (e.g. mute the wild track entirely when it's grunt + fabric rustle).

Phase 1: shipped

The Core API integration that wraps the composer landed in commit 3d95e38. End-to-end working state:

  • Migration 000022_exercises — exercises + exercise_renders tables, RLS policies (platform-curated; SELECT for any authenticated principal, AdminPool-only writes), permission codes (exercises.read, exercises.manage), lumbar-detensioning seeded.
  • Go domain at services/api/internal/core/domain/exercises/ — model + repository + service + handler + errors. Service.EnsureRender does the cache lookup → composer call → persist loop.
  • Composer HTTP client at services/api/internal/integration/composer/ — bearer-token authenticated.
  • Admin endpoint POST /v1/admin/exercises/{slug}/renders — Console-only via RequirePermission(exercises.manage).
  • Composer service additions — bearer-token middleware on /v1/compose (empty token = anonymous-with-warning for dev), Bunny collection auto-create-and-cache (one collection per slug, idempotent list-or-create).

Phase 2 backlog

Listed in roughly the order they need to land. Most map to specific clinical or operational gates.

Composer surface

  • Async/queue wrapper — the composer is sync today (caller blocks ~5-15s per render). When treatment-plan creation enqueues N renders at once, the wrapper becomes a worker that consumes from a queue (River / pg-boss / SQS — pick at land time) and updates the cache row's status. Patient sees status='ready' row when available.
  • Asset version bump endpoint — the exercises.asset_version column exists; the Console superadmin action that bumps it (after a filming team upload) doesn't. Probably POST /v1/admin/exercises/{slug}/asset-version/bump. Eager re-render of the catalog preview is part of this action.
  • Eager catalog preview re-render — on asset_version bump, immediately re-render the catalog preview recipe so the patient catalog never serves a stale preview. Prescription renders stay lazy (re-bake on next request only).

Patient surface

  • Patient catalog endpoint — GET /v1/portal/exercises returns published exercises with the bunny_video_id resolved through catalog_render_id for the patient's language. Joins through exercise_renders so duration_based and reps_based both surface uniformly.
  • Session-render lookup endpoint — given a (exercise, recipe, language), return the cached bunny_video_id or 404 if not ready. Used by treatment-plan / guided-session render-readiness gating (sessions don't become patient-visible until all their renders are ready).

Console surface

  • Exercise CRUD UI — list / detail / edit pages for the catalog. The metadata fields below (taxonomy, instructions, contraindications) drive what the UI shows.
  • Catalog preview trigger — Console action that triggers the catalog preview render when publishing a draft exercise.
  • duration_based import workflow — manual SQL today (see reference/exercise-content-pipeline.md); promote to a Console superadmin endpoint that takes a bunny_video_id + metadata and creates the rows.

Console scope boundary. Console is the control panel for exercises — it manages the exercises row, triggers composer renders, and views the cache state (exercise_renders). It does NOT upload content. For reps_based exercises, S3 primitives arrive via manual aws s3 sync from the filming team (out of band; see content-pipeline.md → Adding a new exercise). For duration_based exercises, the admin uploads the MP4 to Bunny directly (dashboard or API), then pastes the resulting bunny_video_id into Console — Console does not proxy file uploads. Mediated uploads (Console takes an MP4 and forwards to Bunny or S3) are explicitly out of scope; revisit only if the manual workflow becomes a real friction point.

Clinical metadata (deferred from the F9.1 spec — was always intended to land here)

  • Taxonomy tables (exercise_categories, exercise_body_regions, exercise_equipment)
  • exercise_tags polymorphic junction
  • exercise_instructions (ordered steps, typed: preparation / step / form_cue / breathing / safety)
  • exercise_contraindications (severity: warning / contraindicated)
  • translations JSONB on global rows (P21) — display_name / description per language
  • Difficulty rating (1-5)
  • AI tracking config (model, landmarks, target metrics, calibration) — when telemetry pipeline lands

Compositional improvements

  • Variant chaining for N > block-size — today, a 20-rep set replays the same picked rep video variant. The diagram in P56 alludes to chaining different variants for the second 10 reps; the composer doesn't do that yet, but the asset bundle supports it (3 variants per side).
  • More languages — only Romanian (ro) today. Adding a language is mkdir audio/{lang}/ + uploading 15 VO files per exercise + updating manifest.languages.

Operational

  • Bunny credentials via Cat A providers resolver — env vars today, platform_service_providers row + 5-minute-TTL resolver before production launch. Requires hoisting services/api/internal/core/providers/ to a shared module so the composer service can consume it.
  • Async clinical workflow — once async wrapping is in, the upstream contract becomes: treatment plan creation enqueues renders, blocks (with a progress bar?) until all are ready before showing the plan to the patient. Acceptable for now since render time is bounded and clinician-initiated.

Where this fits