24/7 DVR — Operations
What you’re operating
FrameWorks ships a continuous DVR archive: one DVR artifact per stream session, with no per-artifact lifetime cap. Live viewers seek across a tier-bounded rolling Mist window. Replay viewers navigate the recording sliced into chapter VOD artifacts, which the chapter finalization queue produces from the per-segment ledger (foghorn.dvr_segments) at each boundary close. Chapter playback uses the chapter artifact’s normal VOD playback path; there is no dvr+<chapter_id> token anymore.
If you’re triaging a DVR issue, the engineering reference is docs/architecture/dvr-continuous-archive.md. This page is the operator-facing surface.
Cluster ceilings — env vars
The Foghorn process running in each cluster reads two env vars that cap the DVR window and manifest size for tenants on that cluster:
| Env var | Meaning | Default |
|---|---|---|
| DVR_CLUSTER_MAX_WINDOW_SECONDS | Maximum live DVR window any tenant on this cluster can request, in seconds | 0 (no cluster cap; tier ceilings stand) |
| DVR_CLUSTER_MAX_ENTRIES | Maximum HLS playlist entries (rolling-window manifest size) | 0 (no cap) |
Set these in the cluster’s gitops env file. Topology assumption: one Foghorn process per cluster, so process env IS the per-cluster surface.
Enterprise tenants have dvr_allow_cluster_extension=true in their tier entitlements — for them the cluster setting can raise their max DVR window up to the platform ceiling (72h). For non-enterprise tenants the cluster setting only acts as a ceiling.
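As a rough illustration of how the cluster window ceiling and the enterprise extension flag combine, the sketch below computes an effective maximum window. The function name, its parameters, and the exact precedence are assumptions made for illustration; the actual policy code may resolve these differently.

```python
# Platform ceiling (72h) as stated above.
PLATFORM_CEILING_SECONDS = 72 * 3600


def effective_max_window(tier_max_seconds: int,
                         cluster_max_seconds: int,
                         allow_cluster_extension: bool) -> int:
    """Hypothetical helper, not the real Foghorn implementation."""
    if cluster_max_seconds <= 0:
        # 0 means no cluster cap, so the tier ceiling stands.
        return tier_max_seconds
    if allow_cluster_extension:
        # Assumed enterprise behaviour: the cluster value replaces the tier
        # ceiling, bounded by the platform ceiling.
        return min(cluster_max_seconds, PLATFORM_CEILING_SECONDS)
    # Everyone else: the cluster value can only lower the tier ceiling.
    return min(tier_max_seconds, cluster_max_seconds)
```

Under these assumptions, a non-enterprise tenant with a 12 h tier maximum on a cluster capped at 6 h gets 6 h, while the same tenant on a cluster set to 24 h still gets 12 h.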
Tier defaults
Resolved at DVR start by pkg/dvrpolicy.Resolve. Snapshot stored on foghorn.artifacts.dvr_window_seconds so tier changes mid-stream don’t affect the in-flight session.
| Tier | Default window | Max window | Segment duration | Max manifest entries |
|---|---|---|---|---|
| Free | 30 min | 1 h | 6s | 600 |
| Supporter | 2 h | 6 h | 6s | 3,600 |
| Developer | 4 h | 12 h | 6s | 7,200 |
| Production | 4 h | 1 d | 12s | 7,200 |
| Enterprise | 4 h | 3 d (cluster opt-in) | 24s | 10,800 |
Defined in api_billing/internal/bootstrap/catalog/billing_tiers.yaml as tier entitlements.
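The last column of the tier table is consistent with the maximum window divided by the segment duration. The check below reproduces that arithmetic from the table values. Whether the entry count is derived this way or stored independently in the tier catalog is an assumption, and the names used here are illustrative only.

```python
# (max window in seconds, segment duration in seconds, max manifest entries),
# copied from the tier table above.
TIERS = {
    "Free": (1 * 3600, 6, 600),
    "Supporter": (6 * 3600, 6, 3600),
    "Developer": (12 * 3600, 6, 7200),
    "Production": (24 * 3600, 12, 7200),
    "Enterprise": (72 * 3600, 24, 10800),
}


def max_entries(max_window_seconds: int, segment_seconds: int) -> int:
    """Number of segments needed to cover the maximum window."""
    return max_window_seconds // segment_seconds


for tier, (window, segment, entries) in TIERS.items():
    # Each row of the table satisfies entries == window / segment.
    assert max_entries(window, segment) == entries, tier
```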
Chapter pipeline overview
Three Foghorn-side workers run the chapter lifecycle:
| Worker | Cadence | Job |
|---|---|---|
| chapter_sweeper.go | 60s | Rotates chapter boundaries on active DVRs: closes the open chapter, opens the next. |
| chapter_finalization_queue.go | 30s | Picks up state='closed' chapters and dispatches a dvr_chapter_finalize processing job to the recording origin. |
| chapter_reclaim_sweep.go | 60s | Once a chapter reaches state='frozen', deletes its source TS segments locally and the recovery-bridge S3 objects. |
Chapter state machine
```text
open → closed → finalizing → finalized → frozen → reclaimed
         ↓          ↓
         └───→ failed_source_missing | failed_permanent
```

| State | Meaning |
|---|---|
| open | Recording in progress; the rolling DVR surface serves viewers. |
| closed | Boundary reached; finalization queue will pick it up. |
| finalizing | Processing job in flight on the recording origin (Mist remuxes TS → canonical .mkv). |
| finalized | PUSH_END fired; chapter artifact exists locally. Waiting on freeze + .dtsh sync. |
| frozen | Chapter artifact + .dtsh durably on S3. Safe to reclaim source segments. |
| reclaimed | Source segments deleted; row remains as range metadata. Playback uses the canonical .mkv. |
| failed_source_missing | Recovery exhausted: at least one source segment was missing from both local and the recovery-bridge S3. |
| failed_permanent | Unrecoverable input (max retries exceeded, ledger invariants violated). |
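For tooling that reasons about chapter progress, the success path can be treated as an ordered list. The sketch below is a minimal illustration; the helper names are hypothetical and it deliberately says nothing about which states can transition into the failure states.

```python
# Success path in the order given by the state machine above.
HAPPY_PATH = ["open", "closed", "finalizing", "finalized", "frozen", "reclaimed"]
FAILED = {"failed_source_missing", "failed_permanent"}


def has_reached(state: str, milestone: str) -> bool:
    """True if `state` is at or past `milestone` on the success path."""
    if state in FAILED:
        # Failed chapters never count as having reached a milestone.
        return False
    return HAPPY_PATH.index(state) >= HAPPY_PATH.index(milestone)


def safe_to_reclaim_sources(state: str) -> bool:
    """Source segments are only safe to reclaim from 'frozen' onward."""
    return has_reached(state, "frozen")
```

An unknown state name raises ValueError from list.index, which is a reasonable way for a script to surface a schema change.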
Active DVR durability boundary
Per-segment S3 uploads stay as a recovery bridge, not playback infrastructure. Active DVR playback always reads the local rolling manifest on the recording origin (other edges DTSC-pull from that origin). The S3 segment objects exist only so that chapter finalization can recover from local segment loss (disk corruption, eviction edge case) and so the recording survives a recording-node loss until the chapter finalization queue produces the canonical .mkv.
Recording-node loss before chapter finalization can lose history depending on S3 recovery state; recording-node loss after a chapter reaches state='frozen' cannot.
Artifact statuses (DVR state machine)
```text
requested → starting → recording → finalizing
                           ↓            ↓
                           └────────────→ completed | completed_partial | failed → deleted
```

| Status | Meaning |
|---|---|
| requested / starting | Sidecar accepting DVR start; transient |
| recording | Active recording; Mist push running, segments writing to ledger |
| finalizing | Stream ended; Foghorn running bounded retry on pending segments before classification |
| completed | All segments uploaded; archive complete |
| completed_partial | Some segments are lost_local; chapter finalization recovers from S3 where it can, otherwise the chapter is marked failed_source_missing |
| failed | No playable segments; recording produced nothing usable |
| deleted | Past retention; soft-deleted by the retention job |
Triage: what completed_partial means
A completed_partial artifact has at least one segment that was force-evicted from local disk before the sidecar could upload it (disk pressure or retention edge case during recording). Chapter finalization first tries to recover those segments from S3; chapters that overlap an unrecoverable segment are marked failed_source_missing and produce no playback artifact (operator triage path).
To inspect:
```sql
SELECT segment_name, media_start_ms, media_end_ms, drop_reason, dropped_at
FROM foghorn.dvr_segments
WHERE artifact_hash = '<dvr_hash>' AND status = 'lost_local'
ORDER BY media_start_ms;
```

Cross-reference with disk-pressure metrics on the recording sidecar (storage_node_out_of_space lifecycle events) to find the root cause.
Triage: chapter not appearing for a viewer
1. Confirm the recording has a chapter mode snapshot:

   ```sql
   SELECT dvr_chapter_mode, dvr_chapter_interval, dvr_window_seconds
   FROM foghorn.artifacts
   WHERE artifact_hash = '<dvr_hash>';
   ```

   dvr_chapter_mode is snapshotted from the Stream at StartDVR. If it’s NULL, the recording was started with historical chapters off: the active rolling DVR window still records and serves live time-shift, but no finalized replay chapter artifacts will appear after media rolls out of that window. Configure dvrChapterMode on the Stream via updateStream; the change applies to the next recording, not this one.

2. Inspect the chapter row’s state:

   ```sql
   SELECT chapter_id, start_ms, end_ms, is_current, state,
          playback_artifact_hash, finalize_attempts, last_failure_reason
   FROM foghorn.dvr_chapters
   WHERE artifact_hash = '<dvr_hash>'
   ORDER BY start_ms DESC LIMIT 20;
   ```

3. Map the state to action:

   | State | What to do |
   |---|---|
   | open / closed | Recording still in flight or queue hasn’t picked up yet. Wait one finalization tick (30s). |
   | finalizing | Processing job dispatched. Check the recording origin’s Helmsman logs for chapter finalize entries. |
   | finalized / frozen / reclaimed | Chapter is playable. Use the chapter’s playback_id (public Commodore-minted ID) as the player input. |
   | failed_source_missing | Source segments lost from both local AND the recovery-bridge S3. Chapter is unrecoverable. Inspect dvr_segments.status. |
   | failed_permanent | Repeated retries exhausted (finalize_attempts ≥ 5) or invalid ledger input. Inspect last_failure_reason. |
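If you script this triage, the state-to-action mapping can be encoded directly. The sketch below is only an illustration: the function name and the short action labels are hypothetical and exist nowhere in FrameWorks.

```python
def triage_action(state: str) -> str:
    """Map a foghorn.dvr_chapters.state value to a next step (hypothetical labels)."""
    if state in ("open", "closed"):
        # Recording in flight or queue not yet run: wait one 30s tick.
        return "wait_one_finalization_tick"
    if state == "finalizing":
        return "check_helmsman_logs"
    if state in ("finalized", "frozen", "reclaimed"):
        # Chapter is playable via its playback_id.
        return "playable"
    if state == "failed_source_missing":
        return "inspect_dvr_segments_status"
    if state == "failed_permanent":
        return "inspect_last_failure_reason"
    raise ValueError(f"unknown chapter state: {state!r}")
```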
Triage: chapter finalization stuck
Chapter finalizations that hang in state='finalizing' past the dispatch deadline (default max(2*chapter_duration, 30 min) capped at 24h) are auto-re-queued by the next tick. To check manually:

```sql
SELECT chapter_id, state, finalize_attempts, last_failure_reason, created_at
FROM foghorn.dvr_chapters
WHERE state = 'finalizing'
ORDER BY created_at;
```

If finalize_attempts is climbing without state progress, the recording origin Helmsman is rejecting the job (admission, disk pressure, source unavailable). Check Helmsman logs for the chapter’s job_id (chapter-finalize-<chapter_id>).
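To judge whether a chapter has genuinely overrun, it helps to work out the default dispatch deadline for its duration. The sketch below applies the formula quoted in this section; the function name is hypothetical and any configurable overrides are ignored.

```python
from datetime import timedelta

MIN_DEADLINE = timedelta(minutes=30)
MAX_DEADLINE = timedelta(hours=24)


def dispatch_deadline(chapter_duration: timedelta) -> timedelta:
    """Default deadline: max(2 * chapter_duration, 30 min), capped at 24h."""
    return min(max(2 * chapter_duration, MIN_DEADLINE), MAX_DEADLINE)
```

For example, a 10-minute chapter gets the 30-minute floor, a 1-hour chapter gets 2 hours, and a 20-hour chapter is capped at 24 hours.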
Triage: storage node out of space during DVR
- dropPressuredDVRSegments on the recording sidecar will start evicting uploaded-and-aged segments first (safe local cleanup; the recovery-bridge S3 object remains intact).
- If pressure persists, the sidecar may be forced to evict unsynced segments. These emit DVRSegmentDropped(was_uploaded=false) and become lost_local ledger rows. Chapter finalization for overlapping chapters will surface this as failed_source_missing once recovery is exhausted.
- If the sidecar can’t keep up at all, it stops the push cleanly and emits storage_node_out_of_space; the artifact may end as failed_storage_pressure.
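The eviction order described above can be pictured as a two-pass selection. The sketch below is not the sidecar’s dropPressuredDVRSegments implementation: the Segment type, the min_age_seconds threshold, and the oldest-first ordering are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    size_bytes: int
    age_seconds: float
    uploaded: bool


def plan_evictions(segments, bytes_needed, min_age_seconds=60.0):
    """Return (safe, lossy) lists of segment names to evict, safe ones first."""
    safe, lossy = [], []
    freed = 0
    oldest_first = sorted(segments, key=lambda s: -s.age_seconds)
    # Pass 1: uploaded-and-aged segments; the S3 copy remains intact.
    for seg in oldest_first:
        if freed >= bytes_needed:
            break
        if seg.uploaded and seg.age_seconds >= min_age_seconds:
            safe.append(seg.name)
            freed += seg.size_bytes
    # Pass 2: only if still short, evict unsynced segments (data loss).
    for seg in oldest_first:
        if freed >= bytes_needed:
            break
        if not seg.uploaded:
            lossy.append(seg.name)
            freed += seg.size_bytes
    return safe, lossy
```

In this sketch every name in the second list corresponds to a segment that would become a lost_local ledger row. Uploaded segments younger than the threshold are never selected.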
Lifecycle event types to alert on:
- DVRSegmentDropped(was_uploaded=false): data loss event; investigate disk capacity
- storage_node_out_of_space: push aborted; node needs attention
- dvr_terminal rejection rate spiking: Foghorn-side state machine racing with sidecar; usually transient
S3 layout
```text
s3://bucket/dvr/{tenant_id}/{stream_internal_name}/{dvr_artifact_id}/segments/{segment_name}
s3://bucket/vod/{tenant_id}/{chapter_artifact_hash}.mkv
s3://bucket/vod/{tenant_id}/{chapter_artifact_hash}.dtsh
```

The DVR segment prefix holds the recovery-bridge TS segments uploaded during recording. These are NOT playback objects: viewers never read them. Chapter finalization fetches them (or recovers from local disk) to build the canonical .mkv. Once a chapter reaches state='frozen' the reclaim sweep deletes both the local TS file and the corresponding S3 segment object.
Chapter playback artifacts live at the normal VOD prefix. They’re regular VOD artifacts whose origin_type='dvr_chapter', origin_id=chapter_id, and source stream_id are registered in Commodore. library_visible=false keeps them out of the tenant-wide VOD library, but stream-scoped VOD artifact queries can still return them for a stream’s media overview.
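For read-only inspection, such as checking that an object exists, the key layout at the top of this section can be reproduced with simple string formatting. The helper names below are hypothetical and the returned keys omit the bucket.

```python
def dvr_segment_key(tenant_id: str, stream_internal_name: str,
                    dvr_artifact_id: str, segment_name: str) -> str:
    """Key of a recovery-bridge TS segment (never a playback object)."""
    return (f"dvr/{tenant_id}/{stream_internal_name}/"
            f"{dvr_artifact_id}/segments/{segment_name}")


def chapter_artifact_keys(tenant_id: str, chapter_artifact_hash: str):
    """Keys of the canonical chapter .mkv and its .dtsh sidecar file."""
    base = f"vod/{tenant_id}/{chapter_artifact_hash}"
    return base + ".mkv", base + ".dtsh"
```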
Don’t manually manipulate any of these prefixes; let the chapter finalization + reclaim sweep do their work.
Retention
Retention only acts on terminal artifacts. An active DVR (status recording / finalizing) is never deleted by the retention job, regardless of age.
retention_until is computed at FinalizeDVR as ended_at + dvr_retention_days*24h, where dvr_retention_days is snapshotted on foghorn.artifacts at DVR start. Commodore resolves that value through the per-class cascade (per-stream override → tenant per-class default → 30-day system default → tier cap) before passing it to Foghorn; once stamped, the tenant’s plan is never re-resolved at end. A dvr_retention_days of 0 or NULL means “keep forever” and the retention job skips the artifact entirely.
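The retention_until arithmetic can be sketched as follows. Returning None to represent “keep forever” is an assumption about representation only (this page does not say how the column is stored in that case), and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def compute_retention_until(ended_at: datetime,
                            dvr_retention_days: Optional[int]) -> Optional[datetime]:
    """ended_at + dvr_retention_days * 24h; 0 or None means keep forever."""
    if not dvr_retention_days:
        # Assumed representation: no deadline, so the retention job skips it.
        return None
    return ended_at + timedelta(hours=24 * dvr_retention_days)


ended = datetime(2025, 1, 1, tzinfo=timezone.utc)
deadline = compute_retention_until(ended, 30)  # 2025-01-31 00:00 UTC
```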
If you need to extend retention on a specific terminal artifact, update retention_until directly:
```sql
UPDATE foghorn.artifacts
SET retention_until = NOW() + INTERVAL '90 days'
WHERE artifact_hash = '<dvr_hash>'
  AND status IN ('completed', 'completed_partial');
```

The retention job picks up the new value on its next tick (default hourly).