24/7 DVR — Operations

What you’re operating

FrameWorks ships a continuous DVR archive: one DVR artifact per stream session, with no per-artifact lifetime cap. Live viewers seek across a tier-bounded rolling Mist window. Replay viewers navigate the recording sliced into chapter VOD artifacts that the chapter finalization queue produces from the per-segment ledger (foghorn.dvr_segments) at each boundary close. Chapter playback uses the chapter artifact’s normal VOD playback path — there is no dvr+<chapter_id> token anymore.

If you’re triaging a DVR issue, the engineering reference is docs/architecture/dvr-continuous-archive.md. This page is the operator-facing surface.

Cluster ceilings — env vars

The Foghorn process running in each cluster reads two env vars that cap the DVR window and manifest size for tenants on that cluster:

Env var	Meaning	Default
`DVR_CLUSTER_MAX_WINDOW_SECONDS`	Maximum live DVR window any tenant on this cluster can request, in seconds	`0` (no cluster cap; tier ceilings stand)
`DVR_CLUSTER_MAX_ENTRIES`	Maximum HLS playlist entries (rolling-window manifest size)	`0` (no cap)

Set these in the cluster’s gitops env file. Topology assumption: one Foghorn process per cluster, so process env IS the per-cluster surface.

Enterprise tenants have dvr_allow_cluster_extension=true in their tier entitlements — for them the cluster setting can raise their max DVR window up to the platform ceiling (72h). For non-enterprise tenants the cluster setting only acts as a ceiling.
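
The interaction between the tier ceiling and the cluster env var can be sketched as below. This is a minimal illustration, not the pkg/dvrpolicy implementation: the function name and arguments are hypothetical, and the enterprise branch assumes the cluster value replaces the tier maximum (bounded by the 72h platform ceiling), which the text above implies but does not spell out.

```python
PLATFORM_CEILING_SECONDS = 72 * 3600  # platform ceiling quoted above (72h)


def effective_max_window(tier_max_seconds, cluster_max_seconds, allow_cluster_extension):
    """Return the largest live DVR window a tenant may request, in seconds.

    Hypothetical helper: names and exact semantics are assumptions.
    """
    if cluster_max_seconds <= 0:
        # DVR_CLUSTER_MAX_WINDOW_SECONDS=0 means no cluster cap; the tier ceiling stands.
        return tier_max_seconds
    if allow_cluster_extension:
        # Assumption: with dvr_allow_cluster_extension=true the cluster value
        # becomes the tenant's maximum, never exceeding the platform ceiling.
        return min(cluster_max_seconds, PLATFORM_CEILING_SECONDS)
    # Non-enterprise tenants: the cluster value only acts as a ceiling.
    return min(tier_max_seconds, cluster_max_seconds)
```

For example, a Supporter tenant (6 h tier maximum) on a cluster capped at 7,200 seconds would resolve to 7,200 seconds under this sketch, while the same cap set to 0 leaves the 6 h maximum in place.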

Tier defaults

Tier defaults are resolved at DVR start by pkg/dvrpolicy.Resolve. The resolved window is snapshotted on foghorn.artifacts.dvr_window_seconds, so tier changes mid-stream don’t affect the in-flight session.

Tier	Default window	Max window	Segment duration	Max manifest entries
Free	30 min	1 h	6s	600
Supporter	2 h	6 h	6s	3,600
Developer	4 h	12 h	6s	7,200
Production	4 h	1 d	12s	7,200
Enterprise	4 h	3 d (cluster opt-in)	24s	10,800

Defined in api_billing/internal/bootstrap/catalog/billing_tiers.yaml as tier entitlements.
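
As a sanity check when editing these values, every row in the table above satisfies max manifest entries = max window / segment duration. That relationship is an observation about the published numbers, not a documented rule of the policy resolver, so treat the sketch below as a consistency check only. The helper name is hypothetical.

```python
# (tier, max window in seconds, segment duration in seconds, max manifest entries)
TIER_ROWS = [
    ("Free", 1 * 3600, 6, 600),
    ("Supporter", 6 * 3600, 6, 3600),
    ("Developer", 12 * 3600, 6, 7200),
    ("Production", 24 * 3600, 12, 7200),
    ("Enterprise", 72 * 3600, 24, 10800),
]


def mismatched_tiers(rows):
    """Return the tier names whose entry cap differs from window // segment duration."""
    return [name for name, window, seg, entries in rows if window // seg != entries]


print(mismatched_tiers(TIER_ROWS))  # prints []
```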

Chapter pipeline overview

Three Foghorn-side workers run the chapter lifecycle:

Worker	Cadence	Job
`chapter_sweeper.go`	60s	Rotates chapter boundaries on active DVRs: closes the open chapter, opens the next.
`chapter_finalization_queue.go`	30s	Picks up `state='closed'` chapters and dispatches a `dvr_chapter_finalize` processing job to the recording origin.
`chapter_reclaim_sweep.go`	60s	Once a chapter reaches `state='frozen'`, deletes its source TS segments locally and the recovery-bridge S3 objects.

Chapter state machine

open → closed → finalizing → finalized → frozen → reclaimed
           ↓        ↓
           └───→ failed_source_missing | failed_permanent

State	Meaning
`open`	Recording in progress; the rolling DVR surface serves viewers.
`closed`	Boundary reached; finalization queue will pick it up.
`finalizing`	Processing job in flight on the recording origin (Mist remuxes TS → canonical .mkv).
`finalized`	PUSH_END fired; chapter artifact exists locally. Waiting on freeze + .dtsh sync.
`frozen`	Chapter artifact + .dtsh durably on S3. Safe to reclaim source segments.
`reclaimed`	Source segments deleted; row remains as range metadata. Playback uses the canonical .mkv.
`failed_source_missing`	Recovery exhausted: at least one source segment was missing from both local disk and the recovery-bridge S3.
`failed_permanent`	Unrecoverable input (max retries exceeded, ledger invariants violated).
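
The state machine above can be written down as a transition table, which is handy when checking whether a row's history looks plausible. This is a sketch of one reading of the diagram (failure edges leaving `closed` and `finalizing`), not Foghorn's code. It deliberately omits the re-queue of stuck finalizations, and the helper names are hypothetical.

```python
# Allowed next states per chapter state, as read from the diagram above.
FAILURES = {"failed_source_missing", "failed_permanent"}
ALLOWED = {
    "open": {"closed"},
    "closed": {"finalizing"} | FAILURES,
    "finalizing": {"finalized"} | FAILURES,
    "finalized": {"frozen"},
    "frozen": {"reclaimed"},
    "reclaimed": set(),
    "failed_source_missing": set(),
    "failed_permanent": set(),
}


def can_transition(current, nxt):
    """True if the diagram permits moving from `current` to `nxt`."""
    return nxt in ALLOWED.get(current, set())


def source_segments_expendable(state):
    """True once the chapter artifact is durably on S3 (frozen or later)."""
    return state in {"frozen", "reclaimed"}
```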

Active DVR durability boundary

Per-segment S3 uploads stay as a recovery bridge, not playback infrastructure. Active DVR playback always reads the local rolling manifest on the recording origin (other edges DTSC-pull from that origin). The S3 segment objects exist only so that chapter finalization can recover from local segment loss (disk corruption, eviction edge case) and so the recording survives a recording-node loss until the chapter finalization queue produces the canonical .mkv.

Recording-node loss before chapter finalization can lose history depending on S3 recovery state; recording-node loss after a chapter reaches state='frozen' cannot.

Artifact statuses (DVR state machine)

requested → starting → recording → finalizing
              ↓             ↓
              └────────────→ completed | completed_partial | failed → deleted

Status	Meaning
`requested` / `starting`	Sidecar accepting DVR start; transient
`recording`	Active recording; Mist push running, segments writing to ledger
`finalizing`	Stream ended; Foghorn running bounded retry on pending segments before classification
`completed`	All segments uploaded; archive complete
`completed_partial`	Some segments are `lost_local`; chapter finalization recovers from S3 where it can, otherwise the chapter is marked `failed_source_missing`
`failed`	No playable segments; recording produced nothing usable
`deleted`	Past retention; soft-deleted by the retention job

Triage: what `completed_partial` means

A completed_partial artifact has at least one segment that was force-evicted from local disk before the sidecar could upload it (disk pressure or retention edge case during recording). Chapter finalization first tries to recover those segments from S3; chapters that overlap an unrecoverable segment are marked failed_source_missing and produce no playback artifact (operator triage path).

To inspect:

SELECT segment_name, media_start_ms, media_end_ms, drop_reason, dropped_at
  FROM foghorn.dvr_segments
 WHERE artifact_hash = '<dvr_hash>'
   AND status = 'lost_local'
 ORDER BY media_start_ms;

Cross-reference with disk-pressure metrics on the recording sidecar (storage_node_out_of_space lifecycle events) to find root cause.

Triage: chapter not appearing for a viewer

Confirm the recording has a chapter mode snapshot:
```
SELECT dvr_chapter_mode, dvr_chapter_interval, dvr_window_seconds
  FROM foghorn.artifacts
 WHERE artifact_hash = '<dvr_hash>';
```
dvr_chapter_mode is snapshotted from the Stream at StartDVR. If it’s NULL, the recording was started with historical chapters off: the active rolling DVR window still records and serves live time-shift, but no finalized replay chapter artifacts will appear after media rolls out of that window. Configure dvrChapterMode on the Stream via updateStream; the change applies to the next recording, not this one.

Inspect the chapter row’s state:

SELECT chapter_id, start_ms, end_ms, is_current, state,
       playback_artifact_hash, finalize_attempts, last_failure_reason
  FROM foghorn.dvr_chapters
 WHERE artifact_hash = '<dvr_hash>'
 ORDER BY start_ms DESC LIMIT 20;

Map the state to action:

State	What to do
`open` / `closed`	Recording still in flight or queue hasn’t picked up yet. Wait one finalization tick (30s).
`finalizing`	Processing job dispatched. Check the recording origin’s Helmsman logs for `chapter finalize` entries.
`finalized` / `frozen` / `reclaimed`	Chapter is playable. Use the chapter’s `playback_id` (public Commodore-minted ID) as the player input.
`failed_source_missing`	Source segments lost from both local disk and the recovery-bridge S3. Chapter is unrecoverable. Inspect `dvr_segments.status`.
`failed_permanent`	Retries exhausted (`finalize_attempts` ≥ 5) or invalid ledger input. Inspect `last_failure_reason`.
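
If you script this triage, the table above reduces to a small lookup. The sketch below only restates the table. The function name and the returned strings are hypothetical, and it does not query the database.

```python
PLAYABLE_STATES = {"finalized", "frozen", "reclaimed"}


def triage_action(state):
    """Map a foghorn.dvr_chapters.state value to the operator action from the table."""
    if state in ("open", "closed"):
        return "wait one finalization tick (30s)"
    if state == "finalizing":
        return "check Helmsman logs on the recording origin"
    if state in PLAYABLE_STATES:
        return "playable: use the chapter's playback_id"
    if state == "failed_source_missing":
        return "unrecoverable: inspect dvr_segments.status"
    if state == "failed_permanent":
        return "inspect last_failure_reason"
    # Fail loudly on states this sketch does not know about.
    raise ValueError(f"unknown chapter state: {state!r}")
```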

Triage: chapter finalization stuck

Chapter finalizations that hang in state='finalizing' past the dispatch deadline (default max(2*chapter_duration, 30 min) capped at 24h) are auto-re-queued by the next tick. To check manually:

SELECT chapter_id, state, finalize_attempts, last_failure_reason, created_at
  FROM foghorn.dvr_chapters
 WHERE state = 'finalizing'
 ORDER BY created_at;

If finalize_attempts is climbing without state progress, the recording origin Helmsman is rejecting the job (admission, disk pressure, source unavailable). Check Helmsman logs for the chapter’s job_id (chapter-finalize-<chapter_id>).
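
To judge whether a `finalizing` row is actually overdue, the default dispatch deadline quoted above, max(2*chapter_duration, 30 min) capped at 24h, works out as follows. This is a sketch of the arithmetic only. The function and constant names are hypothetical, and the real deadline may be configurable.

```python
from datetime import timedelta

MIN_DEADLINE = timedelta(minutes=30)  # floor from the default formula
MAX_DEADLINE = timedelta(hours=24)    # cap from the default formula


def dispatch_deadline(chapter_duration):
    """Return how long a chapter may sit in 'finalizing' before it is re-queued."""
    return min(max(2 * chapter_duration, MIN_DEADLINE), MAX_DEADLINE)
```

A 10-minute chapter gets the 30-minute floor, a 1-hour chapter gets 2 hours, and anything longer than 12 hours hits the 24-hour cap.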

Triage: storage node out of space during DVR

dropPressuredDVRSegments on the recording sidecar will start evicting uploaded-and-aged segments first (safe local cleanup; the recovery-bridge S3 object remains intact).
If pressure persists, the sidecar may be forced to evict unsynced segments — these emit DVRSegmentDropped(was_uploaded=false) and become lost_local ledger rows. Chapter finalization for overlapping chapters will surface this as failed_source_missing once recovery is exhausted.
If the sidecar can’t keep up at all, it stops the push cleanly and emits storage_node_out_of_space — the artifact may end as failed_storage_pressure.

Lifecycle event types to alert on:

DVRSegmentDropped(was_uploaded=false) — data loss event; investigate disk capacity
storage_node_out_of_space — push aborted; node needs attention
dvr_terminal rejection rate spiking — Foghorn-side state machine racing with sidecar; usually transient

S3 layout

s3://bucket/dvr/{tenant_id}/{stream_internal_name}/{dvr_artifact_id}/segments/{segment_name}
s3://bucket/vod/{tenant_id}/{chapter_artifact_hash}.mkv
s3://bucket/vod/{tenant_id}/{chapter_artifact_hash}.dtsh

The DVR segment prefix holds the recovery-bridge TS segments uploaded during recording. These are NOT playback objects: viewers never read them. Chapter finalization fetches them (or recovers from local disk) to build the canonical .mkv. Once a chapter reaches state='frozen' the reclaim sweep deletes both the local TS file and the corresponding S3 segment object.

Chapter playback artifacts live at the normal VOD prefix. They’re regular VOD artifacts registered in Commodore with origin_type='dvr_chapter', origin_id=chapter_id, and the source stream_id. library_visible=false keeps them out of the tenant-wide VOD library, but stream-scoped VOD artifact queries can still return them for a stream’s media overview.

Don’t manually manipulate any of these prefixes; let the chapter finalization + reclaim sweep do their work.
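
For read-only inspection (for example, working out which object a ledger row should correspond to), the layout above can be expressed as key builders. This is a sketch that mirrors the path templates. The function names are hypothetical, the bucket name is left out because it is deployment-specific, and nothing here touches S3.

```python
def dvr_segment_key(tenant_id, stream_internal_name, dvr_artifact_id, segment_name):
    """Object key for a recovery-bridge TS segment (not a playback object)."""
    return (
        f"dvr/{tenant_id}/{stream_internal_name}/"
        f"{dvr_artifact_id}/segments/{segment_name}"
    )


def chapter_artifact_keys(tenant_id, chapter_artifact_hash):
    """Object keys for a chapter's canonical .mkv and its .dtsh sidecar."""
    base = f"vod/{tenant_id}/{chapter_artifact_hash}"
    return {"mkv": f"{base}.mkv", "dtsh": f"{base}.dtsh"}
```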

Retention

Retention only acts on terminal artifacts. An active DVR (status recording / finalizing) is never deleted by the retention job, regardless of age.

retention_until is computed at FinalizeDVR as ended_at + dvr_retention_days*24h, where dvr_retention_days is snapshotted on foghorn.artifacts at DVR start. Commodore resolves that value through the per-class cascade (per-stream override → tenant per-class default → 30-day system default → tier cap) before passing it to Foghorn; once stamped, it is not re-resolved against the tenant’s plan when the recording ends. A dvr_retention_days of 0 or NULL means “keep forever” and the retention job skips the artifact entirely.
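
The retention_until arithmetic above can be sketched as follows. The function name is hypothetical, and returning None for the "keep forever" case is an illustrative choice, not a statement about how Foghorn stores it.

```python
from datetime import datetime, timedelta, timezone


def compute_retention_until(ended_at, dvr_retention_days):
    """Return ended_at + dvr_retention_days*24h, or None for 'keep forever'."""
    if not dvr_retention_days:
        # 0 or NULL (None here): the retention job skips the artifact entirely.
        return None
    return ended_at + timedelta(hours=24 * dvr_retention_days)


ended = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(compute_retention_until(ended, 30))  # 2024-01-31 00:00:00+00:00
```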

If you need to extend retention on a specific terminal artifact, update retention_until directly:

UPDATE foghorn.artifacts
   SET retention_until = NOW() + INTERVAL '90 days'
 WHERE artifact_hash = '<dvr_hash>'
   AND status IN ('completed', 'completed_partial');

The retention job picks up the new value on its next tick (default hourly).