
24/7 DVR — Operations

FrameWorks ships a continuous DVR archive: one DVR artifact per stream session, with no per-artifact lifetime cap. Live viewers seek across a tier-bounded rolling Mist window. Replay viewers navigate the recording sliced into chapter VOD artifacts that the chapter finalization queue produces from the per-segment ledger (foghorn.dvr_segments) at each boundary close. Chapter playback uses the chapter artifact’s normal VOD playback path — there is no dvr+<chapter_id> token anymore.

If you’re triaging a DVR issue, the engineering reference is docs/architecture/dvr-continuous-archive.md. This page is the operator-facing surface.

The Foghorn process running in each cluster reads two env vars that cap the DVR window and manifest size for tenants on that cluster:

| Env var | Meaning | Default |
| --- | --- | --- |
| DVR_CLUSTER_MAX_WINDOW_SECONDS | Maximum live DVR window any tenant on this cluster can request, in seconds | 0 (no cluster cap; tier ceilings stand) |
| DVR_CLUSTER_MAX_ENTRIES | Maximum HLS playlist entries (rolling-window manifest size) | 0 (no cap) |

Set these in the cluster’s gitops env file. Topology assumption: one Foghorn process per cluster, so process env IS the per-cluster surface.

Enterprise tenants have dvr_allow_cluster_extension=true in their tier entitlements — for them the cluster setting can raise their max DVR window up to the platform ceiling (72h). For non-enterprise tenants the cluster setting only acts as a ceiling.

The effective DVR window is resolved at DVR start by pkg/dvrpolicy.Resolve. The result is snapshotted on foghorn.artifacts.dvr_window_seconds so tier changes mid-stream don’t affect the in-flight session.
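As a rough illustration of the ceiling behaviour described above, the sketch below clamps a requested window against a tier maximum and an optional cluster cap, treating 0 as "no cluster cap". It is not the pkg/dvrpolicy.Resolve implementation: the function name and parameters are hypothetical, and the enterprise extension path (dvr_allow_cluster_extension) is deliberately left out because its exact rules are not spelled out on this page.

```python
def clamp_window_seconds(requested: int, tier_max: int, cluster_max: int = 0) -> int:
    """Hypothetical helper: apply tier and cluster ceilings to a requested window.

    cluster_max mirrors DVR_CLUSTER_MAX_WINDOW_SECONDS, where 0 means
    "no cluster cap; tier ceilings stand". Enterprise extension is NOT modelled.
    """
    ceiling = tier_max
    if cluster_max > 0:
        # A non-zero cluster setting can only lower the ceiling in this sketch.
        ceiling = min(ceiling, cluster_max)
    return min(requested, ceiling)


# A tenant asking for 2 h on a tier capped at 6 h, on a cluster capped at 1 h.
print(clamp_window_seconds(2 * 3600, 6 * 3600, 1 * 3600))
```

With the cluster cap left at 0, the same call would be limited by the tier maximum alone.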

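| Tier | Default window | Max window | Segment duration | Max manifest entries |
| --- | --- | --- | --- | --- |
| Free | 30 min | 1 h | 6s | 600 |
| Supporter | 2 h | 6 h | 6s | 3,600 |
| Developer | 4 h | 12 h | 6s | 7,200 |
| Production | 4 h | 1 d | 12s | 7,200 |
| Enterprise | 4 h | 3 d (cluster opt-in) | 24s | 10,800 |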

Defined in api_billing/internal/bootstrap/catalog/billing_tiers.yaml as tier entitlements.
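The manifest-entry limits in the tier table appear to equal the maximum window divided by the segment duration. That relationship is inferred from the numbers on this page rather than stated by the catalog, so treat the check below as a sanity aid for reviewing tier changes, not as a definition. The dictionary and function names are illustrative only.

```python
# Tier values copied from the table above:
# (max window in seconds, segment duration in seconds, max manifest entries).
TIERS = {
    "Free": (1 * 3600, 6, 600),
    "Supporter": (6 * 3600, 6, 3600),
    "Developer": (12 * 3600, 6, 7200),
    "Production": (24 * 3600, 12, 7200),
    "Enterprise": (72 * 3600, 24, 10800),
}


def entries_for(window_seconds: int, segment_seconds: int) -> int:
    """Number of whole segments needed to cover a rolling window."""
    return window_seconds // segment_seconds


for name, (max_window, segment, max_entries) in TIERS.items():
    # Print the derived entry count next to the documented limit.
    print(name, entries_for(max_window, segment), max_entries)
```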

Three Foghorn-side workers run the chapter lifecycle:

| Worker | Cadence | Job |
| --- | --- | --- |
| chapter_sweeper.go | 60s | Rotates chapter boundaries on active DVRs: closes the open chapter, opens the next. |
| chapter_finalization_queue.go | 30s | Picks up state='closed' chapters and dispatches a dvr_chapter_finalize processing job to the recording origin. |
| chapter_reclaim_sweep.go | 60s | Once a chapter reaches state='frozen', deletes its source TS segments locally and the recovery-bridge S3 objects. |

open → closed → finalizing → finalized → frozen → reclaimed
↓ ↓
└───→ failed_source_missing | failed_permanent

| State | Meaning |
| --- | --- |
| open | Recording in progress; the rolling DVR surface serves viewers. |
| closed | Boundary reached; finalization queue will pick it up. |
| finalizing | Processing job in flight on the recording origin (Mist remuxes TS → canonical .mkv). |
| finalized | PUSH_END fired; chapter artifact exists locally. Waiting on freeze + .dtsh sync. |
| frozen | Chapter artifact + .dtsh durably on S3. Safe to reclaim source segments. |
| reclaimed | Source segments deleted; row remains as range metadata. Playback uses the canonical .mkv. |
| failed_source_missing | Recovery exhausted: at least one source segment was missing from both local and the recovery-bridge S3. |
| failed_permanent | Unrecoverable input (max retries exceeded, ledger invariants violated). |

Per-segment S3 uploads stay as a recovery bridge, not playback infrastructure. Active DVR playback always reads the local rolling manifest on the recording origin (other edges DTSC-pull from that origin). The S3 segment objects exist only so that chapter finalization can recover from local segment loss (disk corruption, eviction edge case) and so the recording survives a recording-node loss until the chapter finalization queue produces the canonical .mkv.

Recording-node loss before chapter finalization can lose history depending on S3 recovery state; recording-node loss after a chapter reaches state='frozen' cannot.

requested → starting → recording → finalizing
↓ ↓
└────────────→ completed | completed_partial | failed → deleted

| Status | Meaning |
| --- | --- |
| requested / starting | Sidecar accepting DVR start; transient |
| recording | Active recording; Mist push running, segments writing to ledger |
| finalizing | Stream ended; Foghorn running bounded retry on pending segments before classification |
| completed | All segments uploaded; archive complete |
| completed_partial | Some segments are lost_local; chapter finalization recovers from S3 where it can, otherwise the chapter is marked failed_source_missing |
| failed | No playable segments; recording produced nothing usable |
| deleted | Past retention; soft-deleted by the retention job |

A completed_partial artifact has at least one segment that was force-evicted from local disk before the sidecar could upload it (disk pressure or retention edge case during recording). Chapter finalization first tries to recover those segments from S3; chapters that overlap an unrecoverable segment are marked failed_source_missing and produce no playback artifact (operator triage path).

To inspect:

SELECT segment_name, media_start_ms, media_end_ms, drop_reason, dropped_at
FROM foghorn.dvr_segments
WHERE artifact_hash = '<dvr_hash>'
AND status = 'lost_local'
ORDER BY media_start_ms;

Cross-reference with disk-pressure metrics on the recording sidecar (storage_node_out_of_space lifecycle events) to find root cause.

Triage: chapter not appearing for a viewer

  1. Confirm the recording has a chapter mode snapshot:

    SELECT dvr_chapter_mode, dvr_chapter_interval, dvr_window_seconds
    FROM foghorn.artifacts
    WHERE artifact_hash = '<dvr_hash>';

    dvr_chapter_mode is snapshotted from the Stream at StartDVR. If it’s NULL, the recording was started with historical chapters off: the active rolling DVR window still records and serves live time-shift, but no finalized replay chapter artifacts will appear after media rolls out of that window. Configure dvrChapterMode on the Stream via updateStream; the change applies to the next recording, not this one.

  2. Inspect the chapter row’s state:

    SELECT chapter_id, start_ms, end_ms, is_current, state,
    playback_artifact_hash, finalize_attempts, last_failure_reason
    FROM foghorn.dvr_chapters
    WHERE artifact_hash = '<dvr_hash>'
    ORDER BY start_ms DESC LIMIT 20;
  3. Map the state to action:

    | State | What to do |
    | --- | --- |
    | open / closed | Recording still in flight or queue hasn’t picked up yet. Wait one finalization tick (30s). |
    | finalizing | Processing job dispatched. Check the recording origin’s Helmsman logs for chapter finalize entries. |
    | finalized / frozen / reclaimed | Chapter is playable. Use the chapter’s playback_id (public Commodore-minted ID) as the player input. |
    | failed_source_missing | Source segments lost from both local AND the recovery-bridge S3. Chapter is unrecoverable. Inspect dvr_segments.status. |
    | failed_permanent | Repeated retries exhausted (finalize_attempts ≥ 5) or invalid ledger input. Inspect last_failure_reason. |
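For scripted triage, the state-to-action mapping in step 3 can be encoded as a small lookup over the state column returned by the query in step 2. This is an illustrative sketch only: the function and constant names are hypothetical, and the action strings paraphrase the table rather than quote any tool output.

```python
# States in which the chapter artifact is playable, per the triage table.
PLAYABLE_STATES = {"finalized", "frozen", "reclaimed"}


def triage_action(state: str) -> str:
    """Hypothetical helper: map a foghorn.dvr_chapters.state value to a next step."""
    if state in ("open", "closed"):
        return "wait one finalization tick (30s)"
    if state == "finalizing":
        return "check Helmsman logs on the recording origin"
    if state in PLAYABLE_STATES:
        return "playable: use the chapter's playback_id"
    if state == "failed_source_missing":
        return "unrecoverable: inspect dvr_segments.status"
    if state == "failed_permanent":
        return "inspect last_failure_reason"
    # Fail loudly on anything not listed in the table.
    raise ValueError(f"unknown chapter state: {state!r}")


for s in ("closed", "frozen", "failed_permanent"):
    print(s, "->", triage_action(s))
```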

Chapter finalizations that hang in state='finalizing' past the dispatch deadline (default max(2*chapter_duration, 30 min) capped at 24h) are auto-re-queued by the next tick. To check manually:

SELECT chapter_id, state, finalize_attempts, last_failure_reason, created_at
FROM foghorn.dvr_chapters
WHERE state = 'finalizing'
ORDER BY created_at;

If finalize_attempts is climbing without state progress, the recording origin Helmsman is rejecting the job (admission, disk pressure, source unavailable). Check Helmsman logs for the chapter’s job_id (chapter-finalize-<chapter_id>).
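To judge whether a chapter in state='finalizing' is actually overdue, the default dispatch deadline quoted above, max(2*chapter_duration, 30 min) capped at 24h, can be worked out as follows. The helper names are hypothetical and the code only restates that formula and the job_id pattern; it does not read any Foghorn configuration.

```python
def dispatch_deadline_seconds(chapter_duration_seconds: int) -> int:
    """Default deadline: max(2 * chapter_duration, 30 min), capped at 24 h."""
    floor = 30 * 60
    cap = 24 * 3600
    return min(max(2 * chapter_duration_seconds, floor), cap)


def finalize_job_id(chapter_id: str) -> str:
    """job_id pattern to search for in Helmsman logs."""
    return f"chapter-finalize-{chapter_id}"


# A 10-minute chapter hits the 30-minute floor; a 1-hour chapter gets 2 hours.
print(dispatch_deadline_seconds(10 * 60), dispatch_deadline_seconds(60 * 60))
```

A chapter of 12 hours or longer always lands on the 24h cap.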

Triage: storage node out of space during DVR

  1. dropPressuredDVRSegments on the recording sidecar will start evicting uploaded-and-aged segments first (safe local cleanup; the recovery-bridge S3 object remains intact).
  2. If pressure persists, the sidecar may be forced to evict unsynced segments — these emit DVRSegmentDropped(was_uploaded=false) and become lost_local ledger rows. Chapter finalization for overlapping chapters will surface this as failed_source_missing once recovery is exhausted.
  3. If the sidecar can’t keep up at all, it stops the push cleanly and emits storage_node_out_of_space — the artifact may end as failed_storage_pressure.

Lifecycle event types to alert on:

  • DVRSegmentDropped(was_uploaded=false): data loss event; investigate disk capacity
  • storage_node_out_of_space: push aborted; node needs attention
  • dvr_terminal rejection rate spiking: Foghorn-side state machine racing with sidecar; usually transient

s3://bucket/dvr/{tenant_id}/{stream_internal_name}/{dvr_artifact_id}/segments/{segment_name}
s3://bucket/vod/{tenant_id}/{chapter_artifact_hash}.mkv
s3://bucket/vod/{tenant_id}/{chapter_artifact_hash}.dtsh

The DVR segment prefix holds the recovery-bridge TS segments uploaded during recording. These are NOT playback objects: viewers never read them. Chapter finalization fetches them (or recovers from local disk) to build the canonical .mkv. Once a chapter reaches state='frozen' the reclaim sweep deletes both the local TS file and the corresponding S3 segment object.

Chapter playback artifacts live at the normal VOD prefix. They’re regular VOD artifacts whose origin_type='dvr_chapter', origin_id=chapter_id, and source stream_id are registered in Commodore. library_visible=false keeps them out of the tenant-wide VOD library, but stream-scoped VOD artifact queries can still return them for a stream’s media overview.

Don’t manually manipulate any of these prefixes; let the chapter finalization + reclaim sweep do their work.
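For read-only inspection (for example, checking whether an object exists), the key layout shown above can be assembled with a pair of helpers. This is a sketch that only fills in the documented templates: the function names are hypothetical, and the bucket name is left to the caller because the page shows it only as a placeholder.

```python
from typing import Tuple


def dvr_segment_key(tenant_id: str, stream_internal_name: str,
                    dvr_artifact_id: str, segment_name: str) -> str:
    """Key of a recovery-bridge TS segment under the dvr/ prefix."""
    return (f"dvr/{tenant_id}/{stream_internal_name}/"
            f"{dvr_artifact_id}/segments/{segment_name}")


def chapter_artifact_keys(tenant_id: str, chapter_artifact_hash: str) -> Tuple[str, str]:
    """Keys of the canonical .mkv and its .dtsh under the vod/ prefix."""
    base = f"vod/{tenant_id}/{chapter_artifact_hash}"
    return f"{base}.mkv", f"{base}.dtsh"


# Placeholder identifiers, for illustration only.
print(dvr_segment_key("tenant-1", "stream-a", "dvr-42", "segment-0001.ts"))
print(chapter_artifact_keys("tenant-1", "chapterhash"))
```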

Retention only acts on terminal artifacts. An active DVR (status recording / finalizing) is never deleted by the retention job, regardless of age.

retention_until is computed at FinalizeDVR as ended_at + dvr_retention_days*24h, where dvr_retention_days is snapshotted on foghorn.artifacts at DVR start. Commodore resolves that value through the per-class cascade (per-stream override → tenant per-class default → 30-day system default → tier cap) before passing it to Foghorn; once stamped, the tenant’s plan is never re-resolved at end. A dvr_retention_days of 0 or NULL means “keep forever” and the retention job skips the artifact entirely.
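The retention arithmetic above can be reproduced when checking whether a stamped retention_until looks right. The sketch below restates the formula ended_at + dvr_retention_days*24h and the "0 or NULL means keep forever" rule; the function name is hypothetical, and returning None for the keep-forever case is an assumption of this example, not a statement about how Foghorn stores it.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def retention_until(ended_at: datetime,
                    dvr_retention_days: Optional[int]) -> Optional[datetime]:
    """ended_at + dvr_retention_days * 24h, or None when retention is unlimited."""
    if not dvr_retention_days:
        # 0 or NULL: "keep forever"; the retention job skips the artifact.
        return None
    return ended_at + timedelta(hours=24 * dvr_retention_days)


ended = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(retention_until(ended, 30))
print(retention_until(ended, 0))
```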

If you need to extend retention on a specific terminal artifact, update retention_until directly:

UPDATE foghorn.artifacts
SET retention_until = NOW() + INTERVAL '90 days'
WHERE artifact_hash = '<dvr_hash>'
AND status IN ('completed', 'completed_partial');

The retention job picks up the new value on its next tick (default hourly).