FrameWorks Docs | Blog

Introducing Skipper: AI Video Consultant

Sat, 07 Feb 2026 00:00:00 GMT

Every software platform needs a comprehensive API and docs for people to build on. Modern tools like Docusaurus and Astro Starlight make it easier than ever to publish high-quality documentation — write your content in Markdown, and the build spits out something that looks good, with search, themes, and translation built in.

But the way people interact with docs is changing. With the rise of agentic assistants, developers increasingly point an AI at your docs and let it figure things out. What used to take two engineers two weeks — read the docs, build an integration, wire up a UI — now takes one agent a few hours to crawl, plan, and ship. The docs aren’t just for humans anymore; they’re a knowledge source for agents too.

That shift changes what a platform should offer. Static docs are table stakes. What developers actually need is a consultant that understands the domain, can look things up across dozens of sources, run live diagnostics on their infrastructure, and guide or execute authenticated platform workflows where that surface allows mutations — all while being honest about what it knows versus what it’s guessing.

Meet Skipper — the AI video consultant built into FrameWorks.

Skipper connects through the same MCP gateway that external agents use. The same GraphQL API, the same tool registry, the same authorization checks. In the dashboard, Skipper can use your authenticated account context for streams, diagnostics, and billing. External agents can use the Gateway MCP endpoint directly, while ask_consultant runs the Skipper pipeline with mutation tools blocked inside the consultant call.

What Skipper Does

That gateway access means Skipper doesn’t just answer questions. In the authenticated dashboard path, it can also call platform tools to take actions. In docs mode and through MCP ask_consultant, mutation tools are intentionally blocked; external agents should call the dedicated Gateway MCP tools directly when they need to change resources.
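A minimal sketch of how that gating could work, assuming each tool in the registry carries a flag saying whether it changes state. The tool names come from this post; the `mutates` flags, the mode names, and the `allowed_tools` helper are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    mutates: bool  # True if the tool changes platform state (assumed flag)


# Tool names appear in this post; the mutates values are assumptions.
REGISTRY = [
    Tool("diagnose_rebuffering", mutates=False),
    Tool("generate_query", mutates=False),
    Tool("create_clip", mutates=True),
]

# Hypothetical mode names: only authenticated dashboard chat may mutate.
MUTATION_MODES = {"dashboard"}


def allowed_tools(mode: str) -> list[str]:
    """Return the tool names a Skipper call may use in the given mode."""
    can_mutate = mode in MUTATION_MODES
    return [t.name for t in REGISTRY if can_mutate or not t.mutates]
```

With this shape, `allowed_tools("docs")` and `allowed_tools("ask_consultant")` both drop `create_clip`, while `allowed_tools("dashboard")` keeps it.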

Knowledge base search. Skipper searches curated documentation across 10+ streaming domains: FrameWorks, MistServer, FFmpeg, OBS, SRT, HLS, nginx-rtmp, DASH, WebRTC, and more. Searches are semantic, not keyword-matched. Source-backed answers include citations back to the source material.

Live stream diagnostics. Skipper connects to your running streams and analyzes real data. Rebuffering patterns, packet loss, buffer health, routing decisions, anomaly detection — it can diagnose problems that would take you 30 minutes to find manually.

Stream management. In authenticated dashboard chat, Skipper can create streams, refresh keys, make clips, start DVR recording, and upload VOD assets through the same GraphQL API you’d use directly, but conversationally. Through MCP, use ask_consultant for guidance and the dedicated mutation tools for state changes.

API introspection. Building an integration? Skipper reads your GraphQL schema, generates ready-to-use queries, and can wrap them in working code snippets. It introspects the schema live — so the output always matches your current API version.

Billing and payments. Check your balance, initiate supported crypto top-ups, or resolve x402 payment flows from authenticated dashboard chat or the dedicated Gateway MCP billing/payment tools.

Confidence Tagging

Every Skipper response is tagged with a confidence level:

Level	Meaning
Verified	Confirmed from official FrameWorks documentation or tested procedures
Sourced	Cited from external documentation (OBS, FFmpeg, etc.) with references
Best guess	Inferred from general knowledge — verify before acting on it
Unknown	Couldn’t validate from available sources

This isn’t cosmetic. When Skipper generates a Python snippet from your schema, it knows that the GraphQL query is verified (generated from the real schema) but the Python wrapper is best guess (LLM-generated code). The tag tells you where to focus your review.
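One way to picture per-part tagging is a response made of segments, each with its own level. This is a sketch, not Skipper’s actual data model: the `Segment` type, the `kind` labels, and `needs_review` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Confidence(Enum):
    VERIFIED = "verified"
    SOURCED = "sourced"
    BEST_GUESS = "best_guess"
    UNKNOWN = "unknown"


@dataclass
class Segment:
    kind: str          # hypothetical label, e.g. "graphql_query"
    content: str
    confidence: Confidence


def needs_review(segments: list[Segment]) -> list[Segment]:
    """Return the segments a reader should check before acting on them."""
    risky = (Confidence.BEST_GUESS, Confidence.UNKNOWN)
    return [s for s in segments if s.confidence in risky]


# The example from the paragraph above: schema-derived query vs. LLM wrapper.
response = [
    Segment("graphql_query", "<query generated from the schema>", Confidence.VERIFIED),
    Segment("python_wrapper", "<LLM-written wrapper code>", Confidence.BEST_GUESS),
]
review_first = [s.kind for s in needs_review(response)]
```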

Tool Calls

Skipper doesn’t just generate text. It uses 30+ MCP tools organized across stream management, QoE diagnostics, billing, knowledge search, API introspection, and more. When Skipper runs a tool, you see exactly what happened:

diagnose_rebuffering — analyzed your stream’s rebuffer ratio, found the root cause
generate_query — built a createStream mutation from your live schema
create_clip — clipped the last 30 seconds of your broadcast

Tool results render in specialized cards: diagnostic cards show health status with metrics and recommendations, code cards show GraphQL with copy buttons, stream cards display keys with show/hide toggles.

Heartbeat Monitoring

Skipper can run periodic health checks on active streams for eligible tenants. Every 30 minutes by default (configurable with HEARTBEAT_INTERVAL), it fetches stream health and client QoE metrics from Periscope, compares them against running baselines it maintains over time with Welford’s online algorithm, and runs a deterministic triage:

Threshold violations (hard limits on rebuffer ratio, packet loss, etc.) trigger an investigation
Cross-metric correlations — 5 failure hypotheses (network degradation, encoder overload, viewer-side issues, ingest instability, CDN pressure) are matched against deviation patterns. Confidence ≥ 50% triggers investigation.
Baseline deviations — metrics exceeding 2σ from the running mean are flagged for review
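The baseline and threshold checks above can be sketched in a few lines. Welford’s algorithm itself is standard; the `RunningBaseline` and `triage` names, the single-metric shape, and the example limit are assumptions made for illustration, not Skipper’s code.

```python
import math


class RunningBaseline:
    """Welford's online algorithm for a running mean and variance."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared differences from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self) -> float:
        # Sample standard deviation; reported as 0.0 below two samples.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def deviates(self, x: float, sigmas: float = 2.0) -> bool:
        sd = self.stddev()
        return sd > 0 and abs(x - self.mean) > sigmas * sd


def triage(value: float, baseline: RunningBaseline, hard_limit: float) -> str:
    """Deterministic triage for one metric: hard limit first, then 2 sigma."""
    if value > hard_limit:
        return "investigate"
    if baseline.deviates(value):
        return "review"
    return "healthy"
```

Because the baseline only stores a count, a mean, and a sum of squares, it can be updated on every heartbeat without keeping the metric history around.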

When investigation is warranted, Skipper runs the full orchestrator with the diagnostic context — deviations, correlations, per-stream anomalies — and produces a structured report with root cause and recommendations. Healthy tenants consume zero LLM calls per cycle.
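The cross-metric correlation step could look something like the sketch below. The five hypothesis names are from this post, but the metric signatures and the scoring rule (fraction of expected metrics that deviated) are invented for illustration; the real matching logic may differ.

```python
# Hypothetical signatures: the deviating metrics each hypothesis expects.
HYPOTHESES = {
    "network_degradation": {"packet_loss", "rebuffer_ratio", "bitrate"},
    "encoder_overload": {"frame_drops", "bitrate"},
    "viewer_side_issues": {"rebuffer_ratio", "startup_time"},
    "ingest_instability": {"ingest_reconnects", "frame_drops"},
    "cdn_pressure": {"edge_latency", "startup_time"},
}


def match_hypotheses(deviating: set[str], threshold: float = 0.5) -> dict[str, float]:
    """Return hypotheses whose confidence meets the investigation threshold."""
    matches = {}
    for name, expected in HYPOTHESES.items():
        # Assumed scoring: share of the expected metrics that deviated.
        confidence = len(expected & deviating) / len(expected)
        if confidence >= threshold:
            matches[name] = confidence
    return matches
```

An empty result is the cheap path: nothing matched, so no orchestrator run and no LLM call is needed for that cycle.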

Per-stream drill-down identifies the most anomalous individual streams (up to 20) by comparing each against the tenant-wide baseline.
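A rough sketch of that drill-down, assuming anomaly is measured as distance from the tenant-wide mean in standard deviations. The post does not state the scoring function, so the z-score here and the `most_anomalous` name are assumptions.

```python
def most_anomalous(
    stream_values: dict[str, float],
    tenant_mean: float,
    tenant_stddev: float,
    limit: int = 20,
) -> list[tuple[str, float]]:
    """Rank streams by absolute z-score against the tenant-wide baseline."""
    if tenant_stddev <= 0:
        return []  # no spread in the baseline, so nothing to rank against
    scored = [
        (stream_id, abs(value - tenant_mean) / tenant_stddev)
        for stream_id, value in stream_values.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]
```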

Infrastructure monitoring checks node-level CPU, memory, and disk across all clusters, with persistence confirmation to prevent transient spikes from triggering alerts. Notifications go out via configured email, WebSocket, or MCP channels.
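Persistence confirmation can be as simple as requiring several consecutive violations before alerting. The sketch below assumes a count-based rule; the `PersistenceGate` name, the key format, and the default of three checks are hypothetical.

```python
from collections import defaultdict


class PersistenceGate:
    """Suppress alerts until a violation has persisted for N checks in a row."""

    def __init__(self, required: int = 3) -> None:
        self.required = required
        self.streak = defaultdict(int)  # consecutive violations per key

    def observe(self, key: str, violated: bool) -> bool:
        """Record one check; return True only when the alert should fire."""
        if not violated:
            self.streak[key] = 0  # a healthy check resets the streak
            return False
        self.streak[key] += 1
        return self.streak[key] >= self.required
```

A single CPU spike followed by a healthy reading resets the streak, so it never reaches the notification channels.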

Skipper can draft social media posts from platform events when SKIPPER_SOCIAL_ENABLED=true and SKIPPER_SOCIAL_NOTIFY_EMAIL is configured. The heartbeat agent and knowledge crawler push signals into an event collector — viewer records, bandwidth milestones, federation metrics, newly embedded documentation pages. A detector classifies and scores these signals against stored baselines.

The utility LLM composes the top-scoring signal into a tweet of at most 280 characters, referencing recent posts to avoid repeating themes. Posts are saved as drafts and emailed to a configured address for human review before publishing. Both the daily limit (default 2) and the check interval (default 2h) are configurable.

Where to Use It

Skipper is available in three places — all backed by the same MCP gateway:

Dashboard — authenticated chat at /skipper with account-scoped diagnostics and management actions where enabled.
Docs site — authenticated floating widget (Cmd+J). Read-only mode: knowledge search, schema introspection, and diagnostics. No mutations.
Your own agents — connect Claude Code, OpenClaw, or any MCP client to the Gateway endpoint. Use ask_consultant for full-pipeline read/diagnostic answers with confidence tagging alongside the Gateway’s dedicated platform tools.

Architecture

Skipper runs as a hub-and-spoke MCP setup. The Gateway MCP acts as the hub with 30+ tools. Skipper is a spoke that both consumes Gateway tools (diagnostics, stream management) and provides tools back (knowledge search, web search).

graph TD
    subgraph "Ingestion (scheduled)"
        CRAWL["Crawl Sitemaps + Direct Pages"] --> DETECT["SPA Detection + Headless Chrome"]
        DETECT --> EXTRACT["Content Extraction<br/><small>Readability → Markdown</small>"]
        EXTRACT --> CHUNK["Chunk ~500 tokens"]
        CHUNK --> EMBED["Embed"]
        EMBED --> PG[("pgvector")]
    end
    subgraph "Query Time"
        Q["User Query"] --> EMB["Embed Query"]
        EMB --> SEARCH["Hybrid Search<br/><small>vector + full-text</small>"]
        PG -.-> SEARCH
        SEARCH --> RERANK["Rerank + Deduplicate"]
        RERANK --> LLM["LLM + 30+ MCP Tools"]
        LLM --> CONF["Confidence Tagging"]
        CONF --> RESP["Response with Citations"]
    end

The knowledge base is built by a scheduled crawler that indexes documentation across 10+ streaming domains. The crawler handles both static sites and JavaScript-heavy SPAs — it auto-detects pages that need rendering and processes them in headless Chrome. Content extraction uses Mozilla’s Readability algorithm to strip navigation and boilerplate, and three layers of change detection (source TTL, HTTP 304, content hashing) avoid re-embedding unchanged pages.
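The three change-detection layers can be read as two cheap questions asked in order: should the crawler fetch at all, and if it did, should it re-embed? A minimal sketch, assuming SHA-256 for the content hash and a per-page state record; `PageState`, `should_fetch`, and `should_embed` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageState:
    fetched_at: float   # Unix time of the last fetch
    content_hash: str   # digest of the content that was last embedded


def content_digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def should_fetch(state: Optional[PageState], ttl_seconds: float, now: float) -> bool:
    """Layer 1: skip the request while the source TTL has not expired."""
    return state is None or now - state.fetched_at >= ttl_seconds


def should_embed(state: Optional[PageState], status: int, body: str) -> bool:
    """Layers 2 and 3: skip on HTTP 304, then skip if the hash is unchanged."""
    if status == 304:
        return False
    return state is None or content_digest(body) != state.content_hash
```

Ordering the checks from cheapest to most expensive means the embedding step, usually the costly one, only runs when the extracted content has actually changed.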

At query time, retrieval combines vector cosine similarity with PostgreSQL full-text search, followed by a reranking pass and per-source deduplication. Web search falls back to Tavily/Brave/SearXNG for topics not covered in the knowledge base.
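The post does not say how the vector and full-text result lists are merged. The sketch below uses reciprocal rank fusion as a stand-in for that step, followed by a simple per-source cap; both function names and the `per_source` default are assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs; higher combined score ranks first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Break score ties by ID so the output order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def dedupe_per_source(
    doc_ids: list[str], source_of: dict[str, str], per_source: int = 1
) -> list[str]:
    """Keep at most `per_source` results from each source, preserving order."""
    kept: list[str] = []
    seen: dict[str, int] = {}
    for doc_id in doc_ids:
        source = source_of[doc_id]
        if seen.get(source, 0) < per_source:
            kept.append(doc_id)
            seen[source] = seen.get(source, 0) + 1
    return kept
```

Rank-based fusion avoids having to normalize cosine similarity against full-text scores, which live on different scales.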

Every conversation is persisted with token counts, confidence levels, source metadata when available, and tool call history — so you can pick up where you left off.

Try It

Skipper Docs — full reference
Open Dashboard — start chatting where Skipper is enabled for your account
Press Cmd+J on any docs page to open the widget

Skipper is available where it is enabled for an eligible account or deployment. Chat access is controlled by the configured tier and rate limits; standard usage billing applies for any stream operations Skipper performs on your behalf.

What Seven AI Agents Found in Our Streaming Platform

Sat, 31 Jan 2026 00:00:00 GMT

Live video systems collect awkward edge cases.

A viewer connects from a location that sits on the wrong side of a geofence boundary. A DVR recording starts at the same moment a stream is shutting down. Two usage events for the same tenant arrive close enough together that billing code has to prove it is actually idempotent. None of those cases are hard to understand on their own. The hard part is that they do not stay on their own.

FrameWorks is split across routing, ingest, VOD, edge orchestration, analytics, billing, auth, MCP agent access, DNS, Skipper, WebSocket routing, and tenant management. Each subsystem has tests and reviews, but the interesting failures tend to sit between them. They show up when a stream lifecycle event crosses into billing, or when a viewer-routing decision depends on data that is still warming in a cache.

We wanted a way to keep looking for those failures after the initial implementation work was done. Not a one-off security audit, and not a big rewrite disguised as process. Just a repeatable way to ask: which parts of the platform have not been reviewed recently, what can go wrong there, and can we prove it from the code?

So we ran a manually orchestrated audit using seven agent roles.

How the audit worked

The process was intentionally simple. One agent picked a domain to inspect. Another traced the relevant code paths and wrote findings. A separate reviewer checked those findings against the repository. If a fix made sense, another agent implemented it, a reviewer looked at the pull request, and a final pass addressed review feedback before CI and human merge.

The roles were less important than the separation between them. The agent that found a bug did not get to declare the fix correct. The agent that wrote the patch did not review its own work. The human still had the last merge decision.

We also made one rule non-negotiable: every finding had to cite evidence. A vague claim like “there may be a race condition here” was not enough. The report had to point to the files involved, describe the interleaving or input that triggered the issue, explain the impact, and propose a concrete fix.

That made the process much more useful. It also filtered out a lot of confident nonsense.

Why agents helped

The strongest use case was not “AI replaces engineers.” It was much narrower: agents are good at patiently tracing boring paths through a large codebase.

For example, a human reviewer might look at the billing handler, confirm the obvious transaction boundary, and move on. An audit agent can keep following the event backward into Kafka consumers, forward into ClickHouse writes, sideways into tenant scoping, and then ask what happens if two messages arrive for the same tenant in the same small window.

That kind of review is tedious. It is also where a lot of production bugs live.

The agents were useful in three places:

Breadth. They could cover many subsystems without requiring one person to keep the entire platform in their head all week.
Patience. They did not mind following a stream from API request to database write to service event to async consumer.
Second opinions. A separate review pass caught findings that sounded plausible but did not match the actual code or design intent.

The last point mattered most. LLMs can misunderstand why a system is built a certain way. Sometimes code that looks suspicious is an intentional trade-off. Sometimes an apparent race is already prevented by a lock one layer down. The review pass forced us to separate “this looks scary” from “this is actually broken.”

What we found

We ran the first version across 12 subsystems in 26 batches. Each task was scoped to a few hours of agent work, and the batches were small enough that a human could still review the output without drowning in it.

The audit completed 90 tasks across the platform. The most common findings were exactly the kinds of things we expected in a multi-tenant streaming system:

Category	Count	Example
Race conditions	12	Prepaid balance deduction under concurrent Kafka messages
Tenant isolation	8	Stream context cache keyed by `internal_name` only, missing `tenant_id`
Data loss risks	7	Decklog batch flush interrupted by producer crash
Stale cache behavior	6	GeoIP cache stampede under bursty traffic
Protocol edge cases	5	Player protocol blacklist leading to “no playable protocol” dead-end
Storage consistency	5	S3 upload succeeds but local delete fails, leaving duplicate artifacts
Auth bypass vectors	4	GraphQL complexity bypass through deep nesting with small page sizes
DNS propagation	3	Stale DNS pointing at decommissioned nodes
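The tenant isolation row is the easiest one to picture in code. A minimal sketch of the fix direction, assuming an in-memory cache: key every entry by both `tenant_id` and `internal_name` so two tenants with the same stream name can never read each other’s context. The `StreamContextCache` class is hypothetical.

```python
from typing import Optional


class StreamContextCache:
    """Cache keyed by (tenant_id, internal_name) instead of internal_name alone."""

    def __init__(self) -> None:
        self._entries: dict[tuple[str, str], dict] = {}

    def put(self, tenant_id: str, internal_name: str, context: dict) -> None:
        self._entries[(tenant_id, internal_name)] = context

    def get(self, tenant_id: str, internal_name: str) -> Optional[dict]:
        # A lookup without the right tenant_id cannot hit another tenant's entry.
        return self._entries.get((tenant_id, internal_name))
```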

The most valuable findings were not always the highest severity. A few were boring but important tenant-isolation checks. A few were “this is safe, and here is why” confirmations that turned implicit assumptions into documented invariants. Those are useful too, because future changes have something concrete to preserve.

What changed in the code review culture

The audit changed the questions we ask during review.

“Does this query filter by tenant?” became a first-class check, not an afterthought. “Is this operation idempotent if Kafka retries it?” started showing up in reviews outside the audit. “What happens if this cache key collides across tenants?” became easier to ask because we had examples from real findings.

That is the part of the experiment we would keep even without the agents. A good audit leaves behind sharper engineering habits.
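The idempotency question has a small, testable shape. The sketch below is an in-memory stand-in, not the FrameWorks billing code: it assumes each usage event carries a unique ID and uses a lock where a real system would more likely rely on a database transaction with a uniqueness constraint. `PrepaidLedger` and `event_id` are hypothetical names.

```python
import threading


class PrepaidLedger:
    """Apply each usage event at most once, even under concurrent delivery."""

    def __init__(self, balance_cents: int) -> None:
        self.balance_cents = balance_cents
        self._applied: set[str] = set()
        self._lock = threading.Lock()

    def deduct(self, event_id: str, amount_cents: int) -> bool:
        """Return True if the deduction was applied, False for a duplicate."""
        with self._lock:
            # The check and the write happen under one lock, so two consumers
            # handling the same event cannot both pass the check.
            if event_id in self._applied:
                return False
            self.balance_cents -= amount_cents
            self._applied.add(event_id)
            return True
```

A retried message then becomes a no-op instead of a second charge, which is the property a reviewer is asking about.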

It also made us more careful about severity. Agents are tempted to overstate risk because “critical issue found” sounds more useful than “possible edge case worth testing.” The review stage pushed severity back down when the evidence did not support it. That made the high-severity findings easier to take seriously.

What we would do differently

The first run was too verbose. Some reports included too much process, too many proposed formats, and too much architecture explanation. That is useful while designing an internal workflow, but it is not useful when the goal is to decide whether a bug is real.

Next time, we would make the reports shorter:

one paragraph for the claim
exact files and functions involved
the failing interleaving or input
the user or tenant impact
the smallest reasonable fix
the test that proves the fix
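That list maps directly onto a small record type, which also makes the evidence rule checkable before a report reaches a reviewer. This is a sketch with hypothetical field names, not a format we have adopted.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    claim: str        # one paragraph stating the issue
    files: list[str]  # exact files and functions involved
    trigger: str      # the failing interleaving or input
    impact: str       # the user or tenant impact
    fix: str          # the smallest reasonable fix
    test: str         # the test that proves the fix

    def missing_fields(self) -> list[str]:
        """Names of fields left empty, in declaration order."""
        return [name for name, value in vars(self).items() if not value]
```

A report with a non-empty `missing_fields()` result can be bounced back automatically rather than spending review time on it.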

We would also keep batch sizes small. Three tasks per batch was about right. Larger batches create a review queue, and the human reviewer becomes the bottleneck. Smaller batches make the pipeline feel busy without producing enough useful coverage.

Where this goes next

The manual version proved that the process is worth keeping, but we do not want a pile of bespoke prompts and hand-run tasks to become part of how FrameWorks operates.

The next step is lightweight automation: track when each subsystem was last audited, prioritize higher-risk areas like billing and auth, and only open work when there is a useful review to run. The goal is not to have agents constantly changing code. The goal is to keep stale, risky parts of the platform from going unexamined for too long.
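One possible shape for that prioritization, assuming a simple score of days since the last audit multiplied by a per-subsystem risk weight. The weights, the threshold, and the `audit_queue` function are all hypothetical; nothing here is built yet.

```python
from datetime import date

# Hypothetical weights: higher-risk areas go stale faster.
RISK_WEIGHT = {"billing": 3.0, "auth": 3.0}


def audit_queue(
    last_audited: dict[str, date], today: date, min_score: float = 30.0
) -> list[str]:
    """Return subsystems due for an audit, most overdue first."""
    due = []
    for subsystem, when in last_audited.items():
        staleness_days = (today - when).days
        score = staleness_days * RISK_WEIGHT.get(subsystem, 1.0)
        if score >= min_score:  # below the threshold, no work is opened
            due.append((score, subsystem))
    return [name for _, name in sorted(due, key=lambda item: (-item[0], item[1]))]
```

An empty queue is the expected result on most days, which matches the goal of only opening work when there is a useful review to run.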

We already use a similar idea in Skipper’s runtime monitoring. Most checks should be quiet. They wake up, inspect the current state, and do nothing unless something actually needs attention. The audit process should behave the same way.

AI agents were helpful here because they gave us more review coverage, not because they removed judgment from the system. The useful pattern was evidence first, independent review second, automation third, and human merge last.

That is less flashy than “seven agents fixed our platform.” It is also much closer to how we want to build software.

Welcome to the FrameWorks Blog

Fri, 06 Dec 2024 00:00:00 GMT

Most streaming platforms keep their architecture behind closed doors. You get API docs, a dashboard, and a support email. How the ingest pipeline actually works, how viewer routing decisions are made, how billing enforcement interacts with stream lifecycle — that stays internal.

We think that’s backwards. If you’re trusting a platform with your live video, you should understand how it works. Not just the API surface, but the design decisions underneath.

This blog is where we write about that. Expect:

Architecture deep-dives — how subsystems are designed, why we made the trade-offs we did, and what we’d do differently
Feature launches — not marketing copy, but technical explanations of what shipped and how it works under the hood
Engineering process — how we test, audit, and operate a multi-tenant streaming platform

No release cadence, no filler. We publish when there’s something worth reading.