Skipper Setup

Skipper requires three things: an LLM for chat, an embedding model for the knowledge base, and PostgreSQL with pgvector. Optionally, add a web search provider for live lookups beyond the knowledge base.

All configuration is through environment variables. The LLM and embedding providers are independent — you can mix local embeddings with an API-hosted LLM.

Model Selection

Embeddings

Embedding models convert text into vectors for similarity search. Quality here affects every answer — bad retrieval poisons everything downstream.

Option	Run where	Notes
`nomic-embed-text`	Ollama / local CPU or GPU	Strong default for local deployments.
`bge-m3`	Ollama / local CPU or GPU	Multi-lingual, multi-granularity. Heavier but excellent.
Hosted embedding API	OpenAI-compatible provider	Simple operations path; verify current provider pricing.
Blueclaw / Livepeer-network AI	OpenAI-compatible or BYOC runtime	Good fit when you want network-provided inference/embeddings instead of separate SaaS APIs.

Recommendation: Start with local or network-provided embeddings if available. If retrieval quality is insufficient, try a higher-quality hosted embedding model and re-crawl.

LLM — Local (Ollama)

Self-hosted via Ollama. No per-token cost — you pay for hardware.

Tier 1 — Lightweight (7-8B) — single 8-16GB GPU

Model	Context	VRAM (Q4)	Notes
Qwen 3 8B	128K	~5GB	Strong instruction following + tool use
Llama 3.1 8B	128K	~5GB	Battle-tested, huge community
Mistral 7B v0.3	32K	~4.5GB	Fast, good at structured prompts

Good for simple lookups. Struggles with multi-hop reasoning across multiple chunks.

Tier 2 — Sweet Spot (14-32B) — single 24-48GB GPU (recommended)

Model	Context	VRAM (Q4)	Notes
Qwen 2.5 32B	128K	~20GB	Best all-around. Strong RAG + instruction following.
Qwen 3 32B	128K	~20GB	Built-in thinking mode. Good for reasoning.
Mistral Small 3.1 24B	128K	~15GB	Excellent function calling. Fits on a 24GB card.
Command R 35B	128K	~22GB	Purpose-built for RAG by Cohere. Native citations.

RTX 3090/4090 (24GB) runs Q4_K_M of 32B models at ~15-25 tok/s — fine for chat.

Tier 3 — Heavy (70B+) — 2x 24GB or 1x 80GB GPU

Model	Context	VRAM (Q4)	Notes
Llama 3.3 70B	128K	~40GB	Near 405B quality.
Qwen 2.5 72B	128K	~42GB	Excellent multilingual + code.

The jump from 32B to 70B is ~10-15% better on hard queries. Usually not worth 2-3x the hardware for a domain consultant with good RAG.

LLM — Hosted or Network-Provided

No local hardware needed. Skipper uses OpenAI-compatible chat and embedding endpoints, so hosted SaaS, Blueclaw, Livepeer-network inference, or a BYOC model container can all fit the same configuration shape when they expose compatible APIs.

For Livepeer-network deployments, check the current Livepeer AI docs and ask in the Livepeer Discord before planning around a custom model or BYOC container. Availability changes as model runners and orchestrator support evolve. BYOC is useful when the model you want is not exposed by a managed endpoint yet.

Option	Best For	Notes
Hosted SaaS model API	Fastest path to production	Verify current pricing and data policy directly with the provider.
Blueclaw	OpenAI-compatible gateway for agents	Can reduce vendor lock-in if it has the models you need.
Livepeer-network inference	Keeping inference spend and usage on-network	Good strategic fit when available for your target models.
Livepeer BYOC container	Custom models or runtime control	More operator work, but lets you bring models not exposed by default.

Cost Estimates

Skipper cost depends on:

conversation history included in the prompt
retrieved knowledge chunks and citations
tool-call count and tool result size
whether query rewriting, HyDE, reranking, or web search are enabled
provider pricing, model choice, and whether inference runs locally or on-network

Early planning shape:

Deployment shape	Cost profile	When to choose it
Local Ollama	Hardware and operations, no per-token API bill	Self-hosted clusters with spare GPU capacity.
Blueclaw / Livepeer-network AI	Network/provider pricing, less separate SaaS usage	Agents-first deployments and on-network usage goals.
Budget hosted SaaS model	Usually low at modest chat volume	Fast setup when answer quality requirements are modest.
Higher-quality hosted SaaS model	Can climb quickly with long prompts/tool results	Harder support cases or premium account tiers.

For budgeting, use Skipper’s recorded tokensInput, tokensOutput, tool calls, and provider/model fields as the source of truth. Real deployments should compare that usage data with current provider pricing or local inference costs before committing to a support-tier margin.

Configuration

Environment Variables

LLM:

Variable	Purpose	Default
`LLM_PROVIDER`	`openai`, `anthropic`, or `ollama`	—
`LLM_MODEL`	Model identifier	—
`LLM_API_KEY`	API credentials	—
`LLM_API_URL`	Custom endpoint (OpenRouter, self-hosted)	Provider default
`LLM_MAX_TOKENS`	Max output tokens per response	`4096`

Embeddings (falls back to LLM_* when unset):

Variable	Purpose	Default
`EMBEDDING_PROVIDER`	`openai` or `ollama`	`LLM_PROVIDER`
`EMBEDDING_MODEL`	Embedding model name	`LLM_MODEL`
`EMBEDDING_API_KEY`	Embedding API credentials	`LLM_API_KEY`
`EMBEDDING_API_URL`	Embedding endpoint	`LLM_API_URL`

Utility LLM (for background tasks like contextual retrieval; falls back to LLM_* when unset):

Variable	Purpose	Default
`UTILITY_LLM_PROVIDER`	Cheap LLM for background tasks (query rewriting, HyDE, contextual retrieval)	`LLM_PROVIDER`
`UTILITY_LLM_MODEL`	Utility model identifier	`LLM_MODEL`
`UTILITY_LLM_API_KEY`	Utility LLM credentials	`LLM_API_KEY`
`UTILITY_LLM_API_URL`	Utility LLM endpoint	`LLM_API_URL`

Web search (optional):

Variable	Purpose	Default
`SEARCH_PROVIDER`	`tavily`, `brave`, or `searxng`	—
`SEARCH_API_KEY`	Search API key (not needed for SearXNG)	—
`SEARCH_API_URL`	Custom endpoint (required for SearXNG)	Provider default

Retrieval quality:

Variable	Purpose	Default
`RERANKER_PROVIDER`	Cross-encoder reranker: `cohere`, `jina`, `voyage`, or `generic`	— (keyword fallback)
`RERANKER_MODEL`	Reranker model (e.g. `rerank-4-pro`, `rerank-2.5`, `jina-reranker-v2-base-multilingual`)	—
`RERANKER_API_KEY`	Reranker API credentials	`LLM_API_KEY`
`RERANKER_API_URL`	Reranker endpoint (required for `generic` provider)	Provider default
`SKIPPER_ENABLE_HYDE`	Enable Hypothetical Document Embeddings for `search_knowledge`	`false`

Knowledge base:

Variable	Purpose	Default
`SITEMAPS`	Comma-separated sitemap URLs	—
`SKIPPER_SITEMAPS_DIR`	Directory of source files (re-read each cycle)	—
`CRAWL_INTERVAL`	Refresh interval	`24h`
`CHUNK_TOKEN_LIMIT`	Max BPE tokens per chunk	`500`
`CHUNK_TOKEN_OVERLAP`	Overlap tokens between adjacent chunks	`50`
`SKIPPER_ENABLE_RENDERING`	Enable headless Chrome for JS-rendered pages	`false`
`SKIPPER_CONTEXTUAL_RETRIEVAL`	Use utility LLM to prepend context before embedding	`false`
`SKIPPER_LINK_DISCOVERY`	Discover and crawl same-domain links	`false`
`SKIPPER_SEARCH_LIMIT`	Default result limit for `search_knowledge`	`8`

Service:

Variable	Purpose	Default
`SKIPPER_WEB_UI`	Enable standalone web UI at `/`	`true`
`SKIPPER_API_KEY`	API key for admin WebUI auth	—
`SKIPPER_WEB_UI_INSECURE`	Allow the standalone WebUI without an API key	`false`
`SKIPPER_REQUIRED_TIER_LEVEL`	Minimum subscription tier	`3`
`SKIPPER_CHAT_RATE_LIMIT_PER_HOUR`	Rate limit per tenant	`0` (unlimited)
`SKIPPER_CHAT_RATE_LIMIT_OVERRIDES`	Per-tenant overrides (`tenant_id:limit,...`)	—
`SKIPPER_ADMIN_TENANT_ID`	Tenant ID for global/platform knowledge	—
`SKIPPER_MAX_HISTORY_MESSAGES`	Max conversation messages loaded per request	`20`
`GATEWAY_MCP_URL`	Internal Gateway MCP endpoint for platform tools	derived from Bridge mesh URL
`GATEWAY_PUBLIC_URL`	Public API Gateway base URL; fallback for MCP when `GATEWAY_MCP_URL` is unset	—

Social posting:

Variable	Purpose	Default
`SKIPPER_SOCIAL_ENABLED`	Enable event-driven social posting agent	`false`
`SKIPPER_SOCIAL_INTERVAL`	How often to check for noteworthy events	`2h`
`SKIPPER_SOCIAL_MAX_PER_DAY`	Max posts per day (`0` = unlimited)	`2`
`SKIPPER_SOCIAL_NOTIFY_EMAIL`	Email to send draft tweets to (required when enabled)	—

Configuration Examples

Zero ongoing cost. Requires a GPU for the LLM; embeddings run on CPU.

# LLM
LLM_PROVIDER=ollama
LLM_MODEL=qwen2.5:32b
LLM_API_URL=http://localhost:11434/v1

# Utility LLM — cheap model for background tasks (contextual retrieval)
UTILITY_LLM_PROVIDER=ollama
UTILITY_LLM_MODEL=qwen2.5:7b
UTILITY_LLM_API_URL=http://localhost:11434/v1

# Embeddings (same Ollama instance, different endpoint)
EMBEDDING_PROVIDER=ollama
EMBEDDING_MODEL=nomic-embed-text
EMBEDDING_API_URL=http://localhost:11434

# Web search (optional, self-hosted)
SEARCH_PROVIDER=searxng
SEARCH_API_URL=http://localhost:8080

# Reranker (optional — self-hosted cross-encoder via generic provider)
# RERANKER_PROVIDER=generic
# RERANKER_MODEL=BAAI/bge-reranker-v2-m3
# RERANKER_API_URL=http://localhost:8787

# HyDE — improves search_knowledge quality at ~500-1500ms extra latency
# SKIPPER_ENABLE_HYDE=true

Pull models first:

ollama pull qwen2.5:32b
ollama pull qwen2.5:7b
ollama pull nomic-embed-text

No hardware needed.

# LLM — Claude Sonnet
LLM_PROVIDER=anthropic
LLM_MODEL=claude-sonnet-4-5-20250929
LLM_API_KEY=sk-ant-...

# Utility LLM — Haiku for background tasks (contextual retrieval)
UTILITY_LLM_PROVIDER=anthropic
UTILITY_LLM_MODEL=claude-haiku-4-5-20251001
UTILITY_LLM_API_KEY=sk-ant-...

# Embeddings — OpenAI (Anthropic doesn't offer embeddings)
EMBEDDING_PROVIDER=openai
EMBEDDING_MODEL=text-embedding-3-small
EMBEDDING_API_KEY=sk-...

# Web search
SEARCH_PROVIDER=tavily
SEARCH_API_KEY=tvly-...

# Reranker — Cohere rerank
RERANKER_PROVIDER=cohere
RERANKER_MODEL=rerank-4-pro
RERANKER_API_KEY=...

# HyDE — improves search_knowledge quality at ~500-1500ms extra latency
# SKIPPER_ENABLE_HYDE=true

One API key, access to all providers. Swap models by changing one string.

# LLM — any model via OpenRouter (OpenAI-compatible)
LLM_PROVIDER=openai
LLM_MODEL=anthropic/claude-sonnet-4-5
LLM_API_KEY=sk-or-...
LLM_API_URL=https://openrouter.ai/api/v1

# Utility LLM — cheap model via OpenRouter for background tasks
UTILITY_LLM_PROVIDER=openai
UTILITY_LLM_MODEL=anthropic/claude-haiku-4-5
UTILITY_LLM_API_KEY=sk-or-...
UTILITY_LLM_API_URL=https://openrouter.ai/api/v1

# Embeddings — OpenAI directly
EMBEDDING_PROVIDER=openai
EMBEDDING_MODEL=text-embedding-3-small
EMBEDDING_API_KEY=sk-...

# Web search
SEARCH_PROVIDER=brave
SEARCH_API_KEY=BSA...

OpenRouter doesn’t proxy embeddings or reranking, so use provider APIs directly for those.

# Reranker — pick one:

# Voyage AI (200M free tokens, instruction-following)
RERANKER_PROVIDER=voyage
RERANKER_MODEL=rerank-2.5
RERANKER_API_KEY=pa-...

# ZeroEntropy (#1 quality, fastest, half-price)
# RERANKER_PROVIDER=generic
# RERANKER_MODEL=zerank-2
# RERANKER_API_KEY=ze-...
# RERANKER_API_URL=https://api.zeroentropy.dev/v1/models

# Jina (cheapest, strong multilingual)
# RERANKER_PROVIDER=jina
# RERANKER_MODEL=jina-reranker-v2-base-multilingual
# RERANKER_API_KEY=jina_...

Local embeddings (free), API LLM (no GPU needed).

# LLM — API
LLM_PROVIDER=anthropic
LLM_MODEL=claude-sonnet-4-5-20250929
LLM_API_KEY=sk-ant-...

# Utility LLM — Haiku for background tasks (much cheaper than Sonnet)
UTILITY_LLM_PROVIDER=anthropic
UTILITY_LLM_MODEL=claude-haiku-4-5-20251001
UTILITY_LLM_API_KEY=sk-ant-...

# Embeddings — local Ollama
EMBEDDING_PROVIDER=ollama
EMBEDDING_MODEL=nomic-embed-text
EMBEDDING_API_URL=http://localhost:11434

# Web search
SEARCH_PROVIDER=tavily
SEARCH_API_KEY=tvly-...

# Reranker — Cohere rerank
RERANKER_PROVIDER=cohere
RERANKER_MODEL=rerank-4-pro
RERANKER_API_KEY=...

# HyDE
SKIPPER_ENABLE_HYDE=true

Knowledge Base

Skipper’s knowledge base is populated by crawling documentation sources and embedding them into pgvector.

How the Crawler Works

Fetch pages from sitemaps or direct URLs 2. Detect whether the page needs headless rendering (SPA detection) 3. Extract readable text via Readability → Markdown (strips navigation, boilerplate) 4. Chunk into ~500-token segments with 50-token overlap 5. Embed each chunk via the configured embedding model 6. Store in pgvector with metadata (source URL, title, source type, ingestion timestamp)

The full ingestion pipeline handles everything from sitemap discovery through to vector storage:

graph TD
    SRC["Sitemap URLs / Direct Pages / Uploads"] --> FETCH["Fetch Sitemap XML"]
    FETCH --> VALIDATE["URL Validation<br/><small>SSRF check · DNS resolution ·<br/>private CIDR blocking</small>"]
    VALIDATE --> ROBOTS["robots.txt<br/><small>SkipperBot/1.0</small>"]
    ROBOTS --> CACHE{"Cached?<br/><small>TTL · ETag · Hash</small>"}
    CACHE -->|unchanged| SKIP[Skip]
    CACHE -->|new or changed| HTTP["HTTP Fetch"]
    HTTP --> DETECT{"SPA Detection<br/><small>score ≥ 4?</small>"}
    DETECT -->|static| EXTRACT
    DETECT -->|SPA or empty shell| RENDER["Headless Chrome<br/><small>Rod · stealth mode ·<br/>blocks images/fonts/CSS</small>"]
    RENDER --> EXTRACT["Content Extraction<br/><small>Readability → Markdown<br/>fallback: DOM walker</small>"]
    EXTRACT --> HASH{"Content Hash<br/>SHA-256"}
    HASH -->|unchanged| SKIP
    HASH -->|new| CHUNK["Chunk<br/><small>~500 tokens · 50 overlap<br/>heading-aware blocks</small>"]
    CHUNK --> CTX{"Contextual<br/>Retrieval?"}
    CTX -->|enabled| UTIL["Utility LLM<br/><small>1-2 sentence context<br/>prepended per chunk</small>"]
    UTIL --> EMBED["Embed"]
    CTX -->|disabled| EMBED
    EMBED --> STORE["pgvector<br/><small>atomic upsert per source</small>"]

Change Detection

The crawler runs every CRAWL_INTERVAL (default 24h) and uses three layers of change detection to avoid unnecessary work:

Source TTL — skip sources crawled within the interval
HTTP 304 — conditional fetch with ETag / If-Modified-Since
Content hash — SHA-256 comparison skips re-embedding unchanged pages

When rendering is enabled, Skipper also sends a HEAD request before launching Chrome — if the Content-Length matches the cached value, it skips headless rendering entirely.

SPA Rendering

Many documentation sites are JavaScript-heavy SPAs that return empty shells to a plain HTTP fetch. When SKIPPER_ENABLE_RENDERING=true, the crawler auto-detects these pages using a scoring heuristic:

SPA mount points (<div id="root">, <div id="app">, <div id="__next">)
<noscript> tags, framework markers (data-reactroot, ng-app, data-v-)
High script-to-text ratio, low text density in <body>

If the score reaches 4 or the extracted text has fewer than 10 words, Skipper renders the page in headless Chromium (via Rod) with stealth mode enabled and non-essential resources (images, fonts, CSS) blocked. The browser waits 500ms for DOM stability before extracting the rendered HTML.

Content is extracted using Mozilla’s Readability algorithm (converted to Markdown). If Readability returns too little text, a fallback DOM walker strips navigation, sidebars, and hidden elements.

Contextual Retrieval

When SKIPPER_CONTEXTUAL_RETRIEVAL=true, Skipper uses a utility LLM to prepend 1-2 sentences of context to each chunk before embedding. This helps disambiguation — a chunk about “buffer settings” gets tagged with whether it’s about OBS, FFmpeg, or MistServer. The context is used only for embedding; the original chunk text is stored for retrieval.

Retrieval Quality

Three features improve retrieval accuracy at query time. All are optional and independent — enable what fits your latency and cost budget.

Cross-encoder reranking (RERANKER_PROVIDER) replaces the default keyword-overlap heuristic with a model that scores (query, chunk) pairs together. This understands semantic equivalence that keyword matching cannot — “rebuffering issues” matches “playback stalling”, “my stream keeps dying” matches “connection timeout troubleshooting”. Applied to both pre-retrieval (every message) and search_knowledge tool calls.

Provider	Model	Quality	Latency	Cost	Notes
Cohere	`rerank-4-pro`	#2 ELO	~600ms	~$2/1K queries	Incumbent, widest cloud availability (Bedrock, Azure). `rerank-4-fast` trades quality for speed.
Voyage AI	`rerank-2.5`	#4 ELO	~600ms	$0.05/1M tokens	200M free tokens. Instruction-following. 1K docs/request. Backed by MongoDB.
Jina	`jina-reranker-v2-base-multilingual`	#12 ELO	~750ms	~$0.02/1M tokens	Cheapest. Strong multilingual and code search. 131K context in v3.
ZeroEntropy	`zerank-2`	#1 ELO	~265ms	$0.025/1M tokens	Best quality + speed + price. Use `generic` provider.
Contextual AI	`rerank-v2-instruct`	#9 ELO	~3.3s	$0.05/1M tokens	Instruction-following, recency-aware. Use `generic` provider.
Generic	Any `/v1/rerank` endpoint	Varies	Varies	Varies	Self-hosted BGE/MXBai models, or any compatible API.

Query rewriting (automatic when utility LLM is configured) transforms conversational questions into search-optimized queries before embedding. “My European viewers are buffering a lot” becomes “European CDN edge rebuffering latency troubleshooting”. Applied to search_knowledge and search_web tool calls; skipped for pre-retrieval to keep latency low.

HyDE (SKIPPER_ENABLE_HYDE=true) generates a hypothetical answer to the user’s question, then embeds that answer instead of the question for vector search. The resulting vector is closer in embedding space to real documentation. Adds ~500-1500ms latency per search_knowledge call. Not used for pre-retrieval or web search.

Built-in Sources

Skipper ships with curated source files in config/skipper/sitemaps/:

File	Content
`frameworks.txt`	Platform docs and marketing site (env-templated URLs)
`ecosystem.txt`	MistServer, Livepeer, Daydream, WebRTC, DASH, Streamplace
`obs.txt`	OBS Studio knowledge base articles
`ffmpeg.txt`	FFmpeg tools, codecs, formats, protocols, filters
`srt.txt`	SRT protocol docs from Haivision
`nginx-rtmp.txt`	nginx-rtmp-module wiki
`hls-spec.txt`	HLS specification (IETF RFCs)

Set SKIPPER_SITEMAPS_DIR to the directory containing these files. In Docker Compose, this is mounted as a read-only volume at /etc/skipper/sitemaps.

Adding Custom Sources

Create a .txt file in the sitemaps directory. Two formats are supported:

Sitemap URLs — standard XML sitemaps:

https://logbook.example.com/sitemap.xml

Direct page URLs — for sites without sitemaps, prefix with page::

# Internal documentation
page:https://wiki.example.com/streaming-guide
page:https://wiki.example.com/encoder-settings

Force-rendered URLs — for SPAs that always need headless Chrome, prefix with render::

# React/Next.js docs that return empty shells without JS
render:https://spa-docs.example.com/getting-started
render:https://spa-docs.example.com/api-reference

Lines starting with # are comments. Empty lines are ignored. Environment variables are expanded (e.g., ${DOCS_PUBLIC_URL}).

Uploading Documents

The admin API accepts direct document uploads:

curl -X POST /api/skipper/admin/pages \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -d '{"url": "internal://runbook", "title": "Streaming Runbook", "content": "..."}'

Uploaded content is embedded and stored as tenant-specific knowledge. Supported file types for multipart upload: .txt, .md, .html, .csv, .json, .xml (max 10MB).

Web Search Providers

Web search is optional — Skipper works without it, relying on the knowledge base alone. When configured, Skipper falls back to web search if the knowledge base doesn’t have a good answer.

Provider	Setup	Notes
Tavily	`SEARCH_PROVIDER=tavily` + API key	Returns clean extracted content. Best for RAG.
Brave Search	`SEARCH_PROVIDER=brave` + API key	Fast, privacy-focused. Returns snippets.
SearXNG	`SEARCH_PROVIDER=searxng` + `SEARCH_API_URL`	Self-hosted, no API key needed.

Self-Hosting Dependencies

Skipper can run entirely on your own infrastructure with no external API calls. This section covers deploying the self-hostable components.

Ollama (LLM + Embeddings)

Ollama runs open-weight models locally. It provides both chat completion and embedding endpoints that Skipper uses directly.

Docker (recommended for production):

ollama:
  image: ollama/ollama:latest
  ports:
    - "11434:11434"
  volumes:
    - ollama_data:/root/.ollama # persist downloaded models
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia # GPU passthrough (NVIDIA)
            count: all
            capabilities: [gpu]
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
    interval: 15s
    timeout: 5s
    retries: 3
    start_period: 30s

For AMD GPUs, use image: ollama/ollama:rocm. For CPU-only, omit the deploy.resources block — expect ~2-5 tok/s on a 32B model.

Pull models after starting Ollama:

# Chat model (pick one per tier — see Model Selection above)
docker exec ollama ollama pull qwen2.5:32b

# Embedding model
docker exec ollama ollama pull nomic-embed-text

# Utility model (optional, for contextual retrieval / query rewriting)
docker exec ollama ollama pull qwen2.5:7b

Skipper env vars for Ollama:

LLM_PROVIDER=ollama
LLM_MODEL=qwen2.5:32b
LLM_API_URL=http://ollama:11434/v1       # use container name in Docker networks

EMBEDDING_PROVIDER=ollama
EMBEDDING_MODEL=nomic-embed-text
EMBEDDING_API_URL=http://ollama:11434     # embeddings use /api/embeddings, not /v1

SearXNG (Web Search)

SearXNG is a self-hosted metasearch engine. No API key needed.

docker run -d --name searxng -p 8080:8080 searxng/searxng:latest

Skipper env vars:

SEARCH_PROVIDER=searxng
SEARCH_API_URL=http://searxng:8080

Self-Hosted Reranker

For fully local retrieval quality, run a cross-encoder model via Text Embeddings Inference (TEI) or a similar /v1/rerank-compatible server:

docker run -d --name reranker -p 8787:80 \
  ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id BAAI/bge-reranker-v2-m3

Skipper env vars:

RERANKER_PROVIDER=generic
RERANKER_MODEL=BAAI/bge-reranker-v2-m3
RERANKER_API_URL=http://reranker:8787

Putting It Together

A fully self-hosted stack with proper health checks and startup ordering:

services:
  ollama:
    image: ollama/ollama:latest
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 15s
      timeout: 5s
      retries: 5
      start_period: 30s

  searxng:
    image: searxng/searxng:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/healthz"]
      interval: 10s
      timeout: 5s
      retries: 3

  skipper:
    # ... your existing Skipper config ...
    environment:
      LLM_PROVIDER: ollama
      LLM_MODEL: qwen2.5:32b
      LLM_API_URL: http://ollama:11434/v1
      EMBEDDING_PROVIDER: ollama
      EMBEDDING_MODEL: nomic-embed-text
      EMBEDDING_API_URL: http://ollama:11434
      SEARCH_PROVIDER: searxng
      SEARCH_API_URL: http://searxng:8080
    depends_on:
      ollama:
        condition: service_healthy
      searxng:
        condition: service_healthy

volumes:
  ollama_data:

Architecture

graph TD
    Q["User Query"] --> PRE["Pre-retrieval<br/><small>auto · every message · fast path</small>"]
    Q --> TOOL["search_knowledge<br/><small>explicit LLM tool call</small>"]
    Q --> WEB["search_web<br/><small>explicit LLM tool call</small>"]

    PRE --> EMB1["Embed Query"]
    EMB1 --> HYB1["Hybrid Search"]
    HYB1 --> RERANK1["Cross-Encoder Rerank"]
    RERANK1 --> DEDUP1["Deduplicate"]
    DEDUP1 --> LLM

    TOOL --> REWRITE["Query Rewrite<br/><small>utility LLM</small>"]
    REWRITE --> HYDE{"HyDE<br/>enabled?"}
    HYDE -->|yes| HYPO["Generate Hypothetical Answer<br/><small>utility LLM → embed</small>"]
    HYDE -->|no| EMB2["Embed Rewritten Query"]
    HYPO --> HYB2["Hybrid Search"]
    EMB2 --> HYB2
    HYB2 --> RERANK2["Cross-Encoder Rerank"]
    RERANK2 --> DEDUP2["Deduplicate"]
    DEDUP2 --> LLM

    WEB --> REWRITE2["Query Rewrite<br/><small>utility LLM</small>"]
    REWRITE2 --> SEARCH["Web Search Provider"]
    SEARCH --> LLM

    LLM["LLM + MCP Tools"] --> CONF["Confidence Tagging"]
    CONF --> RESP["Response with Citations"]

Skipper runs as the api_consultant service (ports 18018 HTTP, 19007 gRPC). It connects to the API Gateway via MCP to access platform tools (diagnostics, stream management, GraphQL). The Gateway proxies Skipper’s ask_consultant tool to external MCP agents, which runs the full orchestrator pipeline internally.

Retrieval and Reranking

Every user message triggers an automatic pre-retrieval pass that searches both tenant-specific and global knowledge. The LLM can also explicitly call search_knowledge for targeted lookups.

Search uses a hybrid approach: 70% cosine vector similarity + 30% PostgreSQL full-text ranking. Results are then reranked — when a cross-encoder is configured (RERANKER_PROVIDER), it scores (query, chunk) pairs together for semantic understanding; otherwise a keyword-overlap heuristic is used (0.7 × vector similarity + 0.3 × query term overlap). Results are deduplicated to a maximum of 2 chunks per source URL.

Query rewriting (requires utility LLM) transforms conversational questions into search-optimized queries before embedding. This bridges vocabulary gaps between how users phrase questions and how documentation is written. Applied to search_knowledge and search_web tool calls; skipped for pre-retrieval to keep latency low.

HyDE (SKIPPER_ENABLE_HYDE=true) generates a hypothetical answer via the utility LLM, then embeds that answer instead of the question for vector search. The resulting vector is closer in embedding space to real documentation. Adds ~500-1500ms latency per search_knowledge call.

Dependencies

Dependency	Purpose
PostgreSQL + pgvector	Vector store, conversations, usage tracking
LLM provider	Chat completion (OpenAI, Anthropic, or Ollama)
Embedding provider	Document/query embedding (OpenAI or Ollama)
API Gateway (MCP)	Platform tools — diagnostics, stream CRUD, GraphQL
Periscope (gRPC)	Stream health metrics (via Gateway)
Commodore (gRPC)	Tenant and stream context

Heartbeat Monitoring

Skipper periodically analyzes the health of active streams and infrastructure. The heartbeat agent runs every HEARTBEAT_INTERVAL (default 30 minutes) and processes each eligible tenant.

How It Works

For each tenant with active streams and a qualifying billing tier:

Snapshot — fetches stream health and client QoE metrics from Periscope for the last 15 minutes
Baseline comparison — compares current metrics against Welford running averages that the heartbeat has been building over time. Deviations are detected when a metric exceeds 2 standard deviations from the mean (requires at least 5 samples to avoid false positives during warmup).
Triage — a deterministic decision cascade (no LLM calls):
- Hard threshold violation → investigate
- Cross-metric correlation with ≥ 50% confidence → investigate
- Baseline deviations → flag for review
- Everything normal → skip
Per-stream drill-down — when something looks wrong, Skipper fetches per-stream metrics, compares each against the tenant-wide baseline, and identifies the most anomalous streams (up to 20).
Investigation — only when the triage result is “investigate”, Skipper runs the chat orchestrator with the full diagnostic context (deviations, correlations, per-stream anomalies, raw metrics). This is the only step that uses LLM tokens.
Notification — investigation reports and flag summaries are dispatched via email, WebSocket, or MCP.

Healthy tenants consume zero LLM calls per heartbeat cycle.

Cross-Metric Correlation

The diagnostics engine matches deviation patterns against 5 known failure hypotheses: network degradation, encoder overload, viewer-side issues, ingest instability, and CDN pressure. Each hypothesis has expected signal patterns (e.g., network degradation = packet_loss↑ + bandwidth_in↓ + buffer_health↓). Confidence is calculated as matched signals / total expected signals.

Infrastructure Monitoring

Independently of per-tenant stream health, the heartbeat checks node-level metrics across all active clusters:

CPU ≥ 95% and memory ≥ 95% require the violation to persist across 3 of 4 five-minute windows before alerting (prevents transient spikes)
Disk ≥ 90% triggers a warning; ≥ 95% is critical (fires immediately since disk doesn’t self-heal)
Alerts are emailed to the cluster owner with a 4-hour cooldown per node/alert type

Configuration

Heartbeat:

Variable	Purpose	Default
`HEARTBEAT_INTERVAL`	How often to run the heartbeat cycle	`30m`
`SKIPPER_REQUIRED_TIER_LEVEL`	Minimum billing tier for heartbeat processing	`3`

Notifications (email):

Variable	Purpose	Default
`SMTP_HOST`	SMTP server hostname	—
`SMTP_PORT`	SMTP server port	`587`
`SMTP_FROM`	Sender email address	—
`SMTP_USERNAME`	SMTP authentication username	—
`SMTP_PASSWORD`	SMTP authentication password	—

When enabled, Skipper drafts social media posts based on noteworthy platform events. Posts are sent as drafts to a configured email address for human review — nothing is auto-published.

How It Works

The social agent checks for noteworthy events every SKIPPER_SOCIAL_INTERVAL (default 2 hours):

Event collection — the heartbeat agent and knowledge crawler push signals into a collector as they run (platform stats, federation metrics, newly embedded pages)
Detection — the detector classifies signals, compares against stored baselines, and scores them:
- Platform stats: new viewer record, bandwidth milestone (1/10/100/1000 Gbps), significant viewer surge (>25% growth)
- Federation: latency improvement (>20% drop), event volume milestone
- Knowledge: newly crawled and embedded documentation page
Composition — the utility LLM drafts a tweet (max 280 characters). It receives the last 10 posts to avoid repeating themes. If the draft exceeds 280 characters, it retries once, then truncates at the nearest word boundary.
Publishing — the draft is saved to the database and emailed to SKIPPER_SOCIAL_NOTIFY_EMAIL for review.

The first observation for each signal type saves a baseline instead of posting — subsequent signals are compared against it.

Configuration

Variable	Purpose	Default
`SKIPPER_SOCIAL_ENABLED`	Enable the social posting agent	`false`
`SKIPPER_SOCIAL_INTERVAL`	How often to check for noteworthy events	`2h`
`SKIPPER_SOCIAL_MAX_PER_DAY`	Max posts per day (`0` = unlimited)	`2`
`SKIPPER_SOCIAL_NOTIFY_EMAIL`	Email to send draft tweets to (required when enabled)	—

Skipper Setup

Model Selection

Embeddings

LLM — Local (Ollama)

LLM — Hosted or Network-Provided

Cost Estimates

Configuration

Environment Variables

Configuration Examples

Knowledge Base

How the Crawler Works

Change Detection

SPA Rendering

Contextual Retrieval

Retrieval Quality

Built-in Sources

Adding Custom Sources

Uploading Documents

Web Search Providers

Self-Hosting Dependencies

Ollama (LLM + Embeddings)

SearXNG (Web Search)

Self-Hosted Reranker

Putting It Together

Architecture

Retrieval and Reranking

Dependencies

Heartbeat Monitoring

How It Works

Cross-Metric Correlation

Infrastructure Monitoring

Configuration

Social Posting

How It Works

Configuration