Skipper Setup
Skipper requires three things: an LLM for chat, an embedding model for the knowledge base, and PostgreSQL with pgvector. Optionally, add a web search provider for live lookups beyond the knowledge base.
All configuration is through environment variables. The LLM and embedding providers are independent — you can mix local embeddings with an API-hosted LLM.
Model Selection
Embeddings
Embedding models convert text into vectors for similarity search. Quality here affects every answer — bad retrieval poisons everything downstream.
| Option | Run where | Notes |
|---|---|---|
| nomic-embed-text | Ollama / local CPU or GPU | Strong default for local deployments. |
| bge-m3 | Ollama / local CPU or GPU | Multi-lingual, multi-granularity. Heavier but excellent. |
| Hosted embedding API | OpenAI-compatible provider | Simple operations path; verify current provider pricing. |
| Blueclaw / Livepeer-network AI | OpenAI-compatible or BYOC runtime | Good fit when you want network-provided inference/embeddings instead of separate SaaS APIs. |
Recommendation: Start with local or network-provided embeddings if available. If retrieval quality is insufficient, try a higher-quality hosted embedding model and re-crawl.
LLM — Local (Ollama)
Self-hosted via Ollama. No per-token cost — you pay for hardware.
Tier 1 — Lightweight (7-8B) — single 8-16GB GPU
| Model | Context | VRAM (Q4) | Notes |
|---|---|---|---|
| Qwen 3 8B | 128K | ~5GB | Strong instruction following + tool use |
| Llama 3.1 8B | 128K | ~5GB | Battle-tested, huge community |
| Mistral 7B v0.3 | 32K | ~4.5GB | Fast, good at structured prompts |
Good for simple lookups. Struggles with multi-hop reasoning across multiple chunks.
Tier 2 — Sweet Spot (14-32B) — single 24-48GB GPU (recommended)
| Model | Context | VRAM (Q4) | Notes |
|---|---|---|---|
| Qwen 2.5 32B | 128K | ~20GB | Best all-around. Strong RAG + instruction following. |
| Qwen 3 32B | 128K | ~20GB | Built-in thinking mode. Good for reasoning. |
| Mistral Small 3.1 24B | 128K | ~15GB | Excellent function calling. Fits on a 24GB card. |
| Command R 35B | 128K | ~22GB | Purpose-built for RAG by Cohere. Native citations. |
RTX 3090/4090 (24GB) runs Q4_K_M of 32B models at ~15-25 tok/s — fine for chat.
Tier 3 — Heavy (70B+) — 2x 24GB or 1x 80GB GPU
| Model | Context | VRAM (Q4) | Notes |
|---|---|---|---|
| Llama 3.3 70B | 128K | ~40GB | Near 405B quality. |
| Qwen 2.5 72B | 128K | ~42GB | Excellent multilingual + code. |
The jump from 32B to 70B is ~10-15% better on hard queries. Usually not worth 2-3x the hardware for a domain consultant with good RAG.
LLM — Hosted or Network-Provided
No local hardware needed. Skipper uses OpenAI-compatible chat and embedding endpoints, so hosted SaaS, Blueclaw, Livepeer-network inference, or a BYOC model container can all fit the same configuration shape when they expose compatible APIs.
For Livepeer-network deployments, check the current Livepeer AI docs and ask in the Livepeer Discord before planning around a custom model or BYOC container. Availability changes as model runners and orchestrator support evolve. BYOC is useful when the model you want is not exposed by a managed endpoint yet.
| Option | Best For | Notes |
|---|---|---|
| Hosted SaaS model API | Fastest path to production | Verify current pricing and data policy directly with the provider. |
| Blueclaw | OpenAI-compatible gateway for agents | Can reduce vendor lock-in if it has the models you need. |
| Livepeer-network inference | Keeping inference spend and usage on-network | Good strategic fit when available for your target models. |
| Livepeer BYOC container | Custom models or runtime control | More operator work, but lets you bring models not exposed by default. |
Cost Estimates
Skipper cost depends on:
- conversation history included in the prompt
- retrieved knowledge chunks and citations
- tool-call count and tool result size
- whether query rewriting, HyDE, reranking, or web search are enabled
- provider pricing, model choice, and whether inference runs locally or on-network
Early planning shape:
| Deployment shape | Cost profile | When to choose it |
|---|---|---|
| Local Ollama | Hardware and operations, no per-token API bill | Self-hosted clusters with spare GPU capacity. |
| Blueclaw / Livepeer-network AI | Network/provider pricing, less separate SaaS usage | Agents-first deployments and on-network usage goals. |
| Budget hosted SaaS model | Usually low at modest chat volume | Fast setup when answer quality requirements are modest. |
| Higher-quality hosted SaaS model | Can climb quickly with long prompts/tool results | Harder support cases or premium account tiers. |
For budgeting, use Skipper’s recorded tokensInput, tokensOutput, tool calls, and provider/model
fields as the source of truth. Real deployments should compare that usage data with current
provider pricing or local inference costs before committing to a support-tier margin.
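The budgeting arithmetic described above can be sketched as follows. This is a minimal illustration, not part of Skipper: the function name `estimate_cost` is hypothetical, and the per-million-token prices are arguments you supply from your provider's current price sheet.

```python
def estimate_cost(tokens_input, tokens_output,
                  input_price_per_million, output_price_per_million):
    """Estimate spend from recorded token usage.

    tokens_input / tokens_output correspond to Skipper's recorded
    tokensInput and tokensOutput fields. Prices are caller-supplied
    placeholders, not real provider pricing.
    """
    input_cost = tokens_input / 1_000_000 * input_price_per_million
    output_cost = tokens_output / 1_000_000 * output_price_per_million
    return input_cost + output_cost
```

Summing this per provider/model pair over a billing period gives a figure to compare against a support-tier margin.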
Configuration
Environment Variables
LLM:
| Variable | Purpose | Default |
|---|---|---|
| LLM_PROVIDER | openai, anthropic, or ollama | — |
| LLM_MODEL | Model identifier | — |
| LLM_API_KEY | API credentials | — |
| LLM_API_URL | Custom endpoint (OpenRouter, self-hosted) | Provider default |
| LLM_MAX_TOKENS | Max output tokens per response | 4096 |
Embeddings (falls back to LLM_* when unset):
| Variable | Purpose | Default |
|---|---|---|
| EMBEDDING_PROVIDER | openai or ollama | LLM_PROVIDER |
| EMBEDDING_MODEL | Embedding model name | LLM_MODEL |
| EMBEDDING_API_KEY | Embedding API credentials | LLM_API_KEY |
| EMBEDDING_API_URL | Embedding endpoint | LLM_API_URL |
Utility LLM (for background tasks like contextual retrieval; falls back to LLM_* when unset):
| Variable | Purpose | Default |
|---|---|---|
| UTILITY_LLM_PROVIDER | Cheap LLM for background tasks (query rewriting, HyDE, contextual retrieval) | LLM_PROVIDER |
| UTILITY_LLM_MODEL | Utility model identifier | LLM_MODEL |
| UTILITY_LLM_API_KEY | Utility LLM credentials | LLM_API_KEY |
| UTILITY_LLM_API_URL | Utility LLM endpoint | LLM_API_URL |
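The fallback rule for the EMBEDDING_* and UTILITY_LLM_* variables can be sketched as below. This is an illustration of the documented behavior, not Skipper's actual code: `resolve_settings` is a hypothetical name, and treating an empty string the same as an unset variable is an assumption.

```python
import os

SUFFIXES = ("PROVIDER", "MODEL", "API_KEY", "API_URL")

def resolve_settings(prefix, env=None):
    """Resolve <prefix>_* settings, falling back to LLM_* when unset.

    Assumption: an empty value counts as unset.
    """
    env = os.environ if env is None else env
    resolved = {}
    for suffix in SUFFIXES:
        # Prefer the specific variable, then the matching LLM_* variable.
        resolved[suffix] = env.get(f"{prefix}_{suffix}") or env.get(f"LLM_{suffix}")
    return resolved
```

For example, `resolve_settings("EMBEDDING")` with only EMBEDDING_PROVIDER and EMBEDDING_MODEL set would take the key and URL from LLM_API_KEY and LLM_API_URL.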
Web search (optional):
| Variable | Purpose | Default |
|---|---|---|
| SEARCH_PROVIDER | tavily, brave, or searxng | — |
| SEARCH_API_KEY | Search API key (not needed for SearXNG) | — |
| SEARCH_API_URL | Custom endpoint (required for SearXNG) | Provider default |
Retrieval quality:
| Variable | Purpose | Default |
|---|---|---|
| RERANKER_PROVIDER | Cross-encoder reranker: cohere, jina, voyage, or generic | — (keyword fallback) |
| RERANKER_MODEL | Reranker model (e.g. rerank-4-pro, rerank-2.5, jina-reranker-v2-base-multilingual) | — |
| RERANKER_API_KEY | Reranker API credentials | LLM_API_KEY |
| RERANKER_API_URL | Reranker endpoint (required for generic provider) | Provider default |
| SKIPPER_ENABLE_HYDE | Enable Hypothetical Document Embeddings for search_knowledge | false |
Knowledge base:
| Variable | Purpose | Default |
|---|---|---|
| SITEMAPS | Comma-separated sitemap URLs | — |
| SKIPPER_SITEMAPS_DIR | Directory of source files (re-read each cycle) | — |
| CRAWL_INTERVAL | Refresh interval | 24h |
| CHUNK_TOKEN_LIMIT | Max BPE tokens per chunk | 500 |
| CHUNK_TOKEN_OVERLAP | Overlap tokens between adjacent chunks | 50 |
| SKIPPER_ENABLE_RENDERING | Enable headless Chrome for JS-rendered pages | false |
| SKIPPER_CONTEXTUAL_RETRIEVAL | Use utility LLM to prepend context before embedding | false |
| SKIPPER_LINK_DISCOVERY | Discover and crawl same-domain links | false |
| SKIPPER_SEARCH_LIMIT | Default result limit for search_knowledge | 8 |
Service:
| Variable | Purpose | Default |
|---|---|---|
| SKIPPER_WEB_UI | Enable standalone web UI at / | true |
| SKIPPER_API_KEY | API key for admin WebUI auth | — |
| SKIPPER_WEB_UI_INSECURE | Allow the standalone WebUI without an API key | false |
| SKIPPER_REQUIRED_TIER_LEVEL | Minimum subscription tier | 3 |
| SKIPPER_CHAT_RATE_LIMIT_PER_HOUR | Rate limit per tenant | 0 (unlimited) |
| SKIPPER_CHAT_RATE_LIMIT_OVERRIDES | Per-tenant overrides (tenant_id:limit,...) | — |
| SKIPPER_ADMIN_TENANT_ID | Tenant ID for global/platform knowledge | — |
| SKIPPER_MAX_HISTORY_MESSAGES | Max conversation messages loaded per request | 20 |
| GATEWAY_MCP_URL | Internal Gateway MCP endpoint for platform tools | derived from Bridge mesh URL |
| GATEWAY_PUBLIC_URL | Public API Gateway base URL; fallback for MCP when GATEWAY_MCP_URL is unset | — |
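To make the `tenant_id:limit,...` format of SKIPPER_CHAT_RATE_LIMIT_OVERRIDES concrete, here is a small parsing sketch. The function names are hypothetical, and the choices to strip whitespace and to reject malformed entries are assumptions rather than documented Skipper behavior.

```python
def parse_rate_limit_overrides(raw):
    """Parse 'tenant_id:limit,tenant_id:limit' into a dict of int limits."""
    overrides = {}
    for entry in raw.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate empty input and trailing commas
        # Split on the last colon so tenant IDs may themselves contain colons.
        tenant_id, sep, limit = entry.rpartition(":")
        if not sep or not tenant_id:
            raise ValueError(f"expected tenant_id:limit, got {entry!r}")
        overrides[tenant_id.strip()] = int(limit)
    return overrides

def limit_for(tenant_id, default_limit, overrides):
    """Return the tenant's override, or the global per-hour default."""
    return overrides.get(tenant_id, default_limit)
```

With SKIPPER_CHAT_RATE_LIMIT_PER_HOUR as `default_limit`, a value of 0 keeps its documented meaning of unlimited.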
Social posting:
| Variable | Purpose | Default |
|---|---|---|
| SKIPPER_SOCIAL_ENABLED | Enable event-driven social posting agent | false |
| SKIPPER_SOCIAL_INTERVAL | How often to check for noteworthy events | 2h |
| SKIPPER_SOCIAL_MAX_PER_DAY | Max posts per day (0 = unlimited) | 2 |
| SKIPPER_SOCIAL_NOTIFY_EMAIL | Email to send draft tweets to (required when enabled) | — |
Configuration Examples
Zero ongoing cost. Requires a GPU for the LLM; embeddings run on CPU.
```bash
# LLM
LLM_PROVIDER=ollama
LLM_MODEL=qwen2.5:32b
LLM_API_URL=http://localhost:11434/v1

# Utility LLM — cheap model for background tasks (contextual retrieval)
UTILITY_LLM_PROVIDER=ollama
UTILITY_LLM_MODEL=qwen2.5:7b
UTILITY_LLM_API_URL=http://localhost:11434/v1

# Embeddings (same Ollama instance, different endpoint)
EMBEDDING_PROVIDER=ollama
EMBEDDING_MODEL=nomic-embed-text
EMBEDDING_API_URL=http://localhost:11434

# Web search (optional, self-hosted)
SEARCH_PROVIDER=searxng
SEARCH_API_URL=http://localhost:8080

# Reranker (optional — self-hosted cross-encoder via generic provider)
# RERANKER_PROVIDER=generic
# RERANKER_MODEL=BAAI/bge-reranker-v2-m3
# RERANKER_API_URL=http://localhost:8787

# HyDE — improves search_knowledge quality at ~500-1500ms extra latency
# SKIPPER_ENABLE_HYDE=true
```
Pull models first:
```bash
ollama pull qwen2.5:32b
ollama pull qwen2.5:7b
ollama pull nomic-embed-text
```
No hardware needed.
```bash
# LLM — Claude Sonnet
LLM_PROVIDER=anthropic
LLM_MODEL=claude-sonnet-4-5-20250929
LLM_API_KEY=sk-ant-...

# Utility LLM — Haiku for background tasks (contextual retrieval)
UTILITY_LLM_PROVIDER=anthropic
UTILITY_LLM_MODEL=claude-haiku-4-5-20251001
UTILITY_LLM_API_KEY=sk-ant-...

# Embeddings — OpenAI (Anthropic doesn't offer embeddings)
EMBEDDING_PROVIDER=openai
EMBEDDING_MODEL=text-embedding-3-small
EMBEDDING_API_KEY=sk-...

# Web search
SEARCH_PROVIDER=tavily
SEARCH_API_KEY=tvly-...

# Reranker — Cohere rerank
RERANKER_PROVIDER=cohere
RERANKER_MODEL=rerank-4-pro
RERANKER_API_KEY=...

# HyDE — improves search_knowledge quality at ~500-1500ms extra latency
# SKIPPER_ENABLE_HYDE=true
```
One API key, access to all providers. Swap models by changing one string.
```bash
# LLM — any model via OpenRouter (OpenAI-compatible)
LLM_PROVIDER=openai
LLM_MODEL=anthropic/claude-sonnet-4-5
LLM_API_KEY=sk-or-...
LLM_API_URL=https://openrouter.ai/api/v1

# Utility LLM — cheap model via OpenRouter for background tasks
UTILITY_LLM_PROVIDER=openai
UTILITY_LLM_MODEL=anthropic/claude-haiku-4-5
UTILITY_LLM_API_KEY=sk-or-...
UTILITY_LLM_API_URL=https://openrouter.ai/api/v1

# Embeddings — OpenAI directly
EMBEDDING_PROVIDER=openai
EMBEDDING_MODEL=text-embedding-3-small
EMBEDDING_API_KEY=sk-...

# Web search
SEARCH_PROVIDER=brave
SEARCH_API_KEY=BSA...
```
OpenRouter doesn’t proxy embeddings or reranking, so use provider APIs directly for those.
```bash
# Reranker — pick one:

# Voyage AI (200M free tokens, instruction-following)
RERANKER_PROVIDER=voyage
RERANKER_MODEL=rerank-2.5
RERANKER_API_KEY=pa-...

# ZeroEntropy (#1 quality, fastest, half-price)
# RERANKER_PROVIDER=generic
# RERANKER_MODEL=zerank-2
# RERANKER_API_KEY=ze-...
# RERANKER_API_URL=https://api.zeroentropy.dev/v1/models

# Jina (cheapest, strong multilingual)
# RERANKER_PROVIDER=jina
# RERANKER_MODEL=jina-reranker-v2-base-multilingual
# RERANKER_API_KEY=jina_...
```
Local embeddings (free), API LLM (no GPU needed).
```bash
# LLM — API
LLM_PROVIDER=anthropic
LLM_MODEL=claude-sonnet-4-5-20250929
LLM_API_KEY=sk-ant-...

# Utility LLM — Haiku for background tasks (much cheaper than Sonnet)
UTILITY_LLM_PROVIDER=anthropic
UTILITY_LLM_MODEL=claude-haiku-4-5-20251001
UTILITY_LLM_API_KEY=sk-ant-...

# Embeddings — local Ollama
EMBEDDING_PROVIDER=ollama
EMBEDDING_MODEL=nomic-embed-text
EMBEDDING_API_URL=http://localhost:11434

# Web search
SEARCH_PROVIDER=tavily
SEARCH_API_KEY=tvly-...

# Reranker — Cohere rerank
RERANKER_PROVIDER=cohere
RERANKER_MODEL=rerank-4-pro
RERANKER_API_KEY=...

# HyDE
SKIPPER_ENABLE_HYDE=true
```
Knowledge Base
Skipper’s knowledge base is populated by crawling documentation sources and embedding them into pgvector.
How the Crawler Works
1. Fetch pages from sitemaps or direct URLs
2. Detect whether the page needs headless rendering (SPA detection)
3. Extract readable text via Readability → Markdown (strips navigation, boilerplate)
4. Chunk into ~500-token segments with 50-token overlap
5. Embed each chunk via the configured embedding model
6. Store in pgvector with metadata (source URL, title, source type, ingestion timestamp)
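The chunking step can be pictured as a sliding window over the token sequence. The sketch below is an assumption-laden illustration, not Skipper's chunker: `chunk_tokens` is a hypothetical name, it works on an already-tokenized list, and it ignores the heading-aware block splitting the real pipeline performs.

```python
def chunk_tokens(tokens, limit=500, overlap=50):
    """Split a token list into windows of `limit` tokens.

    Adjacent windows share `overlap` tokens. The defaults mirror
    CHUNK_TOKEN_LIMIT and CHUNK_TOKEN_OVERLAP.
    """
    if limit <= 0 or not 0 <= overlap < limit:
        raise ValueError("need limit > 0 and 0 <= overlap < limit")
    step = limit - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + limit])
        if start + limit >= len(tokens):
            break  # this window already reached the end of the document
    return chunks
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of embedding a few tokens twice.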
The full ingestion pipeline handles everything from sitemap discovery through to vector storage:
graph TD
SRC["Sitemap URLs / Direct Pages / Uploads"] --> FETCH["Fetch Sitemap XML"]
FETCH --> VALIDATE["URL Validation<br/><small>SSRF check · DNS resolution ·<br/>private CIDR blocking</small>"]
VALIDATE --> ROBOTS["robots.txt<br/><small>SkipperBot/1.0</small>"]
ROBOTS --> CACHE{"Cached?<br/><small>TTL · ETag · Hash</small>"}
CACHE -->|unchanged| SKIP[Skip]
CACHE -->|new or changed| HTTP["HTTP Fetch"]
HTTP --> DETECT{"SPA Detection<br/><small>score ≥ 4?</small>"}
DETECT -->|static| EXTRACT
DETECT -->|SPA or empty shell| RENDER["Headless Chrome<br/><small>Rod · stealth mode ·<br/>blocks images/fonts/CSS</small>"]
RENDER --> EXTRACT["Content Extraction<br/><small>Readability → Markdown<br/>fallback: DOM walker</small>"]
EXTRACT --> HASH{"Content Hash<br/>SHA-256"}
HASH -->|unchanged| SKIP
HASH -->|new| CHUNK["Chunk<br/><small>~500 tokens · 50 overlap<br/>heading-aware blocks</small>"]
CHUNK --> CTX{"Contextual<br/>Retrieval?"}
CTX -->|enabled| UTIL["Utility LLM<br/><small>1-2 sentence context<br/>prepended per chunk</small>"]
UTIL --> EMBED["Embed"]
CTX -->|disabled| EMBED
EMBED --> STORE["pgvector<br/><small>atomic upsert per source</small>"]
Change Detection
The crawler runs every CRAWL_INTERVAL (default 24h) and uses three layers of change detection to avoid unnecessary work:
- Source TTL — skip sources crawled within the interval
- HTTP 304 — conditional fetch with ETag / If-Modified-Since
- Content hash — SHA-256 comparison skips re-embedding unchanged pages
When rendering is enabled, Skipper also sends a HEAD request before launching Chrome — if the Content-Length matches the cached value, it skips headless rendering entirely.
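Two of the change-detection layers, the source TTL and the content hash, can be sketched in a few lines. The function names and the idea of passing stored state as arguments are assumptions for illustration. The HTTP 304 layer is left out because it lives in the HTTP client's conditional request handling.

```python
import hashlib
from datetime import datetime, timedelta

def within_ttl(last_crawled_at, now, interval):
    """Source TTL: skip a source crawled less than `interval` ago."""
    return last_crawled_at is not None and now - last_crawled_at < interval

def content_hash(text):
    """SHA-256 hex digest of the extracted page text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_reembedding(new_text, stored_hash):
    """Content hash: re-embed only when the text actually changed."""
    return stored_hash is None or content_hash(new_text) != stored_hash
```

Hashing the extracted text rather than the raw HTML is a design choice in this sketch: cosmetic markup changes then do not trigger re-embedding.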
SPA Rendering
Many documentation sites are JavaScript-heavy SPAs that return empty shells to a plain HTTP fetch. When SKIPPER_ENABLE_RENDERING=true, the crawler auto-detects these pages using a scoring heuristic:
- SPA mount points (<div id="root">, <div id="app">, <div id="__next">)
- <noscript> tags, framework markers (data-reactroot, ng-app, data-v-)
- High script-to-text ratio, low text density in <body>
If the score reaches 4 or the extracted text has fewer than 10 words, Skipper renders the page in headless Chromium (via Rod) with stealth mode enabled and non-essential resources (images, fonts, CSS) blocked. The browser waits 500ms for DOM stability before extracting the rendered HTML.
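A rough sketch of the decision rule follows. The threshold of 4 and the 10-word floor come from the text above, but the per-signal weights are invented for illustration because this page does not state them. The script-to-text ratio signal is omitted for brevity.

```python
MOUNT_POINTS = ('id="root"', 'id="app"', 'id="__next"')
FRAMEWORK_MARKERS = ("data-reactroot", "ng-app", "data-v-")

def spa_score(html):
    """Score SPA signals. Weights are hypothetical placeholders."""
    score = 0
    if any(marker in html for marker in MOUNT_POINTS):
        score += 2  # assumed weight
    if "<noscript" in html.lower():
        score += 1  # assumed weight
    if any(marker in html for marker in FRAMEWORK_MARKERS):
        score += 1  # assumed weight
    return score

def needs_rendering(html, extracted_text, threshold=4, min_words=10):
    """Render when the score reaches the threshold or the text is nearly empty."""
    return spa_score(html) >= threshold or len(extracted_text.split()) < min_words
```

The word-count check acts as a safety net: a page that carries no recognizable framework markers but yields almost no text is still sent to the headless browser.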
Content is extracted using Mozilla’s Readability algorithm (converted to Markdown). If Readability returns too little text, a fallback DOM walker strips navigation, sidebars, and hidden elements.
Contextual Retrieval
When SKIPPER_CONTEXTUAL_RETRIEVAL=true, Skipper uses a utility LLM to prepend 1-2 sentences of context to each chunk before embedding. This helps disambiguation — a chunk about “buffer settings” gets tagged with whether it’s about OBS, FFmpeg, or MistServer. The context is used only for embedding; the original chunk text is stored for retrieval.
Retrieval Quality
Three features improve retrieval accuracy at query time. All are optional and independent — enable what fits your latency and cost budget.
Cross-encoder reranking (RERANKER_PROVIDER) replaces the default keyword-overlap heuristic with a model that scores (query, chunk) pairs together. This understands semantic equivalence that keyword matching cannot — “rebuffering issues” matches “playback stalling”, “my stream keeps dying” matches “connection timeout troubleshooting”. Applied to both pre-retrieval (every message) and search_knowledge tool calls.
| Provider | Model | Quality | Latency | Cost | Notes |
|---|---|---|---|---|---|
| Cohere | rerank-4-pro | #2 ELO | ~600ms | ~$2/1K queries | Incumbent, widest cloud availability (Bedrock, Azure). rerank-4-fast trades quality for speed. |
| Voyage AI | rerank-2.5 | #4 ELO | ~600ms | $0.05/1M tokens | 200M free tokens. Instruction-following. 1K docs/request. Backed by MongoDB. |
| Jina | jina-reranker-v2-base-multilingual | #12 ELO | ~750ms | ~$0.02/1M tokens | Cheapest. Strong multilingual and code search. 131K context in v3. |
| ZeroEntropy | zerank-2 | #1 ELO | ~265ms | $0.025/1M tokens | Best quality + speed + price. Use generic provider. |
| Contextual AI | rerank-v2-instruct | #9 ELO | ~3.3s | $0.05/1M tokens | Instruction-following, recency-aware. Use generic provider. |
| Generic | Any /v1/rerank endpoint | Varies | Varies | Varies | Self-hosted BGE/MXBai models, or any compatible API. |
Query rewriting (automatic when utility LLM is configured) transforms conversational questions into search-optimized queries before embedding. “My European viewers are buffering a lot” becomes “European CDN edge rebuffering latency troubleshooting”. Applied to search_knowledge and search_web tool calls; skipped for pre-retrieval to keep latency low.
HyDE (SKIPPER_ENABLE_HYDE=true) generates a hypothetical answer to the user’s question, then embeds that answer instead of the question for vector search. The resulting vector is closer in embedding space to real documentation. Adds ~500-1500ms latency per search_knowledge call. Not used for pre-retrieval or web search.
Built-in Sources
Skipper ships with curated source files in config/skipper/sitemaps/:
| File | Content |
|---|---|
| frameworks.txt | Platform docs and marketing site (env-templated URLs) |
| ecosystem.txt | MistServer, Livepeer, Daydream, WebRTC, DASH, Streamplace |
| obs.txt | OBS Studio knowledge base articles |
| ffmpeg.txt | FFmpeg tools, codecs, formats, protocols, filters |
| srt.txt | SRT protocol docs from Haivision |
| nginx-rtmp.txt | nginx-rtmp-module wiki |
| hls-spec.txt | HLS specification (IETF RFCs) |
Set SKIPPER_SITEMAPS_DIR to the directory containing these files. In Docker Compose, this is mounted as a read-only volume at /etc/skipper/sitemaps.
Adding Custom Sources
Create a .txt file in the sitemaps directory. Three line formats are supported:
Sitemap URLs — standard XML sitemaps:
```text
https://logbook.example.com/sitemap.xml
```
Direct page URLs — for sites without sitemaps, prefix with `page:`
```text
# Internal documentation
page:https://wiki.example.com/streaming-guide
page:https://wiki.example.com/encoder-settings
```
Force-rendered URLs — for SPAs that always need headless Chrome, prefix with `render:`
```text
# React/Next.js docs that return empty shells without JS
render:https://spa-docs.example.com/getting-started
render:https://spa-docs.example.com/api-reference
```
Lines starting with # are comments. Empty lines are ignored. Environment variables are expanded (e.g., ${DOCS_PUBLIC_URL}).
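The line rules for source files can be summarized in a short parser sketch. This is not Skipper's loader: `parse_source_file` is a hypothetical name, and expanding an unset variable to an empty string is an assumption.

```python
import re

_VAR = re.compile(r"\$\{(\w+)\}")

def parse_source_file(text, env):
    """Return (kind, url) pairs, where kind is 'sitemap', 'page', or 'render'."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # empty lines and comments are ignored
        # Expand ${VAR} references; unset variables become "" (assumption).
        line = _VAR.sub(lambda m: env.get(m.group(1), ""), line)
        if line.startswith("page:"):
            entries.append(("page", line[len("page:"):]))
        elif line.startswith("render:"):
            entries.append(("render", line[len("render:"):]))
        else:
            entries.append(("sitemap", line))
    return entries
```

Passing `os.environ` as `env` would expand variables from the process environment.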
Uploading Documents
The admin API accepts direct document uploads:
```bash
curl -X POST /api/skipper/admin/pages \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -d '{"url": "internal://runbook", "title": "Streaming Runbook", "content": "..."}'
```
Uploaded content is embedded and stored as tenant-specific knowledge. Supported file types for multipart upload: .txt, .md, .html, .csv, .json, .xml (max 10MB).
Web Search Providers
Web search is optional — Skipper works without it, relying on the knowledge base alone. When configured, Skipper falls back to web search if the knowledge base doesn’t have a good answer.
| Provider | Setup | Notes |
|---|---|---|
| Tavily | SEARCH_PROVIDER=tavily + API key | Returns clean extracted content. Best for RAG. |
| Brave Search | SEARCH_PROVIDER=brave + API key | Fast, privacy-focused. Returns snippets. |
| SearXNG | SEARCH_PROVIDER=searxng + SEARCH_API_URL | Self-hosted, no API key needed. |
Self-Hosting Dependencies
Skipper can run entirely on your own infrastructure with no external API calls. This section covers deploying the self-hostable components.
Ollama (LLM + Embeddings)
Ollama runs open-weight models locally. It provides both chat completion and embedding endpoints that Skipper uses directly.
Docker (recommended for production):
```yaml
ollama:
  image: ollama/ollama:latest
  ports:
    - "11434:11434"
  volumes:
    - ollama_data:/root/.ollama  # persist downloaded models
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia  # GPU passthrough (NVIDIA)
            count: all
            capabilities: [gpu]
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
    interval: 15s
    timeout: 5s
    retries: 3
    start_period: 30s
```
For AMD GPUs, use image: ollama/ollama:rocm. For CPU-only, omit the deploy.resources block — expect ~2-5 tok/s on a 32B model.
Pull models after starting Ollama:
```bash
# Chat model (pick one per tier — see Model Selection above)
docker exec ollama ollama pull qwen2.5:32b

# Embedding model
docker exec ollama ollama pull nomic-embed-text

# Utility model (optional, for contextual retrieval / query rewriting)
docker exec ollama ollama pull qwen2.5:7b
```
Skipper env vars for Ollama:
```bash
LLM_PROVIDER=ollama
LLM_MODEL=qwen2.5:32b
LLM_API_URL=http://ollama:11434/v1  # use container name in Docker networks

EMBEDDING_PROVIDER=ollama
EMBEDDING_MODEL=nomic-embed-text
EMBEDDING_API_URL=http://ollama:11434  # embeddings use /api/embeddings, not /v1
```
SearXNG (Web Search)
SearXNG is a self-hosted metasearch engine. No API key needed.
```bash
docker run -d --name searxng -p 8080:8080 searxng/searxng:latest
```
Skipper env vars:
```bash
SEARCH_PROVIDER=searxng
SEARCH_API_URL=http://searxng:8080
```
Self-Hosted Reranker
For fully local retrieval quality, run a cross-encoder model via Text Embeddings Inference (TEI) or a similar /v1/rerank-compatible server:
```bash
docker run -d --name reranker -p 8787:80 \
  ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id BAAI/bge-reranker-v2-m3
```
Skipper env vars:
```bash
RERANKER_PROVIDER=generic
RERANKER_MODEL=BAAI/bge-reranker-v2-m3
RERANKER_API_URL=http://reranker:8787
```
Putting It Together
A fully self-hosted stack with proper health checks and startup ordering:
```yaml
services:
  ollama:
    image: ollama/ollama:latest
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 15s
      timeout: 5s
      retries: 5
      start_period: 30s

  searxng:
    image: searxng/searxng:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/healthz"]
      interval: 10s
      timeout: 5s
      retries: 3

  skipper:
    # ... your existing Skipper config ...
    environment:
      LLM_PROVIDER: ollama
      LLM_MODEL: qwen2.5:32b
      LLM_API_URL: http://ollama:11434/v1
      EMBEDDING_PROVIDER: ollama
      EMBEDDING_MODEL: nomic-embed-text
      EMBEDDING_API_URL: http://ollama:11434
      SEARCH_PROVIDER: searxng
      SEARCH_API_URL: http://searxng:8080
    depends_on:
      ollama:
        condition: service_healthy
      searxng:
        condition: service_healthy

volumes:
  ollama_data:
```
Architecture
graph TD
Q["User Query"] --> PRE["Pre-retrieval<br/><small>auto · every message · fast path</small>"]
Q --> TOOL["search_knowledge<br/><small>explicit LLM tool call</small>"]
Q --> WEB["search_web<br/><small>explicit LLM tool call</small>"]
PRE --> EMB1["Embed Query"]
EMB1 --> HYB1["Hybrid Search"]
HYB1 --> RERANK1["Cross-Encoder Rerank"]
RERANK1 --> DEDUP1["Deduplicate"]
DEDUP1 --> LLM
TOOL --> REWRITE["Query Rewrite<br/><small>utility LLM</small>"]
REWRITE --> HYDE{"HyDE<br/>enabled?"}
HYDE -->|yes| HYPO["Generate Hypothetical Answer<br/><small>utility LLM → embed</small>"]
HYDE -->|no| EMB2["Embed Rewritten Query"]
HYPO --> HYB2["Hybrid Search"]
EMB2 --> HYB2
HYB2 --> RERANK2["Cross-Encoder Rerank"]
RERANK2 --> DEDUP2["Deduplicate"]
DEDUP2 --> LLM
WEB --> REWRITE2["Query Rewrite<br/><small>utility LLM</small>"]
REWRITE2 --> SEARCH["Web Search Provider"]
SEARCH --> LLM
LLM["LLM + MCP Tools"] --> CONF["Confidence Tagging"]
CONF --> RESP["Response with Citations"]
Skipper runs as the api_consultant service (ports 18018 HTTP, 19007 gRPC). It connects to the API Gateway via MCP to access platform tools (diagnostics, stream management, GraphQL). The Gateway proxies Skipper’s ask_consultant tool to external MCP agents; the tool runs the full orchestrator pipeline internally.
Retrieval and Reranking
Every user message triggers an automatic pre-retrieval pass that searches both tenant-specific and global knowledge. The LLM can also explicitly call search_knowledge for targeted lookups.
Search uses a hybrid approach: 70% cosine vector similarity + 30% PostgreSQL full-text ranking. Results are then reranked — when a cross-encoder is configured (RERANKER_PROVIDER), it scores (query, chunk) pairs together for semantic understanding; otherwise a keyword-overlap heuristic is used (0.7 × vector similarity + 0.3 × query term overlap). Results are deduplicated to a maximum of 2 chunks per source URL.
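The scoring and deduplication described above can be sketched as follows. The function names and the result shape (dicts with `url` and `score`) are assumptions, as is the premise that both input scores are normalized to the range 0 to 1. Skipper itself performs the search inside PostgreSQL.

```python
from collections import defaultdict

def hybrid_score(vector_similarity, text_rank, vector_weight=0.7):
    """Blend cosine similarity with full-text rank (70/30 by default)."""
    return vector_weight * vector_similarity + (1 - vector_weight) * text_rank

def dedupe_by_source(results, max_per_source=2):
    """Keep the best-scoring chunks, at most `max_per_source` per source URL."""
    kept = []
    counts = defaultdict(int)
    for result in sorted(results, key=lambda r: r["score"], reverse=True):
        if counts[result["url"]] < max_per_source:
            kept.append(result)
            counts[result["url"]] += 1
    return kept
```

Capping chunks per URL keeps one long page from crowding every other source out of the prompt.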
Query rewriting (requires utility LLM) transforms conversational questions into search-optimized queries before embedding. This bridges vocabulary gaps between how users phrase questions and how documentation is written. Applied to search_knowledge and search_web tool calls; skipped for pre-retrieval to keep latency low.
HyDE (SKIPPER_ENABLE_HYDE=true) generates a hypothetical answer via the utility LLM, then embeds that answer instead of the question for vector search. The resulting vector is closer in embedding space to real documentation. Adds ~500-1500ms latency per search_knowledge call.
Dependencies
| Dependency | Purpose |
|---|---|
| PostgreSQL + pgvector | Vector store, conversations, usage tracking |
| LLM provider | Chat completion (OpenAI, Anthropic, or Ollama) |
| Embedding provider | Document/query embedding (OpenAI or Ollama) |
| API Gateway (MCP) | Platform tools — diagnostics, stream CRUD, GraphQL |
| Periscope (gRPC) | Stream health metrics (via Gateway) |
| Commodore (gRPC) | Tenant and stream context |
Heartbeat Monitoring
Skipper periodically analyzes the health of active streams and infrastructure. The heartbeat agent runs every HEARTBEAT_INTERVAL (default 30 minutes) and processes each eligible tenant.
How It Works
For each tenant with active streams and a qualifying billing tier:
1. Snapshot — fetches stream health and client QoE metrics from Periscope for the last 15 minutes
2. Baseline comparison — compares current metrics against Welford running averages that the heartbeat has been building over time. Deviations are detected when a metric exceeds 2 standard deviations from the mean (requires at least 5 samples to avoid false positives during warmup).
3. Triage — a deterministic decision cascade (no LLM calls):
   - Hard threshold violation → investigate
   - Cross-metric correlation with ≥ 50% confidence → investigate
   - Baseline deviations → flag for review
   - Everything normal → skip
4. Per-stream drill-down — when something looks wrong, Skipper fetches per-stream metrics, compares each against the tenant-wide baseline, and identifies the most anomalous streams (up to 20).
5. Investigation — only when the triage result is “investigate”, Skipper runs the chat orchestrator with the full diagnostic context (deviations, correlations, per-stream anomalies, raw metrics). This is the only step that uses LLM tokens.
6. Notification — investigation reports and flag summaries are dispatched via email, WebSocket, or MCP.
Healthy tenants consume zero LLM calls per heartbeat cycle.
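The baseline comparison relies on Welford's online algorithm, which updates a mean and variance one sample at a time without storing history. The class below is a sketch under assumptions: the name `RunningBaseline` is hypothetical, and using the sample standard deviation (dividing by n - 1) is a choice this page does not specify.

```python
import math

class RunningBaseline:
    """Welford running mean and variance for one metric."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared differences from the mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def stddev(self):
        if self.count < 2:
            return 0.0
        return math.sqrt(self.m2 / (self.count - 1))

    def is_deviation(self, value, sigmas=2.0, min_samples=5):
        """True when value lies more than `sigmas` std devs from the mean."""
        if self.count < min_samples:
            return False  # still warming up
        return abs(value - self.mean) > sigmas * self.stddev()
```

Because only three numbers are kept per metric, a baseline for every tenant and metric stays cheap to store and update on each heartbeat.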
Cross-Metric Correlation
The diagnostics engine matches deviation patterns against 5 known failure hypotheses: network degradation, encoder overload, viewer-side issues, ingest instability, and CDN pressure. Each hypothesis has expected signal patterns (e.g., network degradation = packet_loss↑ + bandwidth_in↓ + buffer_health↓). Confidence is calculated as matched signals / total expected signals.
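The confidence formula can be illustrated with the network degradation pattern from the text. The data shapes here (dicts mapping a metric name to "up" or "down") and the function name are assumptions made for the sketch.

```python
def hypothesis_confidence(expected_signals, observed_deviations):
    """Confidence = matched signals / total expected signals."""
    if not expected_signals:
        return 0.0
    matched = sum(
        1
        for metric, direction in expected_signals.items()
        if observed_deviations.get(metric) == direction
    )
    return matched / len(expected_signals)

# Pattern taken from the example above.
NETWORK_DEGRADATION = {
    "packet_loss": "up",
    "bandwidth_in": "down",
    "buffer_health": "down",
}
```

With two of the three expected signals observed, confidence is 2/3, which clears the 50% bar that triage uses to investigate.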
Infrastructure Monitoring
Independently of per-tenant stream health, the heartbeat checks node-level metrics across all active clusters:
- CPU ≥ 95% and memory ≥ 95% require the violation to persist across 3 of 4 five-minute windows before alerting (prevents transient spikes)
- Disk ≥ 90% triggers a warning; ≥ 95% is critical (fires immediately since disk doesn’t self-heal)
- Alerts are emailed to the cluster owner with a 4-hour cooldown per node/alert type
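The alerting rules above can be expressed as two small checks. The function names are hypothetical, and returning False when fewer than four windows are available is an assumption about warmup behavior.

```python
def sustained_violation(window_values, threshold=95.0, required=3, windows=4):
    """CPU/memory rule: violated in at least `required` of the last `windows` windows."""
    recent = window_values[-windows:]
    if len(recent) < windows:
        return False  # assumption: wait for a full set of windows
    return sum(1 for value in recent if value >= threshold) >= required

def disk_severity(percent_used):
    """Disk rule: fires immediately, with no persistence requirement."""
    if percent_used >= 95:
        return "critical"
    if percent_used >= 90:
        return "warning"
    return None
```

Each value in `window_values` stands for one five-minute window, ordered oldest to newest.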
Configuration
Heartbeat:
| Variable | Purpose | Default |
|---|---|---|
| HEARTBEAT_INTERVAL | How often to run the heartbeat cycle | 30m |
| SKIPPER_REQUIRED_TIER_LEVEL | Minimum billing tier for heartbeat processing | 3 |
Notifications (email):
| Variable | Purpose | Default |
|---|---|---|
| SMTP_HOST | SMTP server hostname | — |
| SMTP_PORT | SMTP server port | 587 |
| SMTP_FROM | Sender email address | — |
| SMTP_USERNAME | SMTP authentication username | — |
| SMTP_PASSWORD | SMTP authentication password | — |
Social Posting
When enabled, Skipper drafts social media posts based on noteworthy platform events. Posts are sent as drafts to a configured email address for human review — nothing is auto-published.
How It Works
The social agent checks for noteworthy events every SKIPPER_SOCIAL_INTERVAL (default 2 hours):
1. Event collection — the heartbeat agent and knowledge crawler push signals into a collector as they run (platform stats, federation metrics, newly embedded pages)
2. Detection — the detector classifies signals, compares against stored baselines, and scores them:
   - Platform stats: new viewer record, bandwidth milestone (1/10/100/1000 Gbps), significant viewer surge (>25% growth)
   - Federation: latency improvement (>20% drop), event volume milestone
   - Knowledge: newly crawled and embedded documentation page
3. Composition — the utility LLM drafts a tweet (max 280 characters). It receives the last 10 posts to avoid repeating themes. If the draft exceeds 280 characters, it retries once, then truncates at the nearest word boundary.
4. Publishing — the draft is saved to the database and emailed to SKIPPER_SOCIAL_NOTIFY_EMAIL for review.
The first observation for each signal type saves a baseline instead of posting — subsequent signals are compared against it.
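The final truncation fallback in the composition step can be sketched like this. `truncate_at_word_boundary` is a hypothetical helper, and cutting back to the last space before the limit is one reasonable reading of "nearest word boundary".

```python
def truncate_at_word_boundary(text, limit=280):
    """Shorten text to at most `limit` characters without splitting a word."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # If the next character is whitespace, the cut already ends on a whole word.
    if not text[limit].isspace():
        space = cut.rfind(" ")
        if space > 0:
            cut = cut[:space]  # drop the partial trailing word
    return cut.rstrip()
```

A single word longer than the limit has no boundary to fall back to, so the sketch hard-cuts it at `limit` characters.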
Configuration
| Variable | Purpose | Default |
|---|---|---|
| SKIPPER_SOCIAL_ENABLED | Enable the social posting agent | false |
| SKIPPER_SOCIAL_INTERVAL | How often to check for noteworthy events | 2h |
| SKIPPER_SOCIAL_MAX_PER_DAY | Max posts per day (0 = unlimited) | 2 |
| SKIPPER_SOCIAL_NOTIFY_EMAIL | Email to send draft tweets to (required when enabled) | — |