Mesh Networking
FrameWorks backend hosts use a WireGuard mesh managed by Privateer. Operators
generate GitOps-owned identity, then the CLI renders host config during
provisioning. The manifest, hosts.enc.yaml, and Privateer runtime own this
layer; hand-written wg0.conf files are not the supported operating path.
Why this exists
Kafka quorum, Yugabyte master traffic, and every other internal RPC ride the WireGuard mesh managed by Privateer. That creates a bootstrap cycle: Quartermaster stores peer state but lives on top of the DB, which itself speaks over the mesh. The cluster can’t start if the mesh needs Quartermaster and Quartermaster needs the mesh.
We break the cycle by letting GitOps carry just enough mesh state to bring
wg0 up before Quartermaster exists. Once Quartermaster comes up over the
mesh, Privateer overlays managed peers, DNS, and PKI transparently.
Three layers
There is no “static mode” or “dynamic mode” to flip. Identity and seed are
rendered during provisioning. The managed layer appears after the first
successful SyncMesh and is then cached on disk.
| Layer | Owner | Lives in | Contains |
|---|---|---|---|
| Identity | GitOps (mesh wg generate) | cluster.yaml + SOPS hosts.enc.yaml | per-host wireguard_ip, wireguard_public_key, wireguard_port, wireguard_private_key |
| Seed peers + DNS | GitOps (rendered by Ansible) | /etc/privateer/static-peers.json | same-cluster mesh peers plus required cross-cluster service, infra, and federation peers |
| Managed peers + DNS + PKI | Quartermaster | live via SyncMesh gRPC; cached in /var/lib/privateer/last_known_mesh.json after the first successful sync | dependency-derived topology |
Topology model
Privateer does not build a global all-to-all service network. Provisioning and runtime sync derive mesh reachability from three graphs:
| Graph | Source of truth | Examples |
|---|---|---|
| Direct service calls | pkg/topology/dependencies.go, backed by actual client construction in service binaries | Bridge → Quartermaster/Commodore/Purser/Periscope/Signalman/Decklog, Purser → Periscope for invoice enrichment, Skipper ↔ Bridge MCP |
| Infrastructure calls | rendered manifest endpoints | Postgres/Yugabyte, Kafka brokers, ClickHouse, Redis/Sentinel |
| Discovery and federation | Quartermaster service registry and tenant cluster peer metadata | Foghorn peer-to-peer federation, runtime service DNS, service bootstrap |
.internal DNS is for dependencies of services running on the local node. A
regional node can resolve a central singleton such as quartermaster.internal
because a local service declares that dependency and there is one provider
context. Sibling media clusters do not become visible just because they exist.
Some direct service calls are intentionally global. Skipper receives an ordered
GATEWAY_MCP_URLS list of concrete Bridge mesh host endpoints so the single
central Skipper prefers the Bridge in its own region, then fails over through
the remaining Bridge hosts deterministically. Regional Bridges keep
skipper.internal as the reciprocal spoke endpoint, so every Gateway MCP hub
can still expose Skipper’s tools.
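
The ordering rule for GATEWAY_MCP_URLS can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `BridgeHost` record, the region names, the URLs, and the comma separator are hypothetical, and the real list is rendered by provisioning rather than by code like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BridgeHost:
    """Hypothetical record for one Bridge instance; field names are illustrative."""
    name: str
    region: str
    mesh_url: str


def order_gateway_mcp_urls(bridges: list[BridgeHost], skipper_region: str) -> list[str]:
    """Return Bridge URLs with the Skipper-local region first.

    The sort key is (is_remote, region, name), so same-region hosts come first
    and the remaining hosts follow in a stable, reproducible order.
    """
    ranked = sorted(bridges, key=lambda b: (b.region != skipper_region, b.region, b.name))
    return [b.mesh_url for b in ranked]


# Illustrative data only; these hosts and URLs are not from a real manifest.
bridges = [
    BridgeHost("bridge-us-1", "us-east", "http://10.88.2.10:18000/mcp"),
    BridgeHost("bridge-eu-1", "eu-west", "http://10.88.1.10:18000/mcp"),
    BridgeHost("bridge-ap-1", "ap-south", "http://10.88.3.10:18000/mcp"),
]
print(",".join(order_gateway_mcp_urls(bridges, skipper_region="eu-west")))
```

Because the key does not depend on input order, every render produces the same failover sequence for the same set of Bridge hosts.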
Foghorn federation still uses concrete peer Foghorn addresses from
Quartermaster’s tenant cluster context rather than a global foghorn.internal
alias.
Runtime-enrolled nodes use the same graph. When an infrastructure_node
bootstrap token carries desired_service_types and desired_cluster_ids
metadata, Quartermaster persists that onto the node and uses it for seed peers
and SyncMesh before the services have produced their first service_instances
or service_cluster_assignments rows. GitOps and adopted-local nodes normally
get the same intent from the rendered manifest and service registry.
Kafka and other infrastructure are part of the same dependency policy even
though they are not FrameWorks application services. Bootstrap registers
Postgres/Yugabyte, Kafka brokers, ClickHouse, and Redis/Sentinel endpoints as
peer-only topology providers in Quartermaster. If a service uses
KAFKA_BROKERS, Privateer seeds and managed sync include the selected broker
hosts without creating a kafka.internal service DNS alias. This is why a host
can have no cross-cluster service DNS but still require a cross-cluster mesh
peer for aggregator Kafka or central Yugabyte.
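
A minimal sketch of the policy described above, assuming hypothetical dependency tables and host names: service dependencies contribute both a mesh peer and a .internal alias, while infrastructure dependencies contribute peers only. It deliberately ignores the single-provider-context rule and is not the logic in pkg/topology/dependencies.go.

```python
# Hypothetical inputs for illustration; not the real dependency table.
SERVICE_DEPS = {
    "bridge": ["quartermaster", "commodore"],
    "purser": ["periscope"],
}
# Infrastructure dependencies, e.g. implied by KAFKA_BROKERS or a database URL.
INFRA_DEPS = {
    "bridge": ["kafka"],
    "purser": ["yugabyte"],
}
# Which hosts provide each service or infrastructure component (illustrative).
PROVIDERS = {
    "quartermaster": ["core-1"],
    "commodore": ["core-1"],
    "periscope": ["core-2"],
    "kafka": ["kafka-1", "kafka-2"],
    "yugabyte": ["db-1"],
}


def derive_mesh_view(local_services, local_host):
    """Return (peer_hosts, dns_aliases) for the services running on one node."""
    peers, aliases = set(), set()
    for svc in local_services:
        for dep in SERVICE_DEPS.get(svc, []):
            aliases.add(f"{dep}.internal")  # service deps get a DNS alias
            peers.update(PROVIDERS.get(dep, []))
        for dep in INFRA_DEPS.get(svc, []):
            peers.update(PROVIDERS.get(dep, []))  # infra deps are peer-only
    peers.discard(local_host)  # a node is never its own peer
    return sorted(peers), sorted(aliases)


peers, aliases = derive_mesh_view(["purser"], local_host="edge-1")
print(peers)    # ['core-2', 'db-1']
print(aliases)  # ['periscope.internal']
```

In this example the node gains a peer for the Yugabyte host without any matching DNS alias, which mirrors the peer-only behaviour of infrastructure providers.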
Node-origin vocabulary
Every row in Quartermaster’s infrastructure_nodes carries an
enrollment_origin that says how the row came to be and who owns the
node’s WireGuard private key.
| Value | Cold-boot capable | Private key lives in | How to create |
|---|---|---|---|
| gitops_seed | yes | SOPS (hosts.enc.yaml) | declared in cluster.yaml and generated by frameworks mesh wg generate |
| runtime_enrolled | no | node’s local disk | frameworks mesh join --token ... --bootstrap-addr ...; Privateer generates the keypair locally and enrolls via Bridge |
| adopted_local | yes | node’s local disk | frameworks mesh reconcile --write-gitops promotes a runtime_enrolled node into GitOps without exporting its key |
Promotion paths:
- `runtime_enrolled` → `adopted_local` via `frameworks mesh reconcile --write-gitops`. The node’s public identity lands in `cluster.yaml`; `hosts.enc.yaml` records `wireguard_private_key_managed: false` so Ansible preserves the on-disk key.
- `adopted_local` → `gitops_seed` is a three-step promotion, intentionally split so Quartermaster doesn’t claim GitOps authority before the running node has adopted the new key:
  1. `frameworks mesh wg rotate <host>`: writes a fresh SOPS-managed private key and strips the preserve-key markers from `hosts.enc.yaml`. Does not touch Quartermaster.
  2. `frameworks cluster provision`: Ansible renders the new `/etc/privateer/wg.key` and Privateer restarts with it. The node’s next `SyncMesh` propagates the new public key.
  3. `frameworks mesh wg promote <host>`: verifies that Quartermaster’s recorded public key now matches the manifest, then flips `enrollment_origin` to `gitops_seed`. If the key hasn’t converged yet, promote fails with a retry-after-SyncMesh message rather than claiming stale authority.
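
The guard in the final promote step can be sketched as follows. This is an illustration under stated assumptions, not the CLI's implementation: the dict stands in for a Quartermaster infrastructure_nodes row, the `KeyNotConverged` exception is hypothetical, and rejecting origins other than adopted_local is a choice made for this sketch.

```python
class KeyNotConverged(Exception):
    """Raised when Quartermaster has not yet recorded the rotated public key."""


def promote(node: dict, manifest_public_key: str) -> dict:
    """Flip adopted_local -> gitops_seed only if the recorded key matches the manifest.

    `node` is a stand-in for an infrastructure_nodes row; its shape is assumed.
    """
    if node["enrollment_origin"] != "adopted_local":
        raise ValueError(f"cannot promote from {node['enrollment_origin']}")
    if node["wireguard_public_key"] != manifest_public_key:
        # Refuse to claim GitOps authority over a key the node is not yet using.
        raise KeyNotConverged("public key has not converged; retry after the next SyncMesh")
    return {**node, "enrollment_origin": "gitops_seed"}


row = {"enrollment_origin": "adopted_local", "wireguard_public_key": "NEW"}
print(promote(row, "NEW")["enrollment_origin"])  # gitops_seed
```

Returning a new row instead of mutating the input keeps the failure path side-effect free, which matches the intent that a failed promote changes nothing.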
Cold-boot-critical nodes (DB, Quartermaster itself, Kafka quorum) must be
gitops_seed or adopted_local. runtime_enrolled is for convenience
replicas and stateless helpers until promoted — do not rely on one for
quorum.
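
As a rough illustration of that rule, the check below flags cold-boot-critical nodes whose origin cannot cold boot. The `cold_boot_critical` flag and the node names are hypothetical; nothing in Quartermaster's schema is implied.

```python
COLD_BOOT_CAPABLE = {"gitops_seed", "adopted_local"}


def cold_boot_violations(nodes: dict[str, dict]) -> list[str]:
    """Return names of cold-boot-critical nodes whose origin cannot cold boot."""
    return sorted(
        name
        for name, node in nodes.items()
        if node["cold_boot_critical"] and node["enrollment_origin"] not in COLD_BOOT_CAPABLE
    )


# Illustrative inventory; `cold_boot_critical` is an assumed field.
nodes = {
    "db-1": {"cold_boot_critical": True, "enrollment_origin": "gitops_seed"},
    "kafka-3": {"cold_boot_critical": True, "enrollment_origin": "runtime_enrolled"},
    "helper-1": {"cold_boot_critical": False, "enrollment_origin": "runtime_enrolled"},
}
print(cold_boot_violations(nodes))  # ['kafka-3']
```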
The runtime-layer source field on /var/lib/privateer/last_known_mesh.json
("seed" or "dynamic") is independent and describes how the last sync was
applied — it does not persist across Quartermaster state.
Mental model: seed boots, managed runs, enrollment grows, reconcile promotes.
Precedence
Per sync tick inside Privateer:

1. `SyncMesh` against Quartermaster succeeded this tick → apply the response. Metric `privateer_layer_applied{layer="managed"}=1`.
2. Agent just started and `last_known_mesh.json` is present → apply it. Metric `privateer_layer_applied{layer="last_known"}=1`.
3. Otherwise apply `static-peers.json`. Metric `privateer_layer_applied{layer="seed"}=1`.

Identity (own mesh IP, listen port, private key) is never read from the snapshot — only from disk. A rotated key in GitOps always propagates on reboot; a stale cache cannot resurrect the old self-address.
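
The three precedence rules can be read as a single selection function. The sketch below is a literal reading of those rules with hypothetical argument names; how a running agent treats a failed tick after startup is not modelled, and identity is deliberately absent because it never comes from these layers.

```python
from typing import Optional


def select_layer(sync_response: Optional[dict], just_started: bool,
                 last_known: Optional[dict], seed: dict) -> tuple[str, dict]:
    """Pick the mesh layer to apply for one tick, in the documented order."""
    if sync_response is not None:
        return "managed", sync_response       # rule 1: live SyncMesh wins
    if just_started and last_known is not None:
        return "last_known", last_known       # rule 2: cached snapshot at startup
    return "seed", seed                       # rule 3: GitOps-rendered seed


layer, peers = select_layer(None, True, {"peers": ["core-2"]}, {"peers": ["core-1"]})
print(layer)  # last_known
```

The returned label corresponds to the `layer` value of `privateer_layer_applied`.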
Operator workflow
One pass, no maintenance windows:

    # 1. Generate identity + peer metadata for every host in the cluster
    frameworks mesh wg generate \
      --manifest gitops/clusters/<cluster>/cluster.yaml \
      --hosts-file gitops/clusters/<cluster>/hosts.enc.yaml

    # 2. Commit the gitops diff (public keys, mesh IPs, ports) and SOPS update
    git add gitops/clusters/<cluster>/{cluster.yaml,hosts.enc.yaml}
    git commit -m "mesh: regenerate WG identity for <cluster>"
    git push

    # 3. Validate identity in CI or before a prod run
    frameworks mesh wg check --manifest gitops/clusters/<cluster>/cluster.yaml

    # 4. Provision. This is read-only against GitOps.
    frameworks cluster provision --gitops-dir gitops --cluster <cluster>

- `mesh wg generate` is deterministic and idempotent: re-running it on an already-populated manifest is a no-op.
- `mesh wg rotate <host>` regenerates one host’s key without touching the rest. Pass `--readdress` only when the host’s mesh IP must change.
- `cluster provision` never writes `cluster.yaml` or `hosts.enc.yaml`. If identity is incomplete, it fails with the exact `mesh wg generate` command to run before retrying.
- Public fields land in plaintext `cluster.yaml` via `yaml.v3` node edits (comments and order preserved).
- Private keys land in the SOPS-encrypted `hosts.enc.yaml` via the canonical decrypt → edit → re-encrypt → replace flow.
cluster.yaml shape

    wireguard:
      enabled: true
      mesh_cidr: 10.88.0.0/16
      listen_port: 51820

    hosts:
      core-1:
        external_ip: 203.0.113.10
        wireguard_ip: 10.88.0.2
        wireguard_public_key: <44-char base64>
        wireguard_port: 51820
        roles: [control]

There is no bootstrap_mode field. Anything that references one is stale
documentation.
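
To make the shape concrete, here is a hedged sketch of the kind of checks an identity validator could run over already-parsed manifest data. It is not the logic of `frameworks mesh wg check`; the function name and error wording are invented for illustration. The only external facts relied on are that a WireGuard key is 32 bytes, which encodes to 44 base64 characters, and that mesh IPs should fall inside mesh_cidr.

```python
import base64
import binascii
import ipaddress


def identity_errors(wireguard: dict, hosts: dict) -> list[str]:
    """Return human-readable problems with the per-host WireGuard identity."""
    errors = []
    mesh = ipaddress.ip_network(wireguard["mesh_cidr"])
    seen_ips = {}
    for name, host in hosts.items():
        for field in ("wireguard_ip", "wireguard_public_key", "wireguard_port"):
            if field not in host:
                errors.append(f"{name}: missing {field}")
        ip = host.get("wireguard_ip")
        if ip is not None:
            if ipaddress.ip_address(ip) not in mesh:
                errors.append(f"{name}: {ip} is outside {mesh}")
            if ip in seen_ips:
                errors.append(f"{name}: {ip} already used by {seen_ips[ip]}")
            seen_ips.setdefault(ip, name)
        key = host.get("wireguard_public_key")
        if key is not None:
            try:
                raw = base64.b64decode(key, validate=True)
            except (binascii.Error, ValueError):
                raw = b""  # treat undecodable input as an invalid key
            if len(raw) != 32:
                errors.append(f"{name}: public key is not 32 bytes of base64")
    return errors


# A syntactically valid placeholder key (32 zero bytes), for demonstration only.
placeholder_key = base64.b64encode(bytes(32)).decode()
hosts = {"core-1": {"wireguard_ip": "10.88.0.2",
                    "wireguard_public_key": placeholder_key,
                    "wireguard_port": 51820}}
print(identity_errors({"mesh_cidr": "10.88.0.0/16"}, hosts))  # []
```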
hosts.enc.yaml shape (SOPS-encrypted)

    hosts:
      core-1:
        external_ip: 203.0.113.10
        user: deploy
        wireguard_private_key: <44-char base64>

On-disk layout (per host, after provisioning)
| Path | Source | Purpose |
|---|---|---|
| /etc/privateer/wg.key | Ansible render of SOPS private key | WG private key (0600 privateer:privateer) |
| /etc/privateer/privateer.env | Ansible render | SERVICE_TOKEN, QUARTERMASTER_GRPC_ADDR, MESH_WIREGUARD_IP, etc. |
| /etc/privateer/static-peers.json | Ansible render | seed peer set and seed DNS aliases, sha256-versioned |
| /var/lib/privateer/last_known_mesh.json | Privateer | last successful SyncMesh snapshot; absent until the first managed sync |
| /etc/wireguard/wg0.conf | Privateer | composed at runtime from the layers above |
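
The 0600 mode listed for the private key is easy to verify from a script. The helper below is a hypothetical check, not part of Privateer; it looks only at the permission bits and leaves the privateer:privateer ownership unchecked. The demonstration runs against a temporary file rather than the real key path.

```python
import os
import stat
import tempfile


def key_mode_ok(path: str) -> bool:
    """True if `path` is a regular file whose permission bits are exactly 0600."""
    st = os.stat(path)
    return stat.S_ISREG(st.st_mode) and stat.S_IMODE(st.st_mode) == 0o600


# Demonstrate on a throwaway file; on a host the argument would be the key path.
with tempfile.NamedTemporaryFile() as f:
    os.chmod(f.name, 0o600)
    print(key_mode_ok(f.name))  # True
    os.chmod(f.name, 0o644)
    print(key_mode_ok(f.name))  # False
```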
Runtime
1. Systemd starts `frameworks-privateer`.
2. Agent reads identity: `MESH_WIREGUARD_IP`, `MESH_LISTEN_PORT`, `/etc/privateer/wg.key`. These are authoritative for self-address.
3. Agent calls `applyStartupMesh` (precedence rules above). Seed DNS includes host aliases and service aliases such as `quartermaster.internal`.
4. `runLoop` ticks `SyncMesh` against Quartermaster forever. While QM is unreachable the metric stays on `last_known` or `seed`.
5. First successful `SyncMesh` flips the metric to `managed` and writes a fresh `last_known_mesh.json` with `source=dynamic` and a server-computed `mesh_revision`.
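
The loop behaviour in the last two steps can be replayed in a few lines. This is a simulation under assumptions, not Privateer's `runLoop`: the response dict, its `mesh_revision` value, and the snapshot layout are placeholders, and leaving the layer unchanged after a later failed tick is a simplification the text does not spell out.

```python
import json
import tempfile
from pathlib import Path


def run_ticks(sync_results, snapshot_path: Path, startup_layer: str) -> list[str]:
    """Replay a sequence of SyncMesh outcomes and return the live layer per tick.

    Each element of sync_results is None (Quartermaster unreachable) or a dict
    standing in for a SyncMesh response.
    """
    layer = startup_layer  # "last_known" or "seed", chosen at startup
    history = []
    for result in sync_results:
        if result is not None:
            layer = "managed"
            # Cache the response so the next start can use it as last_known.
            snapshot_path.write_text(json.dumps({"source": "dynamic", **result}))
        history.append(layer)
    return history


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "last_known_mesh.json"
    outcomes = [None, None, {"mesh_revision": "r1", "peers": []}]
    print(run_ticks(outcomes, path, startup_layer="seed"))  # ['seed', 'seed', 'managed']
    print(json.loads(path.read_text())["source"])           # dynamic
```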
Diagnostics
- `wg0` has no peers: `cat /etc/privateer/static-peers.json`; `journalctl -u frameworks-privateer -f`
- `.internal` does not resolve: `systemctl status systemd-resolved`; `getent hosts quartermaster.internal`
- Stuck on seed after QM is up: `grep QUARTERMASTER /etc/privateer/privateer.env`; `ss -tnlp | grep 19002` on the QM host
- Stale peer IP after rotation: `cat /var/lib/privateer/last_known_mesh.json`; restart privateer on the affected host
- Which layer is live right now: `curl -s localhost:18012/metrics | grep privateer_layer_applied`
- GitOps identity drifted from Quartermaster: `frameworks mesh wg audit --cluster-filter <cluster_id>` reports per-host differences and exits non-zero on any `gitops_seed` / `adopted_local` mismatch
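
For scripted checks, the "which layer is live" lookup can be done by parsing the metrics text instead of reading grep output by eye. The sketch assumes the endpoint emits standard Prometheus text exposition lines with a single `layer` label and no timestamp; the sample payload is invented for the demonstration.

```python
import re

# Matches lines such as: privateer_layer_applied{layer="managed"} 1
_LAYER_RE = re.compile(r'^privateer_layer_applied\{layer="([^"]+)"\}\s+(\S+)\s*$')


def live_layers(metrics_text: str) -> list[str]:
    """Return the layer labels whose privateer_layer_applied sample equals 1."""
    layers = []
    for line in metrics_text.splitlines():
        match = _LAYER_RE.match(line.strip())
        if match and float(match.group(2)) == 1.0:
            layers.append(match.group(1))
    return layers


# Invented sample; a real payload would come from the /metrics endpoint.
sample = """\
# TYPE privateer_layer_applied gauge
privateer_layer_applied{layer="managed"} 0
privateer_layer_applied{layer="last_known"} 1
privateer_layer_applied{layer="seed"} 0
"""
print(live_layers(sample))  # ['last_known']
```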
Key rotation

    frameworks mesh wg rotate core-2
    git commit -am "mesh: rotate wg key for core-2"
    git push
    frameworks cluster provision --gitops-dir gitops --cluster <cluster>

Peers converge across the cluster within two sync intervals: seed is
re-rendered first, Quartermaster picks up the new public key from the next
provision’s self-seed, and every other node’s next SyncMesh returns the
new peer entry.