
Mesh Networking

FrameWorks backend hosts use a WireGuard mesh managed by Privateer. Operators generate GitOps-owned identity, then the CLI renders host config during provisioning. The manifest, hosts.enc.yaml, and Privateer runtime own this layer; hand-written wg0.conf files are not the supported operating path.

Kafka quorum, Yugabyte master traffic, and every other internal RPC ride the WireGuard mesh managed by Privateer. That creates a bootstrap cycle: Quartermaster stores peer state but lives on top of the DB, which itself speaks over the mesh. The cluster can’t start if the mesh needs Quartermaster and Quartermaster needs the mesh.

We break the cycle by letting GitOps carry just enough mesh state to bring wg0 up before Quartermaster exists. Once Quartermaster comes up over the mesh, Privateer overlays managed peers, DNS, and PKI transparently.

There is no “static mode” or “dynamic mode” to flip. Identity and seed are rendered during provisioning. The managed layer appears after the first successful SyncMesh and is then cached on disk.

| Layer | Owner | Lives in | Contains |
| --- | --- | --- | --- |
| Identity | GitOps (mesh wg generate) | cluster.yaml + SOPS hosts.enc.yaml | per-host wireguard_ip, wireguard_public_key, wireguard_port, wireguard_private_key |
| Seed peers + DNS | GitOps (rendered by Ansible) | /etc/privateer/static-peers.json | same-cluster mesh peers plus required cross-cluster service, infra, and federation peers |
| Managed peers + DNS + PKI | Quartermaster | live via SyncMesh gRPC; cached in /var/lib/privateer/last_known_mesh.json after the first successful sync | dependency-derived topology |

Privateer does not build a global all-to-all service network. Provisioning and runtime sync derive mesh reachability from three graphs:

| Graph | Source of truth | Examples |
| --- | --- | --- |
| Direct service calls | pkg/topology/dependencies.go, backed by actual client construction in service binaries | Bridge → Quartermaster/Commodore/Purser/Periscope/Signalman/Decklog, Purser → Periscope for invoice enrichment, Skipper ↔ Bridge MCP |
| Infrastructure calls | rendered manifest endpoints | Postgres/Yugabyte, Kafka brokers, ClickHouse, Redis/Sentinel |
| Discovery and federation | Quartermaster service registry and tenant cluster peer metadata | Foghorn peer-to-peer federation, runtime service DNS, service bootstrap |

.internal DNS is for dependencies of services running on the local node. A regional node can resolve a central singleton such as quartermaster.internal because a local service declares that dependency and there is one provider context. Sibling media clusters do not become visible just because they exist.

Some direct service calls are intentionally global. Skipper receives an ordered GATEWAY_MCP_URLS list of concrete Bridge mesh host endpoints so the single central Skipper prefers the Bridge in its own region, then fails over through the remaining Bridge hosts deterministically. Regional Bridges keep skipper.internal as the reciprocal spoke endpoint, so every Gateway MCP hub can still expose Skipper’s tools. Foghorn federation still uses concrete peer Foghorn addresses from Quartermaster’s tenant cluster context rather than a global foghorn.internal alias.

Runtime-enrolled nodes use the same graph. When an infrastructure_node bootstrap token carries desired_service_types and desired_cluster_ids metadata, Quartermaster persists that onto the node and uses it for seed peers and SyncMesh before the services have produced their first service_instances or service_cluster_assignments rows. GitOps and adopted-local nodes normally get the same intent from the rendered manifest and service registry.

Kafka and other infrastructure are part of the same dependency policy even though they are not FrameWorks application services. Bootstrap registers Postgres/Yugabyte, Kafka brokers, ClickHouse, and Redis/Sentinel endpoints as peer-only topology providers in Quartermaster. If a service uses KAFKA_BROKERS, Privateer seeds and managed sync include the selected broker hosts without creating a kafka.internal service DNS alias. This is why a host can have no cross-cluster service DNS but still require a cross-cluster mesh peer for aggregator Kafka or central Yugabyte.
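The dependency policy above can be sketched in a few lines of Go. This is a minimal illustration, not the code in pkg/topology: the Provider type, the derive function, the host names, and the ingest-worker service are all hypothetical. It shows the two outcomes the text describes: every provider host becomes a mesh peer, but only non-infrastructure providers get a .internal alias.

```go
package main

import (
	"fmt"
	"sort"
)

// Provider is a hypothetical model of something a local service depends on.
type Provider struct {
	Name     string   // e.g. "quartermaster" or "kafka"
	Hosts    []string // mesh hosts that run the provider
	PeerOnly bool     // infrastructure: mesh peer only, no .internal alias
}

// derive returns the mesh peers and .internal aliases a node needs,
// based only on the services that run locally on that node.
func derive(local []string, deps map[string][]Provider) (peers, aliases []string) {
	peerSet := map[string]bool{}
	aliasSet := map[string]bool{}
	for _, svc := range local {
		for _, p := range deps[svc] {
			for _, h := range p.Hosts {
				peerSet[h] = true
			}
			if !p.PeerOnly {
				aliasSet[p.Name+".internal"] = true
			}
		}
	}
	return sortedKeys(peerSet), sortedKeys(aliasSet)
}

// sortedKeys returns the keys of m in a stable order.
func sortedKeys(m map[string]bool) []string {
	out := make([]string, 0, len(m))
	for k := range m {
		out = append(out, k)
	}
	sort.Strings(out)
	return out
}

func main() {
	deps := map[string][]Provider{
		"bridge": {{Name: "quartermaster", Hosts: []string{"core-1"}}},
		// Hypothetical Kafka consumer, used only to show a peer-only provider.
		"ingest-worker": {{Name: "kafka", Hosts: []string{"kafka-1", "kafka-2"}, PeerOnly: true}},
	}
	peers, aliases := derive([]string{"bridge", "ingest-worker"}, deps)
	fmt.Println(peers)   // [core-1 kafka-1 kafka-2]
	fmt.Println(aliases) // [quartermaster.internal]
}
```

A node that runs neither service gets neither the peers nor the alias, which matches the rule that sibling clusters are not visible merely because they exist.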

Every row in Quartermaster’s infrastructure_nodes carries an enrollment_origin that says how the row came to be and who owns the node’s WireGuard private key.

| Value | Cold-boot capable | Private key lives in | How to create |
| --- | --- | --- | --- |
| gitops_seed | yes | SOPS (hosts.enc.yaml) | declared in cluster.yaml and generated by frameworks mesh wg generate |
| runtime_enrolled | no | node’s local disk | frameworks mesh join --token ... --bootstrap-addr ... (Privateer generates the keypair locally and enrolls via Bridge) |
| adopted_local | yes | node’s local disk | frameworks mesh reconcile --write-gitops promotes a runtime_enrolled node into GitOps without exporting its key |

Promotion paths:

  • runtime_enrolled → adopted_local via frameworks mesh reconcile --write-gitops. The node’s public identity lands in cluster.yaml; hosts.enc.yaml records wireguard_private_key_managed: false so Ansible preserves the on-disk key.
  • adopted_local → gitops_seed is a three-step promotion, intentionally split so Quartermaster doesn’t claim GitOps authority before the running node has adopted the new key:
    1. frameworks mesh wg rotate <host> — writes a fresh SOPS-managed private key and strips the preserve-key markers from hosts.enc.yaml. Does not touch Quartermaster.
    2. frameworks cluster provision — Ansible renders the new /etc/privateer/wg.key and Privateer restarts with it. The node’s next SyncMesh propagates the new public key.
    3. frameworks mesh wg promote <host> — verifies that Quartermaster’s recorded public key now matches the manifest, then flips enrollment_origin to gitops_seed. If the key hasn’t converged yet, promote fails with a retry-after-SyncMesh message rather than claiming stale authority.

Cold-boot-critical nodes (DB, Quartermaster itself, Kafka quorum) must be gitops_seed or adopted_local. runtime_enrolled is for convenience replicas and stateless helpers until promoted — do not rely on one for quorum.
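A pre-flight check for this rule might look like the sketch below. The function names and the host list are hypothetical; only the three enrollment_origin values and their cold-boot capability come from the table above.

```go
package main

import (
	"fmt"
	"sort"
)

// coldBootCapable reports whether a node with this enrollment_origin
// can bring its mesh up before Quartermaster is reachable.
func coldBootCapable(origin string) bool {
	switch origin {
	case "gitops_seed", "adopted_local":
		return true
	default: // runtime_enrolled and anything unknown
		return false
	}
}

// unsafeCritical returns the critical hosts that are not cold-boot capable.
func unsafeCritical(origins map[string]string, critical []string) []string {
	var bad []string
	for _, host := range critical {
		if !coldBootCapable(origins[host]) {
			bad = append(bad, host)
		}
	}
	sort.Strings(bad)
	return bad
}

func main() {
	origins := map[string]string{ // hypothetical hosts
		"db-1":    "gitops_seed",
		"kafka-1": "adopted_local",
		"kafka-2": "runtime_enrolled",
	}
	fmt.Println(unsafeCritical(origins, []string{"db-1", "kafka-1", "kafka-2"})) // [kafka-2]
}
```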

The runtime-layer source field in /var/lib/privateer/last_known_mesh.json ("seed" or "dynamic") is independent of enrollment_origin. It describes how the last sync was applied on that node and is not part of Quartermaster state.

Mental model: seed boots, managed runs, enrollment grows, reconcile promotes.

Per sync tick inside Privateer:

  1. SyncMesh against Quartermaster succeeded this tick → apply the response. Metric privateer_layer_applied{layer="managed"}=1.
  2. Agent just started and last_known_mesh.json is present → apply it. Metric privateer_layer_applied{layer="last_known"}=1.
  3. Otherwise apply static-peers.json. Metric privateer_layer_applied{layer="seed"}=1.

Identity (own mesh IP, listen port, private key) is never read from the snapshot — only from disk. A rotated key in GitOps always propagates on reboot; a stale cache cannot resurrect the old self-address.
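The precedence rules can be written as a single switch. The sketch below is a literal reading of the three rules, with a hypothetical TickState type and chooseLayer function; the text does not spell out what a failed tick long after startup does with an already applied managed layer, so the sketch does not try to model that case beyond rule 3.

```go
package main

import "fmt"

// TickState is a hypothetical summary of what Privateer knows on one tick.
type TickState struct {
	SyncSucceeded   bool // SyncMesh against Quartermaster succeeded this tick
	JustStarted     bool // the agent has just started
	SnapshotPresent bool // last_known_mesh.json exists on disk
}

// chooseLayer returns the value of the layer label on privateer_layer_applied.
// Identity is deliberately absent: it always comes from disk, never from a layer.
func chooseLayer(s TickState) string {
	switch {
	case s.SyncSucceeded:
		return "managed"
	case s.JustStarted && s.SnapshotPresent:
		return "last_known"
	default:
		return "seed"
	}
}

func main() {
	fmt.Println(chooseLayer(TickState{SyncSucceeded: true}))                      // managed
	fmt.Println(chooseLayer(TickState{JustStarted: true, SnapshotPresent: true})) // last_known
	fmt.Println(chooseLayer(TickState{JustStarted: true}))                        // seed
}
```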

One pass, no maintenance windows:

# 1. Generate identity + peer metadata for every host in the cluster
frameworks mesh wg generate \
  --manifest gitops/clusters/<cluster>/cluster.yaml \
  --hosts-file gitops/clusters/<cluster>/hosts.enc.yaml

# 2. Commit the gitops diff (public keys, mesh IPs, ports) and SOPS update
git add gitops/clusters/<cluster>/{cluster.yaml,hosts.enc.yaml}
git commit -m "mesh: regenerate WG identity for <cluster>"
git push

# 3. Validate identity in CI or before a prod run
frameworks mesh wg check --manifest gitops/clusters/<cluster>/cluster.yaml

# 4. Provision. This is read-only against GitOps.
frameworks cluster provision --gitops-dir gitops --cluster <cluster>

  • mesh wg generate is deterministic and idempotent: re-running it on an already-populated manifest is a no-op. mesh wg rotate <host> regenerates one host’s key without touching the rest. Pass --readdress only when the host’s mesh IP must change.
  • cluster provision never writes cluster.yaml or hosts.enc.yaml. If identity is incomplete, it fails with the exact mesh wg generate command to run before retrying.
  • Public fields land in plaintext cluster.yaml via yaml.v3 node edits (comments and order preserved).
  • Private keys land in the SOPS-encrypted hosts.enc.yaml via the canonical decrypt → edit → re-encrypt → replace flow.

cluster.yaml (plaintext, public fields only):

wireguard:
  enabled: true
  mesh_cidr: 10.88.0.0/16
  listen_port: 51820
hosts:
  core-1:
    external_ip: 203.0.113.10
    wireguard_ip: 10.88.0.2
    wireguard_public_key: <44-char base64>
    wireguard_port: 51820
    roles: [control]

There is no bootstrap_mode field. Anything that references one is stale documentation.

hosts.enc.yaml (SOPS-encrypted, shown decrypted):

hosts:
  core-1:
    external_ip: 203.0.113.10
    user: deploy
    wireguard_private_key: <44-char base64>

On-disk layout (per host, after provisioning)

| Path | Source | Purpose |
| --- | --- | --- |
| /etc/privateer/wg.key | Ansible render of SOPS private key | WG private key (0600 privateer:privateer) |
| /etc/privateer/privateer.env | Ansible render | SERVICE_TOKEN, QUARTERMASTER_GRPC_ADDR, MESH_WIREGUARD_IP, etc. |
| /etc/privateer/static-peers.json | Ansible render | seed peer set and seed DNS aliases, sha256-versioned |
| /var/lib/privateer/last_known_mesh.json | Privateer | last successful SyncMesh snapshot; absent until the first managed sync |
| /etc/wireguard/wg0.conf | Privateer | composed at runtime from the layers above |

Boot sequence on each host:

  1. Systemd starts frameworks-privateer.
  2. Agent reads identity: MESH_WIREGUARD_IP, MESH_LISTEN_PORT, /etc/privateer/wg.key. These are authoritative for self-address.
  3. Agent calls applyStartupMesh (precedence rules above). Seed DNS includes host aliases and service aliases such as quartermaster.internal.
  4. runLoop ticks SyncMesh against Quartermaster forever. While QM is unreachable the metric stays on last_known or seed.
  5. First successful SyncMesh flips the metric to managed and writes a fresh last_known_mesh.json with source=dynamic and a server-computed mesh_revision.

Troubleshooting, by symptom:

  • wg0 has no peers: cat /etc/privateer/static-peers.json; journalctl -u frameworks-privateer -f
  • .internal does not resolve: systemctl status systemd-resolved; getent hosts quartermaster.internal
  • Stuck on seed after QM is up: grep QUARTERMASTER /etc/privateer/privateer.env; ss -tnlp | grep 19002 on the QM host
  • Stale peer IP after rotation: cat /var/lib/privateer/last_known_mesh.json; restart privateer on the affected host
  • Which layer is live right now: curl -s localhost:18012/metrics | grep privateer_layer_applied
  • GitOps identity drifted from Quartermaster: frameworks mesh wg audit --cluster-filter <cluster_id> (reports per-host differences and exits non-zero on any gitops_seed / adopted_local mismatch)
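
For scripted health checks, the "which layer is live" lookup can be done by parsing the metrics text instead of reading grep output by eye. The sketch below assumes the endpoint serves the standard Prometheus text format and that privateer_layer_applied carries only the layer label; liveLayer is a hypothetical helper, not part of Privateer.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// liveLayer scans Prometheus text output and returns the layer whose
// privateer_layer_applied gauge is 1. Assumes "layer" is the only label.
func liveLayer(metrics string) (string, bool) {
	const prefix = `privateer_layer_applied{layer="`
	sc := bufio.NewScanner(strings.NewReader(metrics))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, prefix) {
			continue // comments and unrelated metrics
		}
		name, after, ok := strings.Cut(strings.TrimPrefix(line, prefix), `"}`)
		if !ok {
			continue
		}
		fields := strings.Fields(after) // value, then an optional timestamp
		if len(fields) == 0 {
			continue
		}
		if v, err := strconv.ParseFloat(fields[0], 64); err == nil && v == 1 {
			return name, true
		}
	}
	return "", false
}

func main() {
	sample := "# TYPE privateer_layer_applied gauge\n" +
		"privateer_layer_applied{layer=\"seed\"} 0\n" +
		"privateer_layer_applied{layer=\"managed\"} 1\n"
	layer, ok := liveLayer(sample)
	fmt.Println(layer, ok) // managed true
}
```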

To rotate a single host’s WireGuard key:

frameworks mesh wg rotate core-2
git commit -am "mesh: rotate wg key for core-2"
git push
frameworks cluster provision --gitops-dir gitops --cluster <cluster>

Peers converge across the cluster within two sync intervals: seed is re-rendered first, Quartermaster picks up the new public key from the next provision’s self-seed, and every other node’s next SyncMesh returns the new peer entry.