Open source · bayesian-cage

A confidence gate
for MCP tool calls.

bayesian-cage is an open-source confidence gate for MCP tool calls. Every tool output comes back PROCEED, FLAG, or BLOCK — with a calibrated confidence you can reproduce.

In enforce mode a BLOCKed output is withheld and the host receives an MCP error instead, so a compliant client won't act on it; advisory mode, the default, labels the call and passes it through. Independent verification, outside the model's own loop. MIT licensed.

bayesian-cage / gated MCP call
tool: mcp:kb:lookup  ·  query: capital of Australia?
grounding_check  ·  PROCEED  ·  0.94
consistency_check  ·  FLAG  ·  0.52
tool said “Sydney”  ·  BLOCK  ·  0.26
⛔ BLOCK — contradiction: tool returned “Sydney”, sources say “Canberra”. Returned to host as an MCP error; the agent does not act.

Sits between the agent and its tools.

The cage is a drop-in MCP proxy. It fronts your tool servers, verifies every output before it reaches the agent, and refuses to pass on what it can't stand behind.

01

Front your MCP servers

Point Claude Desktop, Cursor, or any MCP host at the cage instead of directly at your tools. It spawns your real MCP server as a stdio subprocess and is transparent until a check fails.

pipx install bayesian-cage
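
As a rough sketch of what "pointing a host at the cage" could look like, the snippet below builds an `mcpServers` entry of the kind Claude Desktop reads from its config file. The `BAYESIAN_CAGE_DOWNSTREAM` variable is named in the roadmap; the executable name `bayesian-cage`, the format of the variable's value, and the downstream command `python -m my_kb_server` are assumptions made for illustration, so check the repo for the real invocation.

```python
import json


def cage_host_entry(downstream_command: str) -> dict:
    """Build one `mcpServers` entry that launches the cage in place of the tool server."""
    return {
        # ASSUMPTION: the executable name installed by pipx; not confirmed by the page.
        "command": "bayesian-cage",
        "args": [],
        "env": {
            # The variable name comes from the page; its value format is an assumption.
            "BAYESIAN_CAGE_DOWNSTREAM": downstream_command,
        },
    }


# Hypothetical downstream server, used only to make the example concrete.
config = {"mcpServers": {"kb": cage_host_entry("python -m my_kb_server")}}
print(json.dumps(config, indent=2))
```

The host still sees a single stdio MCP server named `kb`; only the process behind it changes.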
02

Every output is verified

Grounding and consistency checks score each tool result against evidence — not against the model's own say-so. The result is a calibrated confidence plus a per-tool belief state that compounds across calls.

verify → calibrate → gate → observe
03

PROCEED · FLAG · BLOCK

Above the commit threshold the call passes through. Thin evidence is flagged. In enforce mode a contradiction is BLOCKed and returned to the host as an MCP error, so a compliant client never acts on it; advisory mode (the default) labels and passes through.

BLOCK → isError: true
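
A minimal sketch of the two modes, assuming the cage answers with an MCP tool result (a `content` list plus an optional `isError` flag, as in the MCP tools specification). The function name `to_host` and the `cage_decision` key under `_meta` are hypothetical; the page does not say how advisory labels are attached.

```python
from typing import Any


def to_host(decision: str, tool_result: dict[str, Any], reason: str, enforce: bool) -> dict[str, Any]:
    """Shape what the host receives for a gated call (illustrative only)."""
    if enforce and decision == "BLOCK":
        # Enforce mode: withhold the tool output and return an error result instead.
        return {
            "content": [{"type": "text", "text": f"BLOCK: {reason}"}],
            "isError": True,
        }
    # Advisory mode, or any non-BLOCK decision: pass the output through with a label.
    labelled = dict(tool_result)
    # ASSUMPTION: the label location and key name are invented for this sketch.
    labelled["_meta"] = {**tool_result.get("_meta", {}), "cage_decision": decision}
    return labelled


result = {"content": [{"type": "text", "text": "Sydney"}]}
blocked = to_host("BLOCK", result, "sources say Canberra", enforce=True)
advised = to_host("BLOCK", result, "sources say Canberra", enforce=False)
```

In this sketch `blocked` carries `isError: true` and none of the original text, while `advised` keeps the original content and only gains the label.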

Three decisions. One rule. No guessing.

Every gated tool call resolves to exactly one of three decisions, grounded in a calibrated confidence rather than the model's self-assessment.

Gate evaluation — runs on every tool output
PROCEED
≥ 0.72

The output clears the commit threshold. It passes through to the agent unchanged. The per-tool belief updates upward.

FLAG
≥ 0.40, < 0.72

Evidence is thin or partially inconsistent. The output is passed with a flag so the agent (or a human) can treat it as unverified.

BLOCK
< 0.40

A contradiction or failed check. In enforce mode it is returned to the host as an MCP error and the agent does not act on it; in advisory mode (the default) it is labelled and passed through.
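
The three bands above reduce to a few lines. This is a sketch using the two thresholds stated on this page; the constant and function names are illustrative and may not match the ones in the repo.

```python
COMMIT_THRESHOLD = 0.72  # at or above: PROCEED
FLAG_FLOOR = 0.40        # at or above, but below the commit threshold: FLAG


def gate(confidence: float) -> str:
    """Map a calibrated confidence in [0, 1] to exactly one decision."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {confidence}")
    if confidence >= COMMIT_THRESHOLD:
        return "PROCEED"
    if confidence >= FLAG_FLOOR:
        return "FLAG"
    return "BLOCK"


# The three confidences from the example call at the top of the page.
decisions = [gate(c) for c in (0.94, 0.52, 0.26)]
```

A confidence of exactly 0.72 proceeds and exactly 0.40 is flagged, so the bands never overlap.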


Calibration you can reproduce.

The whole point: the gate is better calibrated than the model trusting itself. The benchmark is execution-graded text-to-SQL with phi-3 via Ollama, 5-fold (seed=7), 67.3% accuracy, comparing the cage's calibrated confidence with phi-3's own.

cage vs raw phi-3 · 55-task text-to-SQL · 5-fold, seed=7 · Ollama
metric  ·  cage  ·  raw phi-3
ECE, calibration (lower better)  ·  0.081  ·  0.325
Brier (lower better)  ·  0.174  ·  0.322
catch-rate (higher better)  ·  33%  ·  0%
acts on wrong outputs (lower better)  ·  12  ·  18
AUROC (higher better)  ·  0.544  ·  0.583

One model, one task family (n=55): a calibration result, not a generalization claim. phi-3's raw confidence isn't discriminative — nearly every answer comes back ~1.0, so AUROC sits near chance whether you ask the model or the cage. What the cage buys is calibration: ECE ~4× tighter and a third of wrong answers caught, at zero correct answers blocked. Reproduce with python -m bayesian_cage.eval.sqlbench.run --model phi3 --seed 7 — methodology in the research and the repo.
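
For readers unfamiliar with the two calibration metrics in the table, here is a sketch of how they are commonly computed from per-task confidences and execution-graded outcomes. The ten equal-width bins are an assumption; the binning used by `bayesian_cage.eval.sqlbench` may differ, so treat this as a definition aid rather than the benchmark's code.

```python
def brier(confidences: list[float], correct: list[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(correct)


def ece(confidences: list[float], correct: list[int], n_bins: int = 10) -> float:
    """Expected calibration error over equal-width confidence bins."""
    bins: list[list[tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        index = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[index].append((c, y))
    total = len(correct)
    error = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        error += len(members) / total * abs(mean_confidence - accuracy)
    return error


# A model that always says 1.0 but is right two times out of three.
overconfident_ece = ece([1.0, 1.0, 1.0], [1, 1, 0])
overconfident_brier = brier([1.0, 1.0, 1.0], [1, 1, 0])
```

The toy case mirrors the pattern described above: when nearly every raw confidence is ~1.0, the calibration error is roughly the model's error rate.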


An open roadmap.

The cage is v0.1 and built in the open. The work ahead is about scaling the verification layer and proving calibration on harder ground.

Now

stdio MCP gating + reproducible evals. Gate any stdio MCP server today via BAYESIAN_CAGE_DOWNSTREAM; the SQL calibration bench ships with the repo (bayesian_cage.eval.sqlbench) so anyone can rerun the numbers.

Next

Remote/HTTP downstreams + harder benchmarks. HTTP/remote MCP servers behind the cage, plus runs on external adversarial sets (TruthfulQA, HaluEval) beyond text-to-SQL.

Next

Per-tool gate policy. Per-tool thresholds and verifier routing on top of the shipped heuristic / SQL / JSON / filesystem / ensemble verifiers, so a SQL tool and a web-search tool can be held to different standards.



Not a confidence score.
A posterior probability.

The cage maintains a Beta(α, β) belief per tool, the conjugate prior for the success rate of a binary outcome. α increments when a tool's output verifies, β when it doesn't. The posterior mean α/(α+β) calibrates the gate, and a forgetting factor lets recent behaviour outweigh stale history.

Grounded verification follows Cox (1946) — probability as the only consistent logic of uncertainty — and treats each tool output as a hypothesis to be checked against evidence, not asserted. The commit threshold (0.72) and flag floor (0.40) are named constants, not magic numbers. Every decision is auditable.

Beta(α, β) · posterior mean = α / (α + β) · per-tool belief
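
A minimal sketch of a per-tool belief with forgetting, assuming a uniform Beta(1, 1) prior and exponential decay of both counts before each update. The class name, the prior, and the default factor of 0.98 are illustrative assumptions; the page does not state the cage's actual prior or decay rule.

```python
from dataclasses import dataclass


@dataclass
class ToolBelief:
    alpha: float = 1.0        # ASSUMPTION: uniform Beta(1, 1) prior
    beta: float = 1.0
    forgetting: float = 0.98  # ASSUMPTION: hypothetical decay factor; 1.0 disables forgetting

    def update(self, verified: bool) -> None:
        """Decay old evidence, then count the new verification outcome."""
        self.alpha *= self.forgetting
        self.beta *= self.forgetting
        if verified:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        """Posterior mean alpha / (alpha + beta)."""
        return self.alpha / (self.alpha + self.beta)


# Three failures followed by three successes, with and without forgetting.
steady = ToolBelief(forgetting=1.0)
recent = ToolBelief(forgetting=0.5)
for outcome in (False, False, False, True, True, True):
    steady.update(outcome)
    recent.update(outcome)
```

Without forgetting the belief weighs all six outcomes equally; with it, the recent successes pull the mean above one half, which is the behaviour the paragraph above describes.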

Bayes, 1763 · Cox, 1946 · Pearl, 1988 · Jaynes, 2003