A confidence gate for MCP tool calls.
bayesian-cage is an open-source confidence gate for MCP tool calls. Every tool output comes back PROCEED, FLAG, or BLOCK — with a calibrated confidence you can reproduce.
In enforce mode a BLOCK is withheld and returned to the host as an MCP error, so a compliant client won't act on it; advisory mode — the default — labels the call and passes it through. Independent verification, outside the model's own loop. MIT licensed.
Sits between the agent and its tools.
The cage is a drop-in MCP proxy. It fronts your tool servers, verifies every output before it reaches the agent, and, in enforce mode, refuses to pass on what it can't stand behind.
Front your MCP servers
Point Claude Desktop, Cursor, or any MCP host at the cage instead of directly at your tools. It spawns your real MCP server as a stdio subprocess and is transparent until a check fails.
pipx install bayesian-cage
Every output is verified
Grounding and consistency checks score each tool result against evidence — not against the model's own say-so. The result is a calibrated confidence plus a per-tool belief state that compounds across calls.
verify → calibrate → gate → observe
PROCEED · FLAG · BLOCK
Above the commit threshold the call passes through. Thin evidence is flagged. In enforce mode a contradiction is BLOCKed and returned to the host as an MCP error, so a compliant client never acts on it; advisory mode (the default) labels and passes through.
BLOCK → isError: true
Three decisions. One rule. No guessing.
Every gated tool call resolves to exactly one of three decisions, grounded in a calibrated confidence rather than the model's self-assessment.
PROCEED. The output clears the commit threshold. It passes through to the agent unchanged. The per-tool belief updates upward.
FLAG. Evidence is thin or partially inconsistent. The output is passed with a flag so the agent (or a human) can treat it as unverified.
BLOCK. A contradiction or failed check. In enforce mode the output is withheld and returned to the host as an MCP error, so a compliant agent does not act on it; in advisory mode it is labelled and passed through.
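The three decisions can be sketched in a few lines of Python. This is an illustration, not the cage's actual code: the thresholds 0.72 and 0.40 are the named constants quoted on this page, but the exact mapping from confidence to decision, the names `decide` and `apply_gate`, and the `gate` label key are assumptions made for the example. The `content` / `isError` shape follows the MCP tool-result format.

```python
from enum import Enum

COMMIT_THRESHOLD = 0.72  # commit threshold quoted on this page
FLAG_FLOOR = 0.40        # flag floor quoted on this page


class Decision(Enum):
    PROCEED = "PROCEED"
    FLAG = "FLAG"
    BLOCK = "BLOCK"


def decide(confidence: float, contradiction: bool = False) -> Decision:
    """Map a calibrated confidence to one decision (assumed mapping)."""
    if contradiction or confidence < FLAG_FLOOR:
        return Decision.BLOCK
    if confidence >= COMMIT_THRESHOLD:
        return Decision.PROCEED
    return Decision.FLAG


def apply_gate(result: dict, decision: Decision, enforce: bool = False) -> dict:
    """Return what the host would see for an MCP-style tool result."""
    if decision is Decision.BLOCK and enforce:
        # Enforce mode: withhold the output and surface an MCP error instead.
        return {
            "content": [{"type": "text", "text": "BLOCK: output withheld by gate"}],
            "isError": True,
        }
    # Advisory mode, or a non-BLOCK decision: label the result and pass it on.
    labelled = dict(result)
    labelled["gate"] = decision.value  # hypothetical label key
    return labelled
```

For example, `decide(0.55)` lands between the two constants and yields `Decision.FLAG`, while `apply_gate(result, Decision.BLOCK, enforce=True)` drops the original content and sets `isError` to true.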
Calibration you can reproduce.
The whole point: the gate is better calibrated than the model trusting itself. Execution-graded text-to-SQL — phi-3 via Ollama, 5-fold (seed=7), 67.3% accuracy — the cage's calibrated confidence vs phi-3's own.
| metric | cage | raw phi-3 |
|---|---|---|
| ECE — calibration (lower better) | 0.081 | 0.325 |
| Brier (lower better) | 0.174 | 0.322 |
| catch-rate (higher better) | 33% | 0% |
| acts on wrong outputs (lower better) | 12 | 18 |
| AUROC (higher better) | 0.544 | 0.583 |
One model, one task family (n=55): a calibration result, not a generalization claim. phi-3's raw confidence isn't discriminative — nearly every answer comes back ~1.0, so AUROC sits near chance whether you ask the model or the cage. What the cage buys is calibration: ECE ~4× tighter and a third of wrong answers caught, at zero correct answers blocked. Reproduce with python -m bayesian_cage.eval.sqlbench.run --model phi3 --seed 7 — methodology in the research and the repo.
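For readers unfamiliar with the two calibration metrics in the table, here is a minimal sketch of how ECE and the Brier score are commonly computed from (confidence, correct?) pairs. It assumes equal-width binning with 10 bins for ECE; the binning the bench itself uses may differ, and the function names are illustrative rather than taken from the repo.

```python
def brier_score(confidences, outcomes):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, outcomes)) / len(confidences)


def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, outcomes):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece
```

A model that reports 1.0 on every answer but is right only half the time scores an ECE of 0.5 under this definition; one that reports 0.5 on the same answers scores 0. That gap between stated confidence and observed accuracy is what the ECE row measures.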
An open roadmap.
The cage is v0.1 and built in the open. The work ahead is about scaling the verification layer and proving calibration on harder ground.
stdio MCP gating + reproducible evals. Gate any stdio MCP server today via BAYESIAN_CAGE_DOWNSTREAM; the SQL calibration bench ships with the repo (bayesian_cage.eval.sqlbench) so anyone can rerun the numbers.
Remote/HTTP downstreams + harder benchmarks. HTTP/remote MCP servers behind the cage, plus runs on external adversarial sets (TruthfulQA, HaluEval) beyond text-to-SQL.
Per-tool gate policy. Per-tool thresholds and verifier routing on top of the shipped heuristic / SQL / JSON / filesystem / ensemble verifiers, so a SQL tool and a web-search tool can be held to different standards.
Built in the open.
MIT licensed and contribution-first. The fastest way to help is to run the cage against your own tools and tell us where it's wrong.
Read the code
The gate, the verifiers, and the calibration rig are all in the repo. Star it, fork it, open an issue. Star count is the signal we listen to.
Report a miss
Found a wrong PROCEED or a false BLOCK? That's the most valuable bug we can get. Bring the tool, the input, and what you expected.
Add a verifier
The verifier interface is small on purpose. New grounding strategies and per-domain checks are the highest-leverage contributions.
Reproduce the numbers
Every calibration claim ships with the script that produced it. Rerun it on your own model and post what you get.
Not a confidence score.
A posterior probability.
The cage maintains a Beta(α, β) belief per tool, the conjugate prior for binary (verified / not verified) outcomes. α increments when a tool's output verifies, β when it doesn't. The posterior mean α/(α+β) calibrates the gate, and a forgetting factor lets recent behaviour outweigh stale history.
Grounded verification follows Cox (1946) — probability as the only consistent logic of uncertainty — and treats each tool output as a hypothesis to be checked against evidence, not asserted. The commit threshold (0.72) and flag floor (0.40) are named constants, not magic numbers. Every decision is auditable.
Beta(α, β) · posterior mean = α / (α + β) · per-tool belief
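The per-tool belief update described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the cage's implementation: the uniform Beta(1, 1) prior, the default forgetting factor of 0.98, the choice to decay both counts before each update, and the `ToolBelief` name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToolBelief:
    alpha: float = 1.0        # assumed prior: Beta(1, 1), i.e. uniform
    beta: float = 1.0
    forgetting: float = 0.98  # hypothetical value; 1.0 disables forgetting

    def update(self, verified: bool) -> None:
        """Decay old evidence, then count the new outcome."""
        self.alpha *= self.forgetting
        self.beta *= self.forgetting
        if verified:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        """Posterior mean alpha / (alpha + beta)."""
        return self.alpha / (self.alpha + self.beta)
```

With forgetting disabled, one failure followed by one success leaves the belief at Beta(2, 2) and a mean of 0.5. With a deliberately aggressive factor of 0.5 the same sequence gives a mean of 0.625, because the older failure has been discounted more than the recent success.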