Interactive Demo

Watch the kernel run a confidence-gated pipeline in real time.

No signup. See the belief state, confidence gate, and adversarial verification on a real task.

Sample input: research verification task

Grant Application Excerpt: AI Calibration Research Initiative

This proposal requests $450,000 over 24 months to develop calibrated uncertainty quantification methods for large language model outputs.

Problem: Current LLM evaluation benchmarks measure average accuracy but not confidence calibration. A model scoring 85% on standard benchmarks may be overconfident on the majority of correct answers, producing well-stated but unreliable claims in deployment.

Proposed Method: We apply Bayesian scoring to LLM output, treating each claim as a hypothesis and computing P(claim | evidence) using a conjugate Beta prior updated by retrieval-augmented evidence extraction.

Prior Work: Our team published calibration analysis across three frontier models (NeurIPS 2024). Mean confidence gap was 0.23 across 1,200 sampled claims. Codebase and data released under MIT license.

Team: PI with 9 years in Bayesian statistics (Berkeley, CMU). Co-PI specializing in NLP evaluation methodology.

Budget Justification: $180K salaries, $120K compute, $60K dissemination, $90K indirect costs.

Expected Outcomes: Open-source calibration toolkit, three peer-reviewed papers, reproducible benchmark suite.
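
The Proposed Method in the excerpt describes scoring each claim with a conjugate Beta prior that is updated by retrieved evidence. A minimal sketch of that idea follows. It is not the proposal's or the kernel's actual implementation. It assumes each piece of evidence is reduced to a binary "supports" or "contradicts" vote, and the names `BetaBelief` and `passes_gate` and the 0.8 threshold are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaBelief:
    """Belief that a claim is true, modeled as Beta(alpha, beta).

    Assumption: the default Beta(1, 1) is a uniform prior, meaning no
    initial opinion about the claim.
    """
    alpha: float = 1.0
    beta: float = 1.0

    def update(self, supporting: int, contradicting: int) -> "BetaBelief":
        # Conjugate update: each supporting item adds 1 to alpha,
        # each contradicting item adds 1 to beta.
        return BetaBelief(self.alpha + supporting, self.beta + contradicting)

    @property
    def mean(self) -> float:
        # Posterior mean, used here as the estimate of P(claim | evidence).
        return self.alpha / (self.alpha + self.beta)


def passes_gate(belief: BetaBelief, threshold: float = 0.8) -> bool:
    # Hypothetical confidence gate: let the claim through only when the
    # posterior mean reaches the threshold.
    return belief.mean >= threshold


if __name__ == "__main__":
    prior = BetaBelief()
    weak = prior.update(supporting=3, contradicting=1)
    strong = prior.update(supporting=9, contradicting=0)
    print(f"weak:   mean={weak.mean:.3f} gate={passes_gate(weak)}")
    print(f"strong: mean={strong.mean:.3f} gate={passes_gate(strong)}")
```

Under these assumptions, three supporting and one contradicting item give a posterior of Beta(4, 2) with mean 4/6, which falls below the 0.8 gate. Nine supporting items with none contradicting give Beta(10, 1) with mean 10/11, which clears it. A real pipeline would likely weight evidence by source quality rather than counting votes equally.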