Reference

Bayesian Evaluation Glossary.

Every term in BayesCore's evaluation architecture — defined precisely, with no borrowed vocabulary. From evaluation DNA and IS notation to weighted predicates, posterior confidence, and highest-leverage gaps.

Bayes' Theorem#

The mathematical rule governing belief update under evidence: P(H|E) = P(E|H) · P(H) / P(E). Given hypothesis H and new evidence E, the posterior probability of H equals the likelihood of observing E if H were true, scaled by the prior probability of H and normalized by the marginal probability of E. BayesCore applies this rule to update confidence in each evaluation predicate as document evidence is processed through two adversarial passes.
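
As a worked illustration of the rule above, the following minimal sketch computes P(H|E) for a binary hypothesis. The function name `bayes_posterior` and the numbers are hypothetical, and P(E) is expanded with the law of total probability; this is not taken from BayesCore's code.

```python
def bayes_posterior(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Return P(H|E) for a binary hypothesis H (illustrative helper)."""
    # Marginal P(E) = P(E|H) P(H) + P(E|not H) P(not H).
    p_e = p_e_given_h * prior + p_e_given_not_h * (1.0 - prior)
    if p_e == 0.0:
        raise ValueError("P(E) is zero; the posterior is undefined")
    return p_e_given_h * prior / p_e


# Hypothetical numbers: an even prior, and evidence four times more
# likely when H is true than when it is false.
posterior = bayes_posterior(prior=0.5, p_e_given_h=0.8, p_e_given_not_h=0.2)
print(round(posterior, 3))  # 0.8
```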

Bayesian Confidence Score#

A value (0–1) expressing the degree of belief that an evaluation predicate is satisfied, given the available evidence in a document. Assigned by an LLM reasoning over document evidence at temperature 0 — not a subjective rating and not editorially adjusted. Defaults toward 0 when evidence is absent (absence is evidence against, not neutral). Multiplied by predicate weight and summed across all predicates to produce the document's overall score on a 0–100 scale via the locked formula: Σ(weight × confidence) × 100.

Bayesian Scoring#

An evaluation methodology treating confidence as degree of belief rather than a count of satisfied criteria. Grounded in Bayes (1763), Cox (1946), Pearl (1988, 2000), and Jaynes (2003). The core claim: uncertainty about document quality is best represented probabilistically. A document that partially satisfies many criteria is scored differently — and more accurately — than one that fully satisfies a few.

Bayesian Kernel#

The core runtime component of the BayesCore desktop app. A local Python engine that implements the full Bayesian update cycle: prior → likelihood → posterior → action → observe → update. The kernel maintains a Beta(α, β) distribution per task type (BeliefState), applies prior correction to LLM outputs (IntentEngine), gates action behind a commit threshold, and closes the feedback loop by calling observe() on every task completion. The kernel is architecturally grounded in Bayes' theorem — P(H|E) = P(E|H) · P(H) / P(E) — not as a metaphor but as the literal computation performed at intent routing time.

Belief Compounding#

The property of a Bayesian kernel by which per-task-type belief distributions become more calibrated with each interaction. Each task completion is a binary observation that updates Beta(α, β) via the conjugate rule: α+=1 on success, β+=1 on failure. After N interactions, P(success | task_type) = α/(α+β) is a posterior mean informed by N data points — not a static estimate. A kernel used for 100 document evaluations has a more calibrated routing prior than a kernel used for 10. This compounding is the architectural moat: stateless assistants reset to Beta(1,1) on every session; a Bayesian kernel does not.
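
The conjugate update described above can be sketched in a few lines. The class name `BetaBelief` is hypothetical; `observe` and `p_success` mirror names used in this glossary, but their signatures here are assumptions rather than the kernel's actual interface.

```python
from dataclasses import dataclass


@dataclass
class BetaBelief:
    """Beta(alpha, beta) belief about one binary outcome (illustrative)."""
    alpha: float = 1.0  # Beta(1, 1) is the uninformative starting prior
    beta: float = 1.0

    def observe(self, success: bool) -> None:
        # Conjugate rule: alpha += 1 on success, beta += 1 on failure.
        if success:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    def p_success(self) -> float:
        # Posterior mean alpha / (alpha + beta).
        return self.alpha / (self.alpha + self.beta)


belief = BetaBelief()
for outcome in [True, True, True, False]:  # hypothetical task outcomes
    belief.observe(outcome)

print(belief.alpha, belief.beta)  # 4.0 2.0
print(round(belief.p_success(), 3))  # 0.667
```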

BeliefState#

The persistent probability model maintained by the BayesCore Bayesian kernel — a dictionary of Beta(α, β) distributions keyed by task type, intent label, routing target, or any observable binary outcome. Serialised to disk and restored at app startup. At session boundaries, a forgetting factor (FORGETTING_FACTOR=0.9) decays pseudocounts toward the uninformative prior Beta(1,1) — preventing early observations from permanently dominating. BeliefState.p_success(key) returns the posterior mean α/(α+β), which is the prior P(H) used in the IntentEngine's Bayesian update.

Cox's Theorem#

Richard Cox's 1946 result proving that probability theory is the only consistent extension of Boolean logic to degrees of belief. Any system of reasoning under uncertainty that satisfies basic consistency requirements — transitivity, continuity, and agreement with Boolean logic at certainty — must obey the axioms of probability. Establishes that Bayesian inference is not one possible approach but the uniquely correct one. One of BayesCore's four theoretical foundations.

Coordination Overhead#

The communication and synchronization cost incurred by adding agents to a collaborative inference process. Scales at O(n²) in the number of agents: the n-th person or team to join adds n−1 new communication channels, for n(n−1)/2 in total, and each channel degrades belief transmission fidelity. The Bayesian framing: coordination overhead is a tax on posterior quality. Every handoff introduces noise, delay, and interpretation variance that prevents beliefs from propagating accurately. Growing a team past a threshold increases coordination overhead faster than it increases evidence-gathering capacity, producing net negative returns on inference quality. One of three failure modes in the Bayesian resource allocation model, alongside degraded posterior quality and prior overconfidence.
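
The quadratic growth can be checked with a short calculation. This sketch assumes every pair of agents shares one channel, which is the usual reading of the O(n²) claim; `channel_count` is a hypothetical helper name.

```python
def channel_count(n_agents: int) -> int:
    """Number of pairwise channels among n agents: n(n-1)/2."""
    if n_agents < 0:
        raise ValueError("n_agents must be non-negative")
    return n_agents * (n_agents - 1) // 2


for n in (2, 5, 10, 20):
    print(n, channel_count(n))
# 2 1
# 5 10
# 10 45
# 20 190
```

Going from 5 to 6 agents adds 5 channels, which matches the n−1 increment described above.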

Commit Threshold#

The minimum posterior probability required for the BayesCore IntentEngine to commit to an action. Set at INTENT_COMMIT_THRESHOLD=0.72. When the top intent's posterior P(intent|query) exceeds 0.72, the kernel executes the corresponding agent. When it does not, the kernel asks a minimal clarifying question designed to maximise entropy reduction over the ambiguous intents — not a generic "what do you mean?" The commit threshold implements the Bayesian decision rule: act only when the posterior is sufficiently concentrated. Below threshold, the cost of a wrong action exceeds the expected value of executing.
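
A minimal sketch of the gate, assuming the posterior arrives as a mapping from intent label to probability. The function `decide`, its return shape, and the intent labels are hypothetical; only the constant name and its value come from this glossary.

```python
INTENT_COMMIT_THRESHOLD = 0.72


def decide(posterior: dict[str, float]) -> tuple[str, str]:
    """Return ("execute", intent) or ("clarify", intent) for the top intent."""
    top_intent = max(posterior, key=posterior.get)
    # The posterior must strictly exceed the threshold to commit.
    if posterior[top_intent] > INTENT_COMMIT_THRESHOLD:
        return ("execute", top_intent)
    return ("clarify", top_intent)


print(decide({"evaluate_document": 0.80, "summarize": 0.20}))
# ('execute', 'evaluate_document')
print(decide({"evaluate_document": 0.55, "summarize": 0.45}))
# ('clarify', 'evaluate_document')
```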

DNA Extraction#

The first operation in a BayesCore evaluation. A structured document is read at temperature 0 to surface the implicit evaluation criteria it already contains — the predicates it must satisfy to prove its root hypothesis. The extraction is deterministic: the same document class produces the same predicate structure on every run, regardless of document instance. No predefined templates or forms are required; criteria emerge from the document itself.

Document Class#

A category of structured artifacts that share the same implicit evaluation criteria. Each class implies a specific root hypothesis: agent outputs imply IS(output, verified); research briefs imply IS(evidence, sufficient); product specs imply IS(product, claims_supported); policy memos imply IS(recommendation, evidence_based); hiring rubrics imply IS(candidate, hireable). Different instances of the same class share a predicate structure extracted once and reused across the evaluation domain.

Evaluation DNA#

The complete set of evaluation predicates implicit in a document — together with their weights and the root hypothesis they collectively test. Named by analogy: just as DNA encodes biological function, evaluation DNA encodes the criteria a document must satisfy to prove its root hypothesis. Evaluation DNA is extracted at temperature 0, making it deterministic and document-derived rather than template-imposed. Two documents of the same class share the same DNA structure; two documents of different classes do not.

Evaluation Domain#

A reusable predicate structure saved as a shareable UUID — the root hypothesis, predicates, and weights for a specific document class. Domains are either built-in (the Document Soundness domain ships with BayesCore, free) or custom (extracted from any structured document and persisted via the API). Once a domain exists, it can be applied to any document of the same class via POST /api/scan. Custom domains are a Pro feature.

Evidence-Based Scoring#

A scoring methodology in which confidence estimates are grounded exclusively in observable document content rather than subjective judgment, rubric completion, or qualitative impression. Each predicate confidence score is derived from document text reviewed in two adversarial passes — one extracting supporting evidence, one searching for counter-evidence and gaps. A document cannot score well by assertion alone; the evidence must be present.

EV-Ranked Scheduling#

The task ordering mechanism of the BayesCore ProbabilisticScheduler. Tasks are queued and executed in descending order of expected value: EV = P(success | task_type) × utility − cost. P(success | task_type) is read directly from BeliefState — the same posterior means that power intent routing. This coherence requirement ensures a single belief model governs both what to do and when to do it. A task with high utility but low P(success) and high cost can rank below a lower-utility task with better odds. EV-ranking replaces FIFO, user-declared priority, and static ordering with a Bayesian decision criterion.
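
The ordering rule can be sketched as a sort on expected value. The `Task` dataclass, `rank_by_ev`, and the task names and numbers are hypothetical; in the kernel, P(success | task_type) would be read from BeliefState rather than passed in by hand.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    p_success: float  # posterior mean for this task type
    utility: float
    cost: float

    @property
    def expected_value(self) -> float:
        # EV = P(success | task_type) * utility - cost
        return self.p_success * self.utility - self.cost


def rank_by_ev(tasks: list[Task]) -> list[Task]:
    """Return tasks in descending order of expected value."""
    return sorted(tasks, key=lambda task: task.expected_value, reverse=True)


tasks = [
    # High utility, but low odds and high cost.
    Task("deep_audit", p_success=0.3, utility=10.0, cost=2.5),
    # Lower utility with better odds and lower cost.
    Task("quick_scan", p_success=0.9, utility=4.0, cost=0.5),
]
print([task.name for task in rank_by_ev(tasks)])
# ['quick_scan', 'deep_audit']
```

With these made-up numbers the lower-utility task ranks first, which is the case the entry describes.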

Evaluation Graph#

BayesCore's internal representation of an evaluation domain as a directed acyclic graph. Nodes are predicates. The root node is the root hypothesis expressed in IS(subject, criterion) notation. Edges encode how predicate evidence propagates to the root belief. The architecture draws on Bayesian network theory — specifically Pearl's formalization of probabilistic inference in directed graphs (Probabilistic Reasoning in Intelligent Systems, 1988) — but the evaluation graph structure and IS notation are BayesCore's own design, not a named concept from Pearl's work.

FORGETTING_FACTOR#

A decay coefficient (value: 0.9) applied to BeliefState pseudocounts at every session boundary. Prevents early observations from permanently dominating the posterior. At session end: α_new = 1.0 + (α − 1.0) × 0.9; β_new = 1.0 + (β − 1.0) × 0.9. This shrinks pseudocounts toward Beta(1,1) — the uninformative prior — without resetting entirely. A value of 1.0 would preserve all history indefinitely (overconfidence risk). A value of 0.0 would reset to the uninformative prior on every session (no compounding). 0.9 balances recent-evidence dominance against long-run calibration.
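
The decay formulas above translate directly into code. The function name `decay` and the example pseudocounts are hypothetical; the constant and the arithmetic follow the definition in this entry.

```python
FORGETTING_FACTOR = 0.9


def decay(alpha: float, beta: float, factor: float = FORGETTING_FACTOR) -> tuple[float, float]:
    """Shrink Beta pseudocounts toward the uninformative prior Beta(1, 1)."""
    alpha_new = 1.0 + (alpha - 1.0) * factor
    beta_new = 1.0 + (beta - 1.0) * factor
    return alpha_new, beta_new


# Hypothetical belief after 10 successes and 2 failures: Beta(11, 3).
alpha, beta = decay(11.0, 3.0)
print(round(alpha, 3), round(beta, 3))  # 10.0 2.8

# The two limiting cases described above.
print(decay(11.0, 3.0, factor=1.0))  # (11.0, 3.0): all history kept
print(decay(11.0, 3.0, factor=0.0))  # (1.0, 1.0): full reset
```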

Grade Band#

A letter-grade mapping applied to the final document score. Fixed across all evaluation domains: A (85–100), B (70–84), C (55–69), D (40–54), F (0–39). The grade bands are calibrated against the scoring formula — not against a distribution of user submissions — so a Grade A means the document provides strong evidence across its predicate structure, not merely that it outperformed other documents. BayesCore scored itself 97/100, Grade A on IS(document, claims_supported).
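
The band mapping can be sketched as a lookup on lower bounds. `grade_band` is a hypothetical helper; the bands are stated for whole-number scores, so treating a fractional score such as 84.5 as a B is an assumption of this sketch.

```python
def grade_band(score: float) -> str:
    """Map a 0-100 document score to its letter grade."""
    if not 0 <= score <= 100:
        raise ValueError("score must be in [0, 100]")
    # Lower bound of each band, highest first.
    for lower_bound, letter in ((85, "A"), (70, "B"), (55, "C"), (40, "D")):
        if score >= lower_bound:
            return letter
    return "F"


print(grade_band(97))  # A
print(grade_band(46))  # D
```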

Highest-Leverage Gap#

The single failing predicate whose improvement would produce the largest increase in the overall confidence score. Computed as weight × (threshold − confidence) across all predicates, returning the one with the maximum value. The most efficient point of intervention — fixing the highest-leverage gap before any other predicate yields more score per unit of effort than any alternative action. BayesCore returns exactly one per evaluation.
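
A minimal sketch of the computation, assuming a single pass threshold shared by all predicates. The threshold value of 0.7, the confidence values, and the function name are hypothetical, since this glossary does not state them; the three weights are taken from the Document Soundness domain.

```python
from typing import Optional


def highest_leverage_gap(
    weights: dict[str, float],
    confidences: dict[str, float],
    threshold: float,
) -> Optional[str]:
    """Return the failing predicate with the largest weight * (threshold - confidence)."""
    gaps = {
        name: weights[name] * (threshold - confidence)
        for name, confidence in confidences.items()
        if confidence < threshold  # only failing predicates count
    }
    if not gaps:
        return None  # nothing is failing
    return max(gaps, key=gaps.get)


weights = {"central_claim": 0.18, "evidence_support": 0.16, "next_steps": 0.08}
confidences = {"central_claim": 0.9, "evidence_support": 0.3, "next_steps": 0.1}

print(highest_leverage_gap(weights, confidences, threshold=0.7))
# evidence_support
```

Here next_steps has the lowest confidence, but evidence_support carries twice the weight, so its gap (0.16 × 0.4) outranks the other (0.08 × 0.6).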

IS Network#

Informal shorthand for BayesCore's evaluation graph when framed around its IS(subject, criterion) root hypothesis. The IS(subject, criterion) notation is BayesCore's own primitive — IS stands for the binary claim "Is this [subject] a [criterion]?". The evaluation graph makes that hypothesis explicit and structures predicate evidence beneath it.

INTENT_COMMIT_THRESHOLD#

The named constant (value: 0.72) that gates action in the BayesCore IntentEngine. The top intent's posterior probability must exceed 0.72 for the kernel to commit to execution. Below 0.72, the posterior is too diffuse — at least two intents are plausible enough that executing the top one risks a costly misroute. The threshold of 0.72 is set to balance two error types: false commits (executing the wrong intent) and false hesitations (asking for clarification when the intent was obvious). Lowering it increases throughput at the cost of more misroutes; raising it reduces misroutes at the cost of more clarification prompts.

IntentEngine#

The BayesCore kernel module that applies Bayes' theorem to intent routing. The LLM at temperature=0 produces a calibrated probability distribution over possible intents — this is the likelihood P(E|H). The IntentEngine multiplies each intent's likelihood by its prior P(H) from BeliefState, renormalises the product, and produces the posterior P(H|E). This is not a retrieval or classification step — it is literal application of P(H|E) ∝ P(E|H) · P(H). The top posterior drives routing; the commit threshold gates action. If the top posterior is below INTENT_COMMIT_THRESHOLD, the engine formulates a clarifying question targeting maximum entropy reduction.
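
How the engine chooses its clarifying question is not specified here. As a hedged illustration of the quantity involved, the sketch below computes the Shannon entropy of a posterior over intents, which is one standard way to measure how diffuse it is; `entropy_bits` and the intent labels are hypothetical.

```python
import math


def entropy_bits(distribution: dict[str, float]) -> float:
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)


diffuse = {"evaluate_document": 0.5, "summarize": 0.5}
concentrated = {"evaluate_document": 0.95, "summarize": 0.05}

print(round(entropy_bits(diffuse), 3))  # 1.0
print(round(entropy_bits(concentrated), 3))  # 0.286
```

A question whose answer is expected to move the posterior from the first shape toward the second removes most of the uncertainty in one step.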

Inference Bottleneck#

The step in the belief-updating chain (evidence gathering, belief updating, or coordination) that most limits organizational output. The primary diagnostic question for resource allocation: where is the bottleneck? If evidence is sparse, adding resources to gather more will update posteriors meaningfully. If beliefs are updated poorly despite sufficient evidence, more resources are irrelevant because the reasoning process itself is broken. If coordination overhead dominates, adding people worsens the bottleneck. Additional resources help only in the first case; the other two require structural changes. Identifying the inference bottleneck before allocating resources is the Bayesian alternative to headcount growth as a default response to falling output.

IS Notation#

The formal expression of a root hypothesis as IS(subject, criterion). IS(proposal, claims_supported) reads: 'Does this proposal support its claims?' IS(application, approved) reads: 'Is this application approved?' IS(evidence, sufficient) reads: 'Is the evidence sufficient?' IS(recommendation, evidence_based) reads: 'Is this recommendation grounded in evidence?' The notation makes the evaluation hypothesis explicit and testable before any evidence is gathered. Every evaluation domain has exactly one root hypothesis in IS notation.

Jaynes' Probability Theory#

E.T. Jaynes' 2003 formulation of probability theory as extended logic — a normative framework for reasoning under uncertainty derived from first principles. Jaynes established that Bayesian inference is not one interpretation of probability among many but the uniquely correct method for updating beliefs given evidence, under any consistent set of desiderata. Probability Theory: The Logic of Science (2003) is one of BayesCore's four theoretical foundations alongside Bayes (1763), Cox (1946), and Pearl (1988, 2000).

Local LLM#

A large language model running on the user's own hardware — not a cloud API. The BayesCore desktop app routes raw document content to a local LLM (via Ollama, LM Studio, or any OpenAI-compatible endpoint) for processing. Only extracted scored claims — never the raw document text, user queries, or BeliefState — cross the network boundary to the BayesCore scoring API. This architecture preserves two properties simultaneously: user privacy (sensitive documents never leave the device) and BayesCore's server-side scoring moat (predicates, DNA formula, and domain manifests remain server-side and cannot be decompiled).

Pearl's Bayesian Networks#

Judea Pearl's framework for probabilistic reasoning in directed acyclic graphs, introduced in Probabilistic Reasoning in Intelligent Systems (1988) and extended via do-calculus and causal inference in Causality (2000). Pearl established how beliefs propagate through a graph of conditionally dependent variables, and how interventions (do-calculus) differ from observations. BayesCore's evaluation architecture draws on this framework — predicate nodes correspond to variables in a Bayesian network, and the scoring formula formalizes how evidence at the predicate level updates confidence in the root hypothesis.

Posterior Probability#

In Bayesian inference, the updated degree of belief in a hypothesis after incorporating evidence: P(H|E). Contrasts with the prior — the baseline belief before any evidence is considered. BayesCore's confidence scores are structured analogously: an LLM reasons over document evidence at temperature 0, defaulting to near-zero when evidence is absent and assigning higher values only when explicit, verifiable support exists. A score of 0.85 means the document provides strong explicit evidence for that predicate; 0.2 means the evidence is weak, absent, or contradictory.

Posterior Quality#

The accuracy and calibration of beliefs after evidence is incorporated. High-quality posteriors are proportional to the evidence: neither overconfident nor underconfident. Low-quality posteriors arise from three sources, one for each step where an inference bottleneck can sit: insufficient evidence (evidence gathering), poor updating methods (belief updating), or coordination overhead that degrades belief transmission across agents (coordination). Posterior quality, not evidence volume or team size, is the proximate determinant of decision quality. In BayesCore's evaluation model, every resource allocation question ultimately reduces to: will this intervention improve posterior quality at the current bottleneck step?

Predicate#

A binary variable representing one evaluation criterion in an evaluation domain. Each predicate poses a yes/no question about the document — for example, "Does the document provide validated demand signals?" or "Is the acquisition channel specified and defensible?" Predicates have two properties: a weight (relative importance in the domain) and a confidence score (LLM-assigned confidence 0.0–1.0 based on evidence in the document). The built-in Document Soundness domain has 8 fixed predicates. Pro users can extract custom predicates from any source document via DNA extraction.

Predicate-Based Evaluation#

An evaluation methodology that decomposes a root hypothesis into independently weighted binary criteria — predicates — each assessed separately and combined via the weighted scoring formula. Predicate-based evaluation differs from holistic scoring (a single overall impression) and from checklist scoring (all items equally weighted) by treating each criterion as a probabilistic variable with its own weight and confidence. The result is a decomposable, auditable score with a clear attribution of where evidence is strong or weak.

Predicate Weight#

A coefficient (0–1) assigned to each predicate reflecting its relative importance to the root hypothesis. Weights sum to 1 across all predicates in a domain. In the built-in Document Soundness domain IS(document, claims_supported), the eight predicate weights are: central_claim 18%, evidence_support 16%, scope_defined 14%, assumptions_stated 14%, success_criteria 12%, risks_acknowledged 12%, next_steps 8%, internal_consistency 6%. Custom domains (Pro) derive weights from the source document.

Prior Correction#

The operation performed by the IntentEngine to transform a raw LLM output into a Bayesian posterior. The LLM at temperature=0 produces P(E|H) — a calibrated probability distribution expressing how likely each intent is given the query evidence. Prior correction multiplies each intent's likelihood by its prior P(H) from BeliefState, then renormalises so probabilities sum to 1.0. The result is P(H|E) ∝ P(E|H) · P(H) — a posterior that combines the LLM's in-context judgment with the kernel's accumulated history of what has worked. Without prior correction, every routing decision treats all intents as equally likely before evidence is considered. With it, a task type that has succeeded 80% of the time gets a boosted posterior relative to one that has succeeded 30% of the time, even with identical LLM likelihoods.
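
The multiply-and-renormalise step can be sketched as follows. The function name `prior_correct` and the intent labels are hypothetical; the 80% and 30% success rates and the identical likelihoods follow the example in this entry. The priors are per-intent success rates, so they need not sum to 1 before renormalisation.

```python
def prior_correct(likelihoods: dict[str, float], priors: dict[str, float]) -> dict[str, float]:
    """Posterior P(H|E) proportional to P(E|H) * P(H), renormalised to sum to 1."""
    unnormalised = {
        intent: likelihoods[intent] * priors[intent] for intent in likelihoods
    }
    total = sum(unnormalised.values())
    if total == 0.0:
        raise ValueError("all products are zero; the posterior is undefined")
    return {intent: value / total for intent, value in unnormalised.items()}


# Identical LLM likelihoods, different histories of success.
likelihoods = {"evaluate_document": 0.5, "summarize": 0.5}
priors = {"evaluate_document": 0.8, "summarize": 0.3}

posterior = prior_correct(likelihoods, priors)
print({intent: round(p, 3) for intent, p in posterior.items()})
# {'evaluate_document': 0.727, 'summarize': 0.273}
```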

Prior Overconfidence (Organizational)#

A systematic bias in which resource abundance shifts organizational priors toward overconfidence — the implicit prior that the current approach is working. When resources are plentiful, the evidence threshold required to trigger belief revision rises: teams can continue on a failing path longer before consequences force an update. Resource scarcity counteracts this by making every negative signal immediately costly, enforcing aggressive belief updating. The Bayesian explanation for why constrained teams frequently outperform well-resourced ones on per-unit output: scarcity suppresses this bias by collapsing the gap between evidence arrival and belief revision. The 46/100 self-score BayesCore published at launch is a deliberate defense against this bias.

Prior Probability#

The baseline degree of belief in a predicate before any document evidence is considered. In BayesCore's evaluation, a prior can reflect the base rate of predicate satisfaction for a given document class. By default, however, BayesCore uses a non-informative prior: the document must establish predicate confidence through its own content, with no assumed credit for belonging to a particular category. The two adversarial passes update the prior into a posterior.

Root Hypothesis#

The testable proposition that defines what an artifact is trying to prove, expressed in IS notation as IS(subject, criterion). Every evaluation domain has exactly one root hypothesis. Predicates are the sub-claims that collectively constitute evidence for or against it. The root hypothesis is made explicit before scoring begins — it is the evaluative question the artifact must answer. Example: an agent output is implicitly claiming IS(output, verified). BayesCore makes that claim explicit and tests it predicate by predicate.

Temperature 0#

The inference temperature at which BayesCore runs all evaluation passes: DNA extraction, Pass 1 (evidence), and Pass 2 (adversarial). Temperature 0 suppresses stochastic sampling and maximizes determinism: the same document produces the same predicate structure and confidence scores on every run. This is a non-negotiable design constraint. A non-zero temperature introduces sampling variance into the confidence values that feed the locked formula, score = Σ(weight × confidence) × 100, so the same document could score differently from run to run. Evaluation criteria must be discovered from the document, not invented by random sampling.

Two-Pass Adversarial Evaluation#

BayesCore's core scoring mechanism. Pass 1 (evidence extraction): reads the document and identifies all content that supports each predicate — direct statements, data, examples, and citations. Pass 2 (counter-evidence): actively searches for gaps, omissions, contradictions, and missing information that undercut predicate confidence. Both passes inform the final confidence score per predicate. The adversarial structure prevents confirmation bias: it is not enough for a document to assert a predicate; the document must also survive scrutiny for what it fails to say.

Weighted Predicate Scoring#

The scoring formula underlying every BayesCore evaluation: score = Σ(weight × confidence) × 100. Each predicate's confidence (0–1) is multiplied by its weight (0–1), summed across all predicates in the domain, and scaled to a 0–100 integer. The formula is fixed across all domains — the same arithmetic governs pitch decks, grant proposals, product specs, and any other document class. What varies between domains is the predicate structure and weight distribution, not the formula itself.
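
As a worked instance of the formula, the sketch below scores a document against the eight Document Soundness weights listed under Predicate Weight. The confidence values and the function name `document_score` are hypothetical, and the result is left unrounded because the rounding rule is not stated here.

```python
WEIGHTS = {
    "central_claim": 0.18,
    "evidence_support": 0.16,
    "scope_defined": 0.14,
    "assumptions_stated": 0.14,
    "success_criteria": 0.12,
    "risks_acknowledged": 0.12,
    "next_steps": 0.08,
    "internal_consistency": 0.06,
}


def document_score(weights: dict[str, float], confidences: dict[str, float]) -> float:
    """score = sum(weight * confidence) * 100, on a 0-100 scale."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * confidences[name] for name in weights) * 100


# Hypothetical per-predicate confidences for one document.
confidences = {
    "central_claim": 0.9,
    "evidence_support": 0.8,
    "scope_defined": 0.7,
    "assumptions_stated": 0.5,
    "success_criteria": 0.6,
    "risks_acknowledged": 0.4,
    "next_steps": 0.9,
    "internal_consistency": 1.0,
}
print(round(document_score(WEIGHTS, confidences), 1))  # 71.0
```

A score of 71 falls in the B band (70–84).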