The Bayesian Case for Scoring Documents Instead of Reviewing Them

The problem with qualitative review

The standard model of document evaluation is qualitative review: a person reads a pitch, a proposal, or a spec and renders a verdict. This model has well-documented failure modes. Reviewers anchor on early impressions. They weight confident language over evidenced claims. They penalize unfamiliar formats. They disagree with each other.

These failures are not random errors — they are systematic biases. They produce evaluations that vary not just between reviewers but within the same reviewer across different contexts. The same pitch that gets funded at one firm gets declined at another, not because the pitch changed, but because the evaluation process is not calibrated.

The Bayesian alternative is not "use AI to review the pitch." It is something more principled: express the evaluation as a probability problem, apply the axioms of probability theory, and produce a number that means something.

Cox's theorem and why it matters

In 1946, Richard T. Cox published "Probability, Frequency and Reasonable Expectation," which showed that, given a small set of common-sense requirements on reasoning, any consistent method for representing degrees of belief under uncertainty must satisfy the rules of probability theory. This is not a philosophical preference; it is a mathematical result. If you want your beliefs to be internally consistent, they must behave like probabilities.

The consequence for evaluation is direct: a qualitative review that produces a verdict ("promising," "not ready," "pass") is not expressing a probability. It cannot be updated consistently as new evidence arrives. It cannot be aggregated across evaluators. It cannot be compared across time. A Bayesian confidence score can do all three.

Bayescore's score is a posterior probability: P(criterion | evidence in document). Given the evidence the document contains — the customer interviews cited, the demand signals documented, the acquisition channel specified — what is the probability that the document meets the criterion? This is not a rating. It is a degree of belief, expressed in the only consistent language available for degrees of belief.
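
Written out with Bayes' theorem, the posterior looks like the expression below. The shorthand is an assumption made here for illustration and is not Bayescore's own notation: C stands for "the document meets the criterion" and E for "the evidence found in the document."

```latex
P(C \mid E) = \frac{P(E \mid C)\,P(C)}{P(E \mid C)\,P(C) + P(E \mid \neg C)\,P(\neg C)}
```

The prior P(C) is what you believed before reading; the two likelihood terms say how much more often this evidence shows up in documents that meet the criterion than in documents that do not.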

Pearl's Bayesian networks and the IS notation

Judea Pearl's 1988 book Probabilistic Reasoning in Intelligent Systems introduced Bayesian networks — directed acyclic graphs where each node represents a random variable and each edge encodes a probabilistic dependency. Pearl showed that complex joint probability distributions over many variables could be efficiently represented and updated using these graphical structures.

Bayescore's IS(subject, criterion) notation is its own evaluation primitive, structurally informed by Pearl's Bayesian network formalism. Each domain is a simplified belief network: a binary root node — IS(subject, criterion) — with weighted evidence nodes (predicates) feeding into it. The predicates are binary: either the evidence is present in the document, or it is not. The weight on each predicate reflects its prior importance to the root hypothesis.

The formula that computes the final score, Σ(weight × confidence) × 100, is a weighted sum of the confidence values across all predicate nodes. It is not a scoring rubric. It is a simplified Bayesian belief update, expressed in a form that scales across any document type and any evaluation criterion.
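
The weighted sum can be sketched in a few lines of Python. This is a minimal illustration, not Bayescore's implementation: the `Predicate` class, the `score` function, the example weights, and the requirement that weights sum to 1 (so the result lands between 0 and 100) are all assumptions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    """One evidence node feeding the root IS(subject, criterion) node."""
    name: str
    weight: float      # prior importance; assumed to sum to 1 across predicates
    confidence: float  # degree of belief that the evidence is present, in [0, 1]


def score(predicates: list[Predicate]) -> float:
    """Return sum(weight * confidence) * 100 over all predicates."""
    total_weight = sum(p.weight for p in predicates)
    if not math.isclose(total_weight, 1.0, abs_tol=1e-9):
        raise ValueError(f"weights must sum to 1, got {total_weight}")
    for p in predicates:
        if not 0.0 <= p.confidence <= 1.0:
            raise ValueError(f"{p.name}: confidence must be in [0, 1]")
    return sum(p.weight * p.confidence for p in predicates) * 100


# Hypothetical pitch domain; names and numbers are illustrative only.
pitch = [
    Predicate("customer_validation", weight=0.5, confidence=1.0),
    Predicate("demand_signal", weight=0.3, confidence=0.5),
    Predicate("acquisition_channel", weight=0.2, confidence=0.0),
]
pitch_score = score(pitch)  # (0.5*1.0 + 0.3*0.5 + 0.2*0.0) * 100
```

With these made-up inputs the pitch scores roughly 65: full credit for customer validation, half credit for the demand signal, and nothing for the missing acquisition channel.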

Jaynes and the epistemological foundation

E.T. Jaynes spent decades arguing that probability theory is not just a tool for frequency-based statistics — it is the correct language for reasoning under uncertainty in any domain. His 2003 book Probability Theory: The Logic of Science made this case systematically: probability is extended logic, the generalization of classical Boolean logic to degrees of certainty.

The Jaynes perspective justifies applying Bayesian methods to document evaluation in particular. Documents are not random samples from a population. They are structured arguments made under conditions of incomplete information. Probability theory — as extended logic — is exactly the right framework for reasoning about structured arguments under uncertainty. You do not need frequency data to apply Bayes. You need a prior and a likelihood. Both are available for any evaluation domain.
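
To make the "prior and likelihood" point concrete, here is a small numeric sketch of a single Bayes update. The function name `bayes_update` and every number in it are invented for illustration; none of them come from Bayescore.

```python
def bayes_update(prior: float, p_e_given_h: float, p_e_given_not_h: float,
                 observed: bool = True) -> float:
    """Posterior P(H | E) if the evidence was observed, else P(H | not E)."""
    if not observed:
        # Update on the absence of evidence by using complementary likelihoods.
        p_e_given_h = 1.0 - p_e_given_h
        p_e_given_not_h = 1.0 - p_e_given_not_h
    numerator = p_e_given_h * prior
    evidence = numerator + p_e_given_not_h * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("observation has zero probability under both hypotheses")
    return numerator / evidence


# Illustrative numbers only: H = "the pitch has validated customer demand",
# E = "the document cites customer interviews".
prior = 0.30
posterior_present = bayes_update(prior, p_e_given_h=0.8, p_e_given_not_h=0.2)
posterior_absent = bayes_update(prior, 0.8, 0.2, observed=False)
```

No frequency table is involved. With these assumed likelihoods, seeing the evidence raises the belief above the prior and not seeing it pushes the belief below the prior.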

What two-pass adversarial evaluation adds

The theoretical framework tells you how to combine evidence into a score. It does not tell you how to extract the evidence from a document. This is the role of the two-pass evaluation design.

Pass 1 (supportive): extract all evidence that supports each predicate. What does the document say that counts in favor of customer validation? What counts in favor of the demand signal? This pass is generous: every piece of genuine evidence is extracted.

Pass 2 (adversarial): extract all evidence against each predicate. What would a skeptic say is missing? What should be in a document of this type but isn't? This pass is the mechanism by which absence becomes informative. A document that makes no mention of customer interviews fails the customer_validation predicate under adversarial review — because customer discovery is what that class of document is supposed to address, and the omission is a choice, not an oversight.

The two-pass design operationalizes the Bayesian principle of updating on all available evidence, including the absence of expected evidence. Single-pass evaluation cannot do this, because it can only surface what is present.
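
One way the two passes might be merged into a single confidence per predicate is sketched below. The source does not specify the combination rule, so the rule here (discount the supportive finding by the strength of the adversarial objection) is purely an assumption, as are the function name `combine_passes` and the sample findings.

```python
def combine_passes(expected: list[str],
                   support: dict[str, float],
                   objections: dict[str, float]) -> dict[str, float]:
    """Merge supportive and adversarial findings into one confidence per predicate.

    support[name]    : strength of evidence found in pass 1, in [0, 1]
    objections[name] : strength of the skeptic's case from pass 2, in [0, 1]
    """
    confidence = {}
    for name in expected:
        # A predicate the supportive pass never mentions counts as absent (0.0),
        # so silence in the document lowers the score instead of being ignored.
        s = support.get(name, 0.0)
        o = objections.get(name, 0.0)
        # Assumed rule: discount support by the strength of the objection.
        confidence[name] = s * (1.0 - o)
    return confidence


# Hypothetical findings for a pitch that never mentions customer interviews.
expected = ["customer_validation", "demand_signal", "acquisition_channel"]
pass1 = {"demand_signal": 0.9, "acquisition_channel": 0.6}
pass2 = {"customer_validation": 1.0, "acquisition_channel": 0.5}
result = combine_passes(expected, pass1, pass2)
```

The list of expected predicates is what lets an omission register: customer_validation receives a confidence of zero even though the document never raised the topic.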

Why this is not another AI evaluation tool

Most AI document evaluation tools use a large language model to read a document and produce qualitative feedback. The model is prompted to "assess the strengths and weaknesses" or "score this document from 1 to 10." These tools inherit the exact failure modes of human reviewers: they anchor on confident language, they vary across runs, and they cannot be updated consistently as evidence changes.

Bayescore uses an LLM differently: as an evidence extractor, not an evaluator. The model reads the document and answers a specific, closed question for each predicate: is this evidence present, absent, or partial? That constrained output is then fed into a locked formula. The model provides the evidence extraction; probability theory provides the evaluation. The score is not the model's opinion. It is what Cox, Pearl, and Jaynes would compute from the evidence the model found.
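
The separation between extractor and formula can be sketched as a strict parser: the model may only answer with one of three labels, and anything else is rejected before it reaches the score. The names `Evidence`, `parse_label`, and `to_confidences`, and the mapping of "partial" to 0.5, are assumptions made for illustration; no real model is called.

```python
from enum import Enum


class Evidence(Enum):
    PRESENT = "present"
    PARTIAL = "partial"
    ABSENT = "absent"


# Assumed mapping from label to confidence; 0.5 for "partial" is a placeholder.
CONFIDENCE = {Evidence.PRESENT: 1.0, Evidence.PARTIAL: 0.5, Evidence.ABSENT: 0.0}


def parse_label(raw: str) -> Evidence:
    """Accept only the three allowed labels; reject free-form model output."""
    try:
        return Evidence(raw.strip().lower())
    except ValueError:
        raise ValueError(f"unexpected extractor output: {raw!r}") from None


def to_confidences(labels: dict[str, str]) -> dict[str, float]:
    """Convert raw per-predicate labels into confidences for the fixed formula."""
    return {name: CONFIDENCE[parse_label(raw)] for name, raw in labels.items()}


# Stand-in for what an extractor might return.
raw_labels = {"customer_validation": "Present", "demand_signal": " partial ",
              "acquisition_channel": "absent"}
confidences = to_confidences(raw_labels)
```

Because the parser refuses answers such as "8/10" or "looks promising", the model has no channel for slipping an opinion into the score.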

Get a calibrated probability score, not a review.

Drop any structured document. Bayescore extracts the evaluation DNA, runs two-pass adversarial scoring, and returns a Bayesian confidence score grounded in Cox, Pearl, and Jaynes — not reviewer opinion.
