What Your Score Is Actually Measuring

What the score formula computes

The score is not a rating. It is the output of a locked formula: Σ(weight × confidence) × 100, summed over every predicate in the evaluation domain. Each predicate carries a fixed weight, and the weights sum to 1. The LLM assigns each predicate a confidence value between 0.0 and 1.0 based on the evidence the document contains. The score is the weighted sum of those confidence values, scaled to 100, so it always falls between 0 and 100.
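The formula above can be sketched in a few lines of Python. This is a minimal illustration, not Bayescore's actual implementation: the function name `score` and the three-predicate domain with its weights and confidence values are invented for the example.

```python
def score(weights, confidences):
    """Return sum(weight * confidence) * 100 over all predicates."""
    # The text states that weights sum to 1; check it within a small tolerance.
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return 100 * sum(w * confidences[name] for name, w in weights.items())


# Hypothetical three-predicate domain, invented for illustration only.
weights = {"a": 0.5, "b": 0.3, "c": 0.2}
confidences = {"a": 0.8, "b": 0.1, "c": 0.0}

print(round(score(weights, confidences), 1))  # 43.0
```

Predicate "c" has no supporting evidence, so its 20% of the weight contributes nothing to the total, which is the behaviour the next paragraph describes.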

A document with no evidence — no customer interviews, no demand signals, no market data — receives confidence values near zero on the predicates that require that evidence. The score reflects the absence. This is not a judgment about the business. It is an accurate report of what the document contains.

Predicate weights are fixed per domain. In the self-evaluation domain, customer validation carries 18% weight, demand signal carries 16%, and risk identification carries 6%. These weights reflect a considered judgment about the relative importance of each factor to the root hypothesis — not empirical regression coefficients, but explicit, disputable, published priors.

How evidence moves the score

Each predicate gets a confidence score from 0.0 to 1.0. Strong, specific evidence produces high confidence. Absent or vague evidence produces low confidence. The two-pass adversarial evaluation — one supportive pass, one adversarial — is designed to surface both what is present and what is missing. The supportive pass extracts genuine evidence. The adversarial pass finds gaps that the supportive pass would have left unchallenged.

The weight of each predicate determines how much its confidence value contributes to the total. At equal confidence, a predicate carrying 18% weight contributes three times as much as one carrying 6%. This is why the highest-weight failing predicate is the highest-leverage item at any score level, rather than the first predicate alphabetically or the one you feel best about addressing.
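As a rough sketch of that selection rule: the three weights below are the ones published for the self-evaluation domain, while the confidence values, the function name `highest_leverage`, and the 0.5 cutoff for "failing" are assumptions made for illustration. The text does not state where the pass/fail line sits.

```python
def highest_leverage(weights, confidences, fail_below=0.5):
    """Return the failing predicate with the largest weight, or None.

    fail_below is a hypothetical cutoff; the real threshold is not given.
    """
    failing = [name for name in weights if confidences[name] < fail_below]
    if not failing:
        return None
    return max(failing, key=lambda name: weights[name])


# Weights quoted in the text (a subset of the domain, so they do not sum to 1).
weights = {
    "customer_validation": 0.18,
    "demand_signal": 0.16,
    "risk_identification": 0.06,
}
# Invented confidence values for one hypothetical document.
confidences = {
    "customer_validation": 0.2,
    "demand_signal": 0.7,
    "risk_identification": 0.1,
}

print(highest_leverage(weights, confidences))  # customer_validation
```

Risk identification has the lowest confidence here, but customer validation is returned because it carries three times the weight.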

What a score of 34 means

A score of 34/100 means the posterior probability is approximately 0.34 — given the evidence in the document, there is a 34% chance the subject meets the stated criterion. Some predicates are passing: those represent real evidence that the criterion is met. Some are failing: those represent genuine gaps.

This number is not a pass or fail threshold. It is a calibrated estimate. The grade (D for scores in the 40–54 range, F below 40) is a human-readable translation of the posterior into a familiar scale, but the score itself is the meaningful output.
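The translation from score to posterior and grade might look like the sketch below. Only the D and F bands are stated in the text, so the function returns `None` for anything higher. Treating the "40–54" band as running up to, but not including, 55 is an assumption, and the function names are hypothetical.

```python
def posterior(score):
    """Read a 0-100 score as a probability between 0 and 1."""
    return score / 100


def letter_grade(score):
    """Map a score to the two grade bands the text states."""
    if score < 40:
        return "F"
    if score < 55:  # assumption: fractional scores up to 55 still count as D
        return "D"
    return None  # bands above D are not given in the text


print(letter_grade(34), posterior(34))  # F 0.34
```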

A score of 34 with three high-weight predicates failing is different from a score of 34 with eight low-weight predicates failing. The first has a small number of identifiable levers. The second has diffuse gaps. The findings section of the report shows which situation you are in.

What the score does not measure

The score does not measure the quality of the idea, the capability of the team, or the potential of the business. It measures the evidence available in the document at the time of evaluation.

Amazon in 1995 would have scored an F. The score would have correctly identified that there was no formal demand signal, no documented unit economics, and no articulated go-to-market plan. These were real gaps — not disqualifying ones, but real. The score would have pointed to exactly the work that Bezos eventually did.

A score is a snapshot of the evidence state at one moment. As evidence improves — as customer interviews are conducted, demand signals are captured, acquisition channels are tested — the score should improve. If it does not, the evidence is not as strong as it seems, or the predicate definitions need refinement.
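Because the score is a weighted sum, the expected movement from improving one predicate follows directly from the formula: weight × (new confidence − old confidence) × 100. A small sketch, using the published 18% weight for customer validation and invented before-and-after confidence values:

```python
def score_change(weight, old_confidence, new_confidence):
    """Points gained or lost when one predicate's confidence changes."""
    return 100 * weight * (new_confidence - old_confidence)


# Hypothetical: customer interviews lift confidence from 0.1 to 0.8.
print(round(score_change(0.18, 0.1, 0.8), 1))  # 12.6
```

If new evidence is added and the confidence value does not move, the change is zero, which is the signal the paragraph above says to investigate.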

The right way to read a failing grade

An F grade is a repair map. It shows exactly which predicates are failing, in which direction, and by how much. Closing the gap on the highest-weight failing predicate is the single highest-leverage action available to you right now.

Founders who treat a failing score as a verdict are using it wrong. Founders who treat it as a priority list — and then act on the highest item — are using it correctly. The score does not tell you whether to proceed. It tells you what evidence you need to gather before that question becomes answerable.

Get your calibrated score

Paste your document and Bayescore returns a posterior probability score, letter grade, and the single highest-leverage gap to close.
