IS(current LLM, clever): A Formal Evaluation

The evaluation setup

We define AI cleverness as calibrated belief updating under evidence. The root hypothesis is IS(current LLM, clever): does the current generation of large language models satisfy this criterion?

We evaluate this hypothesis using five predicates, each weighted by its leverage on the definition. The evidence base is the published literature on LLM calibration, behavior under adversarial conditions, and performance on structured evaluation tasks — plus observed behavior in production systems.

This is a genuine evaluation using Bayescore's own method. The predicates are specific and falsifiable. Absence of evidence against a predicate does not count in favor of it. The scoring formula is the same locked formula used for every evaluation: Σ(weight × confidence) × 100.

Predicate 1: Prior representation (20% weight)

Question: Does the model maintain calibrated priors — beliefs proportional to the actual frequency and reliability of what it was trained on?

Evidence for: Models trained on large corpora demonstrate reasonable calibration on high-frequency factual claims. For well-documented historical facts, scientific consensus positions, and common knowledge claims, confidence levels expressed in model outputs correlate meaningfully with accuracy.

Evidence against: Calibration degrades sharply on low-frequency facts, recently changed information, niche domains, and any claim that requires reasoning across multiple training examples rather than direct recall. The Kadavath et al. (2022) study on language model self-assessment found systematic overconfidence on novel reasoning tasks. Srivastava et al. (2022) on BIG-bench found that models frequently express high confidence on tasks they perform near chance.

Confidence score: 0.15 — Partial credit only. High-frequency prior representation is passable. Low-frequency, novel, and cross-domain prior representation is not calibrated.

Predicate 2: Evidence extraction (20% weight)

Question: Does the model accurately extract relevant evidence from documents and correctly identify what is absent?

Evidence for: Models demonstrate genuine competence at extractive tasks — finding named entities, locating relevant passages, identifying explicit claims. On documents with clearly present or clearly absent features, models perform well at extraction tasks when given specific, binary questions.

Evidence against: Performance degrades on tasks requiring identification of absence — noting what should be present but is not. Models systematically underweight absent evidence, treating "not found" as neutral rather than informative. This is the absence blindness failure mode documented in the adversarial NLP literature. Performance also degrades on documents with implicit or indirect evidence, where the evidence must be inferred rather than extracted.

Confidence score: 0.40 — Meaningful capability. Present evidence extraction is genuinely good. Absence identification and inference-dependent extraction are not.

Predicate 3: Correct belief updating (25% weight)

Question: Does the model update beliefs in the correct direction and magnitude when presented with new evidence?

Evidence for: In controlled settings, models demonstrate the capacity to update beliefs when presented with explicit, contradicting evidence. A model that asserts X and is then shown a document stating not-X can acknowledge the update. This is the basic competence of the capability.

Evidence against: The sycophancy literature is extensive and consistent. Perez et al. (2022), Sharma et al. (2023), and multiple subsequent studies document that models update their expressed beliefs in response to social signals — user disagreement, emotional tone, apparent expertise claims — in ways that are not proportional to the evidential content of those signals. The magnitude of sycophantic updating is often comparable to the magnitude of genuine evidential updating. This means you cannot reliably distinguish "model changed position because evidence warranted it" from "model changed position because you pushed back." That is a fundamental failure of the predicate.

Confidence score: 0.10 — Near zero. The sycophancy failure is structural. It is not a prompt engineering problem. It is a consequence of RLHF training on human preference data, where human raters tend to prefer responses that agree with them over responses that are correct. The training signal creates an incentive for sycophantic updating that competes with evidential updating.
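To make "correct direction and magnitude" concrete, here is a minimal sketch of an update in the odds form of Bayes' rule. The function name `update` and every number in it are illustrative assumptions, not part of Bayescore. The point is that a signal with a likelihood ratio of 1, such as a user simply pushing back, should leave the belief where it was, while a reliable contradicting document should move it.

```python
def update(prior, likelihood_ratio):
    """Posterior probability from a prior, using the odds form of Bayes' rule.

    likelihood_ratio = P(signal | claim true) / P(signal | claim false).
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


belief = 0.80  # hypothetical starting confidence in claim X

# Assumption: bare user disagreement is equally likely whether X is true or
# false, so it carries no evidential weight (likelihood ratio of 1).
after_pushback = update(belief, likelihood_ratio=1.0)

# Assumption: a reliable document stating not-X is nine times more likely
# to exist if X is false than if X is true.
after_document = update(belief, likelihood_ratio=1 / 9)

print(f"after pushback: {after_pushback:.3f}")
print(f"after document: {after_document:.3f}")
```

A sycophantic model behaves as if the pushback carried a likelihood ratio well below 1, which is the failure this predicate measures.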

Predicate 4: Absence sensitivity (20% weight)

Question: Does the model treat the absence of expected evidence as evidence against, rather than as neutral?

Evidence for: When explicitly instructed to treat absence as informative — as Bayescore does in its adversarial pass prompt — models can apply this correctly in specific, structured evaluation tasks. The capability exists at the level of following instructions.

Evidence against: Without explicit instruction, models systematically fail this predicate. The default behavior is to report what is present and either omit or score neutrally what is absent. This is consistent with the next-token prediction training objective: the model learns to produce text about what is there, not about what is not there. Absence is not a token. It is not in the training distribution in the same way presence is.

Confidence score: 0.20 — Conditional capability. With explicit prompting, this can be activated. As an intrinsic capability deployed without architectural support, it fails reliably.
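Why absence counts as evidence can be shown with a short Bayes' rule sketch. The function name and the probabilities are hypothetical, chosen only to illustrate the mechanism: if a feature would usually be present when a hypothesis is true, failing to find it should lower belief in the hypothesis. Absence is neutral only when the feature is equally likely either way.

```python
def posterior_after_absence(prior, p_seen_if_true, p_seen_if_false):
    """P(hypothesis | expected evidence was NOT found), by Bayes' rule."""
    miss_if_true = 1.0 - p_seen_if_true
    miss_if_false = 1.0 - p_seen_if_false
    numerator = miss_if_true * prior
    return numerator / (numerator + miss_if_false * (1.0 - prior))


prior = 0.50  # hypothetical starting belief

# Evidence that is expected when the hypothesis is true (90%) but less
# common otherwise (30%): its absence is informative.
informative = posterior_after_absence(prior, p_seen_if_true=0.90, p_seen_if_false=0.30)

# Evidence that is equally likely either way: its absence is neutral.
neutral = posterior_after_absence(prior, p_seen_if_true=0.50, p_seen_if_false=0.50)

print(f"informative absence: {informative:.3f}")
print(f"neutral absence:     {neutral:.3f}")
```

Treating every "not found" as the second case is the default behavior described above.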

Predicate 5: Calibration (15% weight)

Question: Are the model's expressed confidence levels reliable indicators of actual accuracy across diverse tasks?

Evidence for: Guo et al. (2017) and subsequent work showed that modern neural networks, when temperature-scaled, can achieve reasonable calibration on in-distribution classification tasks. More recent work shows that for simple factual questions with clear right/wrong answers, model verbalized confidence levels have some predictive value.

Evidence against: Calibration degrades systematically in out-of-distribution settings, open-ended generation, and complex reasoning tasks. The models that express the highest confidence tend to have the worst calibration on hard problems — an inverted relationship that makes high-confidence outputs the least trustworthy. Xiong et al. (2023) find that LLM calibration is particularly poor on questions requiring multi-step reasoning. The verbalized confidence in model outputs ("I'm fairly confident," "I believe") does not reliably track actual accuracy.

Confidence score: 0.35 — Some signal, not reliable signal. Calibration exists for simple factual tasks. It does not generalize to the evaluation tasks where it would matter most.
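One common way to quantify this kind of reliability is expected calibration error (ECE), the metric used in Guo et al. (2017). The sketch below is a simplified equal-width-bin version, and the toy data is hypothetical. It is meant to show what "confidence tracks accuracy" means numerically, not to reproduce any published result.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Average |accuracy - confidence| over confidence bins, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Hypothetical toy data: the model states 90% confidence four times but is
# right only twice, and states 60% confidence twice and is right once.
stated = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
was_right = [True, False, False, True, True, False]

ece = expected_calibration_error(stated, was_right)
print(f"ECE = {ece:.3f}")
```

A perfectly calibrated model would score 0. In this toy data the 90% answers are right only half the time, so most of the error comes from the high-confidence bin.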

The score

Applying the locked formula:

score = (0.20 × 0.15) + (0.20 × 0.40) + (0.25 × 0.10) + (0.20 × 0.20) + (0.15 × 0.35)

= 0.030 + 0.080 + 0.025 + 0.040 + 0.053

= 0.228 × 100 = 23/100, Grade F

The grade thresholds: A≥85, B≥70, C≥55, D≥40, F<40. 23 is a Grade F.
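The arithmetic above can be rechecked with a few lines of Python. This is a minimal sketch, not Bayescore's implementation: the weights, confidence scores, and grade thresholds are taken from the text, while the dictionary layout and function names are assumptions made for illustration.

```python
# (weight, confidence) per predicate, as stated in the predicate sections.
PREDICATES = {
    "prior_representation": (0.20, 0.15),
    "evidence_extraction": (0.20, 0.40),
    "belief_updating": (0.25, 0.10),
    "absence_sensitivity": (0.20, 0.20),
    "calibration": (0.15, 0.35),
}

# Grade thresholds from the text: A>=85, B>=70, C>=55, D>=40, otherwise F.
THRESHOLDS = [(85, "A"), (70, "B"), (55, "C"), (40, "D")]


def weighted_score(predicates):
    """Return sum(weight * confidence) * 100, unrounded."""
    return 100 * sum(weight * conf for weight, conf in predicates.values())


def grade(score):
    """Map a 0-100 score to a letter grade."""
    for cutoff, letter in THRESHOLDS:
        if score >= cutoff:
            return letter
    return "F"


score = weighted_score(PREDICATES)
print(f"{score:.2f} -> {round(score)}/100, Grade {grade(score)}")
```

The unrounded total is about 22.75, which rounds to the 23/100 reported above.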

The highest-leverage gap

The formula identifies the highest-leverage improvement: the predicate where (weight × gap_to_threshold) is largest. The gap-to-threshold for Grade F is the gap to 40, which is about 17 points here. The highest single contributor to closing that gap is Predicate 3 (correct belief updating): at 25% weight with a confidence score of only 0.10, it has 22.5 points of unused headroom, more than any other predicate.
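The ranking can be sketched in code. The text does not spell out how gap_to_threshold is computed per predicate, so the sketch below is an assumption: it treats the gap as the shortfall of a predicate's confidence relative to a target, and tries two plausible targets (a perfect 1.0 and the Grade D cutoff of 0.40). The function names are hypothetical.

```python
# (weight, confidence) per predicate, as stated in the sections above.
PREDICATES = {
    "P1 prior representation": (0.20, 0.15),
    "P2 evidence extraction": (0.20, 0.40),
    "P3 belief updating": (0.25, 0.10),
    "P4 absence sensitivity": (0.20, 0.20),
    "P5 calibration": (0.15, 0.35),
}


def leverage(weight, confidence, target):
    """Weighted shortfall of one predicate relative to a target confidence."""
    return weight * max(0.0, target - confidence)


def rank_by_leverage(predicates, target):
    """Return predicate names sorted from highest to lowest leverage."""
    return sorted(
        predicates,
        key=lambda name: leverage(*predicates[name], target),
        reverse=True,
    )


# Assumption: two plausible readings of "gap_to_threshold".
for target in (1.0, 0.40):
    ranking = rank_by_leverage(PREDICATES, target)
    print(f"target={target:.2f}: highest leverage is {ranking[0]}")
```

Under both readings Predicate 3 comes out on top, so the conclusion does not depend on which definition is intended.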

Predicate 3 is also structurally the hardest to fix. Because sycophancy stems from RLHF training on human preference data rather than from prompting, fixing it requires either retraining with different signals or architectural changes that separate the training objective for factual accuracy from the training objective for conversational satisfaction. Neither of those is a small change.

This is why the architecture of the system matters more than the capability of the model. A clever architecture routes around the model's failure modes. A naive architecture inherits them.

What a passing score would require

To reach Grade C (55/100), the overall weighted confidence would need to reach 0.55. Predicate 3 alone cannot get there. Holding all other predicates constant, raising it from 0.10 to a perfect 1.0 adds only 0.225, for a score of 45/100, Grade D. Grade C therefore requires large gains on several predicates at once, including a structural change in how models are trained.

Alternatively: Predicate 2 (evidence extraction) moving from 0.40 to 1.0, combined with Predicate 4 (absence sensitivity) moving from 0.20 to 0.80, would add 0.24 to the score, bringing it to 47/100, Grade D. Still not passing, but a meaningful improvement achievable through architectural prompting and two-pass design rather than model retraining.
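The what-if arithmetic in this section can be rechecked with a small helper that overrides individual confidence scores and reapplies the same formula. This is a sketch under stated assumptions: the weights, baseline confidences, and grade thresholds come from the text, while the short predicate keys and function names are hypothetical.

```python
# (weight, confidence) per predicate, from the sections above.
BASELINE = {
    "P1": (0.20, 0.15),
    "P2": (0.20, 0.40),
    "P3": (0.25, 0.10),
    "P4": (0.20, 0.20),
    "P5": (0.15, 0.35),
}

THRESHOLDS = [(85, "A"), (70, "B"), (55, "C"), (40, "D")]


def grade(score):
    """Map a 0-100 score to a letter grade; anything under 40 is F."""
    for cutoff, letter in THRESHOLDS:
        if score >= cutoff:
            return letter
    return "F"


def what_if(overrides):
    """Score with some confidences replaced, all weights unchanged."""
    return 100 * sum(
        weight * overrides.get(name, conf)
        for name, (weight, conf) in BASELINE.items()
    )


scenarios = {
    "baseline": {},
    "P3 raised to 1.0": {"P3": 1.0},
    "P2 raised to 1.0 and P4 to 0.80": {"P2": 1.0, "P4": 0.80},
}

for label, overrides in scenarios.items():
    result = what_if(overrides)
    print(f"{label}: {result:.2f} -> Grade {grade(result)}")
```

Running it prints the unrounded score and grade for each scenario, which makes it easy to test other combinations of predicate improvements.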

This is the design decision Bayescore makes. The adversarial pass directly addresses Predicates 2 and 4 — it instructs the model to explicitly look for what is absent, and it provides a structured output format that reduces the room for sycophantic softening of findings. The model still has its failure modes. The architecture works around the ones that matter most for this task.

The model scores F. The architecture scores higher.

Bayescore is designed to route around LLM failure modes — using the model for extraction, probability theory for scoring. Drop your document and see the difference.
