Blog

Thinking about Bayesian evaluation.

Predicate design, evidence-based scoring, and what the data actually says.

2026-06-04 · 8 min read

IS(current LLM, clever): A Formal Evaluation

We ran IS(current LLM, clever) through Bayescore's own scoring framework. Five predicates. Each evaluated against the published literature and observed model behavior. Final score: 23/100, Grade F.

2026-06-01 · 7 min read

Fluent vs. Clever: Why the Best-Writing AI Is Not the Best-Reasoning AI

The model that writes best is not necessarily the model that reasons best. Fluency optimizes for text that sounds right. Cleverness requires text that is right: calibrated to evidence, resistant to social pressure, and sensitive to what is absent.

2026-05-29 · 6 min read

What Is AI Cleverness?

Fluency is not cleverness. A model that produces confident, well-structured text is demonstrating next-token prediction. Cleverness is something harder: calibrated belief updating under evidence.

2026-05-24 · 5 min read

The Predicate Problem: Why Most Evaluation Questions Cannot Be Answered

Most evaluation questions fail before the LLM sees them. The problem is question design: predicates that are ambiguous, unfalsifiable, or scoped so broadly that any answer is defensible.

2026-05-24 · 6 min read

What Your Score Is Actually Measuring

The score is not a rating. It is the output of a locked formula, Σ(weight × confidence) × 100, where confidence is assigned per predicate based on the evidence the document actually contains.
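To make the formula concrete, here is a minimal Python sketch of Σ(weight × confidence) × 100. It is an illustration under stated assumptions, not Bayescore's actual implementation: the `Predicate` class, the `weighted_score` function, and the example predicate names and numbers are all hypothetical. The sketch also assumes that weights sum to 1 and that each confidence lies in [0, 1], which is what keeps the result on a 0 to 100 scale; the teaser does not state either constraint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    """One evaluation predicate (hypothetical structure for illustration)."""
    name: str
    weight: float      # assumed: weights across all predicates sum to 1.0
    confidence: float  # assumed: a value in [0, 1] assigned from the evidence


def weighted_score(predicates):
    """Return sum(weight * confidence) * 100 over the given predicates."""
    total_weight = sum(p.weight for p in predicates)
    # Assumption of this sketch: weights are normalized so the score tops out at 100.
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"weights must sum to 1.0, got {total_weight}")
    for p in predicates:
        if not 0.0 <= p.confidence <= 1.0:
            raise ValueError(f"confidence for {p.name!r} must be in [0, 1]")
    return sum(p.weight * p.confidence for p in predicates) * 100


# Made-up predicates and values, chosen only to show the arithmetic.
example = [
    Predicate("calibrated", weight=0.5, confidence=0.4),
    Predicate("resists_pressure", weight=0.3, confidence=0.1),
    Predicate("notices_absence", weight=0.2, confidence=0.9),
]

# 0.5*0.4 + 0.3*0.1 + 0.2*0.9 = 0.41, so the score is about 41 out of 100.
print(round(weighted_score(example), 1))
```

Because the formula is a plain weighted sum, a single heavily weighted predicate with low confidence can pull the total down sharply, which is one way a document can read well and still score poorly.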
