Thinking about Bayesian evaluation.
Predicate design, evidence-based scoring, and what the data actually says.
We ran IS(current LLM, clever) through Bayescore's own scoring framework. Five predicates. Each evaluated against the published literature and observed model behavior. Final score: 23/100, Grade F.
The model that writes best is not necessarily the model that reasons best. Fluency optimizes for text that sounds right. Cleverness requires text that is right: calibrated to evidence, resistant to social pressure, sensitive to what is absent.
Fluency is not cleverness. A model that produces confident, well-structured text is demonstrating next-token prediction. Cleverness is something harder: calibrated belief updating under evidence.
Most evaluation questions fail before the LLM sees them. The problem is the question design: ambiguous, unfalsifiable, or scoped so broadly that any answer is defensible.
The score is not a rating. It is the output of a locked formula, Σ(weight × confidence) × 100, where confidence is assigned per predicate based on what evidence the document actually contains.
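To make the formula concrete, here is a minimal Python sketch of Σ(weight × confidence) × 100. It is not Bayescore's actual implementation. The `Predicate` class, the `weighted_score` function, the requirement that weights sum to 1, the 0-to-1 confidence range, and the example predicates and numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Predicate:
    """One scored predicate (hypothetical structure, not Bayescore's own)."""
    name: str
    weight: float      # share of the total score; assumed to sum to 1 across predicates
    confidence: float  # assumed to lie in [0, 1], assigned from the evidence


def weighted_score(predicates):
    """Return sum(weight * confidence) * 100 for the given predicates."""
    total_weight = sum(p.weight for p in predicates)
    if not math.isclose(total_weight, 1.0):
        raise ValueError(f"weights must sum to 1, got {total_weight}")
    for p in predicates:
        if not 0.0 <= p.confidence <= 1.0:
            raise ValueError(f"confidence out of range for {p.name!r}")
    return 100 * sum(p.weight * p.confidence for p in predicates)


# Illustrative predicates and values only; they are not taken from any real evaluation.
example = [
    Predicate("calibrated to evidence", weight=0.5, confidence=0.2),
    Predicate("resistant to social pressure", weight=0.3, confidence=0.4),
    Predicate("sensitive to what is absent", weight=0.2, confidence=0.1),
]

print(round(weighted_score(example), 1))  # 24.0
```

Because the weights are fixed in advance, the only inputs that can move the result are the per-predicate confidence values, which is the sense in which the formula is "locked".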