Documentation
Everything you need to connect your tools, run confidence-gated pipelines, and verify agent output.
Getting Started
BayesCore is a hosted service — no installation, no account required for your first evaluation. Go to bayescore.com/bayescore and paste any structured document. That's the entire setup.
Your first evaluation
The flow is three steps:
- State your intent or paste your input. Type the task, or paste the content you want the kernel to verify.
- The kernel routes your intent. BayesCore reads the document's implicit evaluation structure — the questions any rigorous evaluator would derive from the document itself — and surfaces a root hypothesis and weighted predicates. You can review and edit before saving.
- Pipeline runs with confidence gates. Each step is gated on the agent's belief state before it executes. Confidence, gate decision, and trace are returned for every step.
Absence of evidence reduces confidence toward zero — the kernel does not paper over missing evidence.
API access
The scoring engine is available as a REST API. See the API Reference section below for endpoint documentation and example payloads. Programmatic access requires an API key — contact [email protected].
How Scoring Works
BayesCore's evaluation architecture is a directed acyclic graph of predicate nodes — a structure informed by Bayesian network theory (Pearl, 1988). Each domain encodes one root hypothesis in IS(subject, criterion) notation with a set of weighted binary predicates. Weights are fixed per domain and set by the domain definition. No empirically derived priors are used; the document evidence alone sets confidence per predicate.
The LLM runs two passes over your document:
- Supportive pass: extracts all evidence that a predicate is satisfied.
- Adversarial pass: looks for counter-evidence, gaps, and missing information.
The scoring formula is locked: score = Σ(weight × confidence) × 100, where each domain's weights are fractions that sum to 1 (100%). No editorializing. No bonus points. The LLM sets confidence. Weights are fixed.
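The formula can be sketched in a few lines of Python. This is a minimal illustration, not BayesCore's own code: the function name `weighted_score` is hypothetical, the weights are copied from the task-output table on this page, and two points are assumptions: confidence lies in the range 0 to 1, and a predicate with no reported confidence counts as 0.

```python
from typing import Mapping

# Weights copied from the task-output table on this page, as fractions of 1.0.
TASK_OUTPUT_WEIGHTS = {
    "central_claim": 0.18,
    "evidence_support": 0.16,
    "scope_defined": 0.14,
    "assumptions_stated": 0.14,
    "success_criteria": 0.12,
    "risks_acknowledged": 0.12,
    "next_steps": 0.08,
    "internal_consistency": 0.06,
}


def weighted_score(weights: Mapping[str, float],
                   confidence: Mapping[str, float]) -> float:
    """Return sum(weight * confidence) * 100 over all predicates."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    total = 0.0
    for name, weight in weights.items():
        # Assumption: a predicate without a reported confidence counts as 0.
        c = confidence.get(name, 0.0)
        if not 0.0 <= c <= 1.0:
            raise ValueError(f"confidence for {name!r} must be in [0, 1]")
        total += weight * c
    return total * 100.0


# Only the central claim is fully evidenced; everything else is missing.
print(weighted_score(TASK_OUTPUT_WEIGHTS, {"central_claim": 1.0}))
```

Because the weights are fixed, the only way to move the score is to change the per-predicate confidence, which is the property the formula is meant to guarantee.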
Grade bands
| Score | Grade | Meaning |
|---|---|---|
| 85–100 | A | Strong evidence across all predicates |
| 70–84 | B | Good evidence with minor gaps |
| 55–69 | C | Mixed evidence, significant gaps |
| 40–54 | D | Weak evidence, multiple failing predicates |
| 0–39 | F | Insufficient evidence — document fails to demonstrate the criterion |
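The bands above map to a simple threshold lookup. The sketch below is illustrative only: `grade_for` is a hypothetical name, and treating each band's lower bound as an inclusive floor (so a fractional score such as 84.5 falls in B) is an assumption, since the table lists whole numbers only.

```python
def grade_for(score: float) -> str:
    """Map a 0-100 score to a letter grade using the bands in the table above."""
    if not 0.0 <= score <= 100.0:
        raise ValueError("score must be between 0 and 100")
    # Assumption: the lower bound of each band is an inclusive floor.
    for floor, grade in ((85, "A"), (70, "B"), (55, "C"), (40, "D")):
        if score >= floor:
            return grade
    return "F"


print(grade_for(34))  # the example response in the API Reference pairs 34 with "F"
```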
Scientific basis: Bayes (1763), Cox (1946), Pearl (1988, 2000), Jaynes (2003).
Domains
A domain is a root hypothesis plus a set of weighted predicates. BayesCore ships with one built-in domain. All other domains are created via DNA extraction from your own artifacts.
task-output (built-in)
IS(output, verified)
The only built-in domain. Evaluates any structured text against eight universal soundness criteria — works on agent outputs, research notes, product specs, READMEs, security policies, contracts, and any other structured artifact.
| Predicate | Weight | Question |
|---|---|---|
| central_claim | 18% | Is the central claim or thesis explicitly and specifically stated in the document? |
| evidence_support | 16% | Is the central claim supported by concrete, verifiable evidence present in the document? |
| scope_defined | 14% | Is the intended audience, scope, or use case of the document clearly stated? |
| assumptions_stated | 14% | Are the key assumptions underlying the argument or proposal made explicit? |
| success_criteria | 12% | Are success criteria, desired outcomes, or measurable goals defined? |
| risks_acknowledged | 12% | Are known risks, limitations, or failure modes explicitly acknowledged? |
| next_steps | 8% | Is there a clear next step, recommendation, or call to action? |
| internal_consistency | 6% | Is the argument internally consistent with no contradictory claims? |
domain_key: task-output — use this key in API calls.
Custom Domains
Any structured text can become an evaluation domain. Drop an agent output — BayesCore extracts IS(output, verified). Drop a product spec — it extracts IS(product, claims_supported). The extracted domain is reusable: once extracted from one artifact, it applies to all artifacts of the same class.
Custom domains are identified by a share_uuid returned when the domain is saved. Pass this UUID as domain_key in API calls. Use GET /api/domains to list your saved domains and retrieve their UUIDs.
DNA extraction runs at temperature 0. The same class of artifact produces the same predicate structure on every extraction.
API Reference
Extract evaluation DNA from any structured artifact. Returns a root hypothesis and ranked predicates. Run this first — the result can be saved as a reusable domain.
Request body:
{
  "text": "Full text of the artifact..."
}
Response:
{
  "name": "Grant Application Evaluation",
  "root_hypothesis": "IS(output, verified)",
  "predicates": [{ "question": "...", "importance": "critical" }]
}
importance values: critical, high, medium, low — converted to weights on save.
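This page does not specify how importance labels become weights. Purely as an illustration of one way such a conversion could work, the sketch below assigns each label a point value and normalises so the weights sum to 1. The point values (4, 3, 2, 1) and the function name `importance_to_weights` are invented for this example and should not be read as BayesCore's actual mapping.

```python
from typing import Dict, List

# Hypothetical point values; the real conversion is not documented here.
IMPORTANCE_POINTS = {"critical": 4, "high": 3, "medium": 2, "low": 1}


def importance_to_weights(predicates: List[Dict[str, str]]) -> List[float]:
    """Normalise per-predicate importance points into weights that sum to 1."""
    if not predicates:
        raise ValueError("at least one predicate is required")
    points = [IMPORTANCE_POINTS[p["importance"]] for p in predicates]
    total = sum(points)
    return [pt / total for pt in points]


extracted = [
    {"question": "...", "importance": "critical"},
    {"question": "...", "importance": "low"},
]
print(importance_to_weights(extracted))
```

Whatever mapping is used in practice, normalising to a total of 1 keeps the score on the same 0 to 100 scale as the built-in domain.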
Run an evaluation on a document. Use task-output for the built-in domain, or a custom domain's share_uuid for user-created domains.
Request body:
{
  "domain_key": "task-output", // or custom share_uuid
  "document": "Full text of the artifact..."
}
Response:
{
  "domain_key": "task-output",
  "score": 34,
  "grade": "F",
  "summary": "...",
  "findings": [...],
  "highest_leverage_gaps": [...]
}
Optional header: X-AIOS-API-Key: your-key
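A request with this payload and header can be assembled with Python's standard library, as in the sketch below. The endpoint URL is deliberately left as a parameter because the path is not given here, the placeholder URL in the usage line is not a real address, and sending the body as JSON with a `Content-Type: application/json` header is an assumption. The helper name `build_evaluation_request` is hypothetical.

```python
import json
import urllib.request
from typing import Optional


def build_evaluation_request(endpoint_url: str,
                             document: str,
                             domain_key: str = "task-output",
                             api_key: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the evaluation payload."""
    payload = {"domain_key": domain_key, "document": document}
    headers = {"Content-Type": "application/json"}  # assumption: JSON body
    if api_key:
        headers["X-AIOS-API-Key"] = api_key
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Placeholder URL: substitute the real evaluation endpoint.
req = build_evaluation_request("https://example.invalid/PLACEHOLDER",
                               "Full text of the artifact...",
                               api_key="your-key")
print(req.get_method(), json.loads(req.data)["domain_key"])
```

Passing the returned object to `urllib.request.urlopen` would send it; the response body can then be decoded with `json.loads` to read `score`, `grade`, and the other fields shown above.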
Returns the list of available domain manifests.
Returns the most recent scan result per domain — score, grade, timestamp.
Paginated scan history. Query param: ?page=1
Desktop App — Bayesian Kernel
The BayesCore desktop app runs a local Bayesian kernel — a five-module Python engine that implements the full Bayesian cycle: prior → likelihood → posterior → action → observe → update. The kernel is grounded in Bayes' theorem: P(H|E) = P(E|H) · P(H) / P(E).
Every component maps to one structural link in the theorem. None of this is metaphorical: the code computes P(H|E) via Beta distributions and the conjugate update rule.
BeliefState — belief.py
P(H): the prior distribution, maintained per task type
The kernel stores a distinct Beta(α, β) distribution for each key: task type, intent, routing target. New keys initialise at α = 1.0, β = 1.0 (the uninformative prior: 50/50, maximum uncertainty). The posterior mean is α / (α + β).
The conjugate update rule fires on every outcome:
belief_state.observe("document_eval", success=True) # α += 1
belief_state.observe("document_eval", success=False) # β += 1
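For readers who want to see the arithmetic, here is a minimal stand-in for the behaviour described above. It is a sketch, not the contents of belief.py: the class name `BeliefStateSketch` and its internal layout are assumptions, while `observe`, `p_success`, the Beta(1, 1) starting point, and the α / (α + β) mean follow the text.

```python
from typing import Dict, List


class BeliefStateSketch:
    """One Beta(alpha, beta) pair per key, updated by the conjugate rule."""

    def __init__(self) -> None:
        self._params: Dict[str, List[float]] = {}  # key -> [alpha, beta]

    def _get(self, key: str) -> List[float]:
        # New keys start at the uninformative prior Beta(1, 1).
        return self._params.setdefault(key, [1.0, 1.0])

    def observe(self, key: str, success: bool) -> None:
        params = self._get(key)
        if success:
            params[0] += 1.0  # alpha += 1
        else:
            params[1] += 1.0  # beta += 1

    def p_success(self, key: str) -> float:
        alpha, beta = self._get(key)
        return alpha / (alpha + beta)  # posterior mean


belief_state = BeliefStateSketch()
print(belief_state.p_success("document_eval"))  # 0.5 before any observation
belief_state.observe("document_eval", success=True)
belief_state.observe("document_eval", success=True)
belief_state.observe("document_eval", success=False)
print(belief_state.p_success("document_eval"))  # 0.6 after two successes, one failure
```

The update is a plain counter increment because the Beta distribution is conjugate to a success/failure outcome: the posterior after an observation is again a Beta distribution with one pseudocount added.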
At every session boundary, a forgetting factor decays pseudocounts toward the uninformative prior — preventing early observations from permanently dominating:
α_new = 1.0 + (α - 1.0) × 0.9
β_new = 1.0 + (β - 1.0) × 0.9
Belief state is serialised to disk (JSON) and survives app restarts. This is the kernel's compounding moat: the more you use it, the more calibrated P(success | task_type) becomes.
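The decay step and the JSON round trip can be sketched together. The 0.9 factor and the decay formula come from the text above; the JSON layout (a mapping from key to `alpha` and `beta`) and the function names are assumptions made for illustration, since the on-disk format is not described here.

```python
import json
from typing import Dict

FORGETTING_FACTOR = 0.9  # value used in the decay formulas above

BeliefParams = Dict[str, Dict[str, float]]


def decay_toward_prior(state: BeliefParams,
                       factor: float = FORGETTING_FACTOR) -> BeliefParams:
    """Shrink each pseudocount's excess over 1.0, pulling it toward Beta(1, 1)."""
    return {
        key: {
            "alpha": 1.0 + (p["alpha"] - 1.0) * factor,
            "beta": 1.0 + (p["beta"] - 1.0) * factor,
        }
        for key, p in state.items()
    }


def to_json(state: BeliefParams) -> str:
    # Assumed layout: {"<key>": {"alpha": ..., "beta": ...}}
    return json.dumps(state, sort_keys=True)


def from_json(text: str) -> BeliefParams:
    return json.loads(text)


state = {"document_eval": {"alpha": 11.0, "beta": 3.0}}
after_session = decay_toward_prior(state)
restored = from_json(to_json(after_session))
print(restored["document_eval"])
```

With no new observations, repeated decay moves every key back toward Beta(1, 1), so stale evidence gradually loses its influence on P(success | task_type).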
IntentEngine — intent.py
P(H|E): posterior intent distribution after prior correction
The IntentEngine applies Bayes' theorem to intent routing. The LLM at temperature=0 produces a calibrated probability distribution over intents, which the engine treats as the likelihood P(E|H). The engine multiplies each intent's likelihood by its prior P(H) from BeliefState, renormalises, and produces the posterior P(H|E):
posterior[intent] = llm_probability[intent] × belief_state.p_success(intent)
posterior = normalise(posterior) # sum to 1.0
Action is gated by INTENT_COMMIT_THRESHOLD = 0.72. If the top posterior is at least 0.72, the kernel executes. If not, it asks a minimal clarifying question designed to maximise entropy reduction over the ambiguous intents, rather than an open-ended “what do you mean?”
if top_posterior >= 0.72: # INTENT_COMMIT_THRESHOLD
    route_to_agent(top_intent)
else:
    ask_entropy_reducing_question(ambiguous_intents)
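The posterior computation and the commit gate can be combined into one runnable sketch. The threshold value and the multiply-then-normalise rule come from the text above; the function names, the return shape of `decide`, and the example intents and numbers are hypothetical.

```python
from typing import Dict, Tuple

INTENT_COMMIT_THRESHOLD = 0.72


def posterior_over_intents(llm_probability: Dict[str, float],
                           prior: Dict[str, float]) -> Dict[str, float]:
    """Multiply each intent's LLM probability by its prior, then normalise."""
    unnormalised = {i: p * prior[i] for i, p in llm_probability.items()}
    total = sum(unnormalised.values())
    if total <= 0.0:
        raise ValueError("posterior mass is zero; cannot normalise")
    return {i: v / total for i, v in unnormalised.items()}


def decide(posterior: Dict[str, float]) -> Tuple[str, str]:
    """Return ("commit", intent) at or above the threshold, else ("clarify", intent)."""
    top_intent = max(posterior, key=posterior.get)
    if posterior[top_intent] >= INTENT_COMMIT_THRESHOLD:
        return ("commit", top_intent)
    return ("clarify", top_intent)


llm = {"eval": 0.6, "summarise": 0.4}

# Moderate prior for "eval": posterior is about 0.706, below the threshold.
print(decide(posterior_over_intents(llm, {"eval": 0.8, "summarise": 0.5})))

# Stronger prior for "eval": posterior is about 0.730, so the kernel commits.
print(decide(posterior_over_intents(llm, {"eval": 0.9, "summarise": 0.5})))
```

The two calls use the same LLM output and differ only in the prior, which shows how accumulated outcomes in BeliefState can tip a borderline routing decision from a clarifying question to a commit.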
ProbabilisticScheduler — scheduler.py
EV-ranked execution order — same beliefs, different application
The scheduler orders tasks by expected value, using the same BeliefState posterior means that power the IntentEngine. This closes the coherence requirement: a single belief model determines both what to do and in what order.
EV = P(success | task_type) × utility - cost
# P(success) = belief_state.p_success(task_type)
# utility and cost are set at task creation
On task completion, the scheduler calls belief_state.observe(task_type, success) — automatically, without any manual rating step. This is what closes the Bayesian feedback loop end-to-end.
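A minimal sketch of EV ranking under the formula above follows. The `Task` dataclass, the function names, and the example numbers are hypothetical; falling back to 0.5 for a task type with no history mirrors the Beta(1, 1) prior described under BeliefState.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Task:
    task_id: str
    task_type: str
    utility: float  # set at task creation
    cost: float     # set at task creation


def expected_value(task: Task, p_success: float) -> float:
    """EV = P(success | task_type) * utility - cost."""
    return p_success * task.utility - task.cost


def rank_by_ev(tasks: List[Task], p_success_by_type: Dict[str, float]) -> List[Task]:
    """Return tasks ordered from highest to lowest expected value."""
    return sorted(
        tasks,
        # Unseen task types fall back to 0.5, the mean of Beta(1, 1).
        key=lambda t: expected_value(t, p_success_by_type.get(t.task_type, 0.5)),
        reverse=True,
    )


queue = [
    Task("task_a", "document_eval", utility=10.0, cost=2.0),  # EV = 0.5 * 10 - 2 = 3.0
    Task("task_b", "research", utility=6.0, cost=1.0),        # EV = 0.9 * 6 - 1 = 4.4
]
ranked = rank_by_ev(queue, {"research": 0.9})
print([t.task_id for t in ranked])
```

The lower-utility task runs first here because its task type has a much stronger track record, which is the effect of sharing one belief model between routing and scheduling.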
Feedback Loop
The moat: every outcome sharpens the belief model
The kernel captures implicit signals — not just explicit thumbs-up ratings. A follow-up question implies an incomplete result. A re-run implies the first result was insufficient. Moving to the next task implies success. These signals all route to observe().
The explicit feedback endpoint accepts structured signals from the Electron UI:
{
"task_id": "task_abc123",
"useful": true, // → observe(intent, success=True)
"intent_correct": true, // → observe(routing.{task_type}, success)
"result_quality": 0.85 // → observe(quality.{task_type}, partial credit)
}
Wrong routing (intent_correct=false) also penalises routing.{task_type}, so the kernel learns from misroutes, not just from task quality.
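One way to picture the mapping from a feedback payload to belief updates is the sketch below. The key names follow the comments in the payload above. Everything else is an assumption: the function name, passing `intent` and `task_type` in as arguments (the payload itself carries only `task_id`), and modelling partial credit as adding `result_quality` to α and the remainder to β.

```python
from typing import Any, Dict, List, Tuple

Update = Tuple[str, float, float]  # (belief key, alpha increment, beta increment)


def feedback_to_updates(feedback: Dict[str, Any],
                        intent: str,
                        task_type: str) -> List[Update]:
    """Translate one feedback payload into pseudocount increments per belief key."""
    updates: List[Update] = []
    if "useful" in feedback:
        ok = bool(feedback["useful"])
        updates.append((intent, 1.0 if ok else 0.0, 0.0 if ok else 1.0))
    if "intent_correct" in feedback:
        ok = bool(feedback["intent_correct"])
        # A misroute adds a failure to the routing key.
        updates.append((f"routing.{task_type}", 1.0 if ok else 0.0, 0.0 if ok else 1.0))
    if "result_quality" in feedback:
        q = float(feedback["result_quality"])
        if not 0.0 <= q <= 1.0:
            raise ValueError("result_quality must be in [0, 1]")
        # Assumption: partial credit splits one observation between alpha and beta.
        updates.append((f"quality.{task_type}", q, 1.0 - q))
    return updates


payload = {"task_id": "task_abc123", "useful": True,
           "intent_correct": False, "result_quality": 0.85}
for update in feedback_to_updates(payload, intent="evaluate_document",
                                  task_type="document_eval"):
    print(update)
```

In this example the result was useful and of good quality, yet the routing key still records a failure, so a misroute is learned from even when the output happened to be acceptable.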
Kernel API — main.py
The kernel runs as a local FastAPI server. The Electron app communicates with it over a loopback socket on an ephemeral port assigned at startup.
Submit an intent. Runs the full prior → infer → route → schedule → execute cycle. Returns the result if committed, or a clarification question if the posterior is below threshold.
{
  "user_input": "Evaluate this document for logical consistency",
  "document": "...", // optional
  "domain_key": "task-output", // optional
  "history": [...] // last 6 turns for clarification context
}
Submit outcome signals for a completed task. Updates BeliefState via the conjugate rule. See Feedback Loop above for field semantics.
Returns the current belief state — all Beta distributions by key, with α, β, and posterior mean P(success).
Returns the EV-ranked task queue. Each task shows task_type, EV, utility, cost, and status.
Returns the list of available agents — eval_agent, research_agent, summarise_agent, claim_scorer_agent — with their routing keys and supported domain_keys.