Methodology

We read distributions out of prediction markets.

A market price is information under pressure — thousands of participants, real money on the line, folding what they know into one number. That number is already a good probability. What it doesn’t tell you is how much to trust it, or whether the crowd’s recurring habits have biased it.

Credence reads both back out: a calibrated probability, and the evidence strength behind it. The pair is what your agent should act on — never the point alone.

Calibrated means real-world.

A probability is calibrated when it holds up over the long run: the things you call 70% happen about 70% of the time. That’s the bar we hold every number to, after correcting for the systematic biases that recur in market prices.
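The definition above can be checked directly on resolved questions. The following is a minimal sketch of such a check, not Credence's actual procedure: the function name calibration_table, the equal-width binning, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict

def calibration_table(forecasts, outcomes, n_bins=10):
    """Bin forecasts by value and compare each bin's mean forecast
    with the fraction of its questions that resolved "yes" (1)."""
    bins = defaultdict(list)
    for p, y in zip(forecasts, outcomes):
        # Equal-width bins; a forecast of exactly 1.0 goes in the top bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    rows = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_forecast = sum(p for p, _ in pairs) / len(pairs)
        observed_freq = sum(y for _, y in pairs) / len(pairs)
        rows.append((mean_forecast, observed_freq, len(pairs)))
    return rows

# Toy data (hypothetical): ten forecasts of 0.7, seven of which resolved "yes".
forecasts = [0.7] * 10
outcomes = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]
for mean_forecast, observed_freq, count in calibration_table(forecasts, outcomes):
    print(f"forecast {mean_forecast:.2f}  observed {observed_freq:.2f}  n={count}")
```

A well-calibrated forecaster shows the two columns tracking each other in every bin once the bins hold enough resolved questions.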

We don’t aim to beat the market on every question; we aim for Credence’s 70% to mean 70%.

The output is a distribution, not a point.

We return a Beta distribution — the natural shape for a probability that is itself uncertain. Two numbers define it:

  • p_calibrated — the mean: our best estimate of the real-world probability.
  • n_effective — evidence strength; technically, the Beta posterior concentration. How much evidence stands behind p_calibrated.

They map straight onto the distribution’s parameters: alpha = p_calibrated × n_effective, and beta = (1 − p_calibrated) × n_effective.
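The mapping above can be written out in a few lines. This is a small sketch: the helper names beta_params and beta_std and the example values 0.70 and 40 are hypothetical, and the standard deviation uses the standard Beta variance formula, which reduces to p(1 − p) / (n + 1) under this parameterization.

```python
import math

def beta_params(p_calibrated, n_effective):
    """Convert (mean, concentration) into the Beta shape parameters."""
    alpha = p_calibrated * n_effective
    beta = (1.0 - p_calibrated) * n_effective
    return alpha, beta

def beta_std(p_calibrated, n_effective):
    """Standard deviation of Beta(alpha, beta) expressed in mean/concentration form."""
    return math.sqrt(p_calibrated * (1.0 - p_calibrated) / (n_effective + 1.0))

# Hypothetical response values, chosen only for illustration.
alpha, beta = beta_params(0.70, 40.0)
print("alpha, beta:", alpha, beta)
print("mean recovered from alpha, beta:", alpha / (alpha + beta))
print("standard deviation:", beta_std(0.70, 40.0))
```

Note that alpha + beta recovers n_effective and alpha / (alpha + beta) recovers p_calibrated, so the two representations carry the same information.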

High evidence strength (n_effective) gives a tall, narrow posterior — the calibrated mean is better supported. Low evidence strength gives a broad posterior — treat the mean with care. Your agent can take the mean, the spread, a tail, or reason over the whole shape.

Figure: two Beta posteriors at the same mean, one tall and narrow (high n_effective), one broad (low n_effective). Evidence strength sets how tall and narrow the posterior is.
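To make the contrast concrete, here is a rough sketch that compares two posteriors with the same mean and different evidence strength. It estimates a 90% central interval and a tail probability by seeded Monte Carlo sampling with the standard library; the function name summarize, the threshold of 0.5, and the example values are assumptions for illustration. A statistics library with an exact Beta quantile function would give the same quantities without sampling noise.

```python
import random

def summarize(p_calibrated, n_effective, threshold=0.5, draws=20000, seed=0):
    """Approximate a 90% central interval and P(probability < threshold)
    by sampling from Beta(alpha, beta)."""
    alpha = p_calibrated * n_effective
    beta = (1.0 - p_calibrated) * n_effective
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    samples = sorted(rng.betavariate(alpha, beta) for _ in range(draws))
    lower = samples[int(0.05 * draws)]
    upper = samples[int(0.95 * draws) - 1]
    below = sum(s < threshold for s in samples) / draws
    return {"interval_90": (lower, upper), "p_below_threshold": below}

# Same mean, different evidence strength (hypothetical values).
narrow = summarize(0.70, 200.0)
broad = summarize(0.70, 10.0)
print("high n_effective:", narrow)
print("low n_effective: ", broad)
```

An agent that only reads the mean treats these two cases identically; one that reads the interval or the tail sees that the low-evidence case leaves real mass below 50%.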

Evidence strength: how much to trust the number.

n_effective is the piece no market publishes. It’s a real per-question signal — the evidence-strength estimate rises and falls with how much consistent evidence a market actually carries; it is not a constant we paste on.

And it’s held to the same standard as the probability: we aim for the stated evidence strength to mean what it says, so the intervals you act on contain the truth as often as they promise.

Tested against reality, out of sample.

Numbers are checked the only way that counts — against outcomes the model never saw when it made the call. We score with proper scoring rules, measure calibration head-on, and verify that our stated uncertainty has real coverage.
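As an illustration of what scoring against held-out outcomes looks like, the sketch below computes two widely used proper scoring rules, the Brier score and the logarithmic loss. The text does not say which rules Credence uses, so treating these two as the examples is an assumption, as are the function names and the toy data.

```python
import math

def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast and outcome (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(forecasts)

def log_loss(forecasts, outcomes, eps=1e-12):
    """Mean negative log-likelihood of the outcomes (lower is better)."""
    total = 0.0
    for p, y in zip(forecasts, outcomes):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(forecasts)

# Hypothetical held-out questions: forecasts were made before outcomes were known.
held_out_forecasts = [0.9, 0.2, 0.7, 0.4]
held_out_outcomes = [1, 0, 1, 0]
baseline = [0.5] * len(held_out_outcomes)  # uninformative reference forecast

print("Brier:", brier_score(held_out_forecasts, held_out_outcomes),
      "baseline:", brier_score(baseline, held_out_outcomes))
print("Log loss:", log_loss(held_out_forecasts, held_out_outcomes),
      "baseline:", log_loss(baseline, held_out_outcomes))
```

Both rules are proper in the sense that a forecaster minimizes the expected score only by reporting their true belief, which is why they are suited to checking honesty of stated probabilities.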

No look-ahead. No grading our own homework.

Reproducible, versioned, always improving.

Every response is pinned to an immutable snapshot: a snapshot ID, the model version, the release, and the inputs behind it. Ask for that snapshot tomorrow and you get the same answer, by digest.
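One common way to make "the same answer, by digest" checkable is to hash a canonical serialization of the snapshot. The sketch below shows that general idea only. The field names, the placeholder values, the canonical-JSON step, and the choice of SHA-256 are all assumptions for illustration, not Credence's documented format.

```python
import hashlib
import json

def snapshot_digest(snapshot):
    """Hash a canonical JSON form so equal content always yields an equal digest."""
    # Sorted keys and fixed separators remove incidental formatting differences.
    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical snapshot payload; every field name and value here is a placeholder.
snapshot = {
    "snapshot_id": "example-snapshot",
    "model_version": "example-model",
    "release": "example-release",
    "inputs": {"market": "example-market", "price": 0.7},
    "p_calibrated": 0.7,
    "n_effective": 40.0,
}

stored = snapshot_digest(snapshot)
refetched = snapshot_digest(dict(reversed(list(snapshot.items()))))  # same content, new key order
print("digests match:", stored == refetched)
```

Comparing digests lets a client confirm that a snapshot fetched later is byte-for-byte the content it acted on earlier, without comparing every field by hand.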

The research moves fast — more data, sharper calibration, better evidence strength (n_effective) — but it all ships behind a stable contract, each version gated before it goes live and tagged on every number it produces. You never have to track our progress; you know which model answered and you can always reproduce it.