We found that small LLMs are systematically more confident on wrong answers than right ones

We tested our Hybrid Intelligence system with Karpathy's auto-researchers on (Bio + LLM) intelligence across almost 30,000 experiments and found that small LLMs are systematically more confident on wrong answers than right ones.

| Metric | Correct | Wrong |
| --- | --- | --- |
| First-token entropy | Higher | Lower |
| Probability margin | Lower | Higher |
| t-stat | 2.28 | −3.41 |
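For readers who want to reproduce these two measures on their own model, here is a minimal sketch of how first-token entropy and the top-two probability margin can be computed from raw logits. The function name `first_token_stats` is hypothetical and this is not the code used in the experiments above, just the standard definitions.

```python
import math

def first_token_stats(logits):
    """Entropy (nats) and top-two probability margin for one
    next-token logit vector. Hypothetical helper, not the
    original experiment code."""
    # Numerically stable softmax
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Shannon entropy of the next-token distribution
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Gap between the two most likely tokens
    top2 = sorted(probs, reverse=True)[:2]
    margin = top2[0] - top2[1]
    return entropy, margin
```

Intuitively, a well-calibrated model should show low entropy and a large margin when it is right; the table above reports the opposite pattern.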

The model is more uncertain when it’s right. More confident when it’s wrong.

This is the inverse of what calibration should look like.

You can also check out our first Hybrid Intelligence model: MerlinSafety/HybridIntelligence-0.5B · Hugging Face


Interesting work, and the calibration inversion finding resonates with something I’ve observed from a completely different angle — I’m just a hobbyist, not a researcher, so take this for what it’s worth.

For a small Gradio Space built around a fixed RAG dataset (a novel with three AI characters), I found that coupling a solid base prompt with continuous algorithmic self-prompting during idle time — essentially letting the model rewrite and refine its own internal context while no user is present — produced dramatic coherence improvements. Not marginal ones.

The model arrives at each user interaction already ‘warmed up’, with a denser and more self-consistent internal state. No fine-tuning, no extra components, no GPU overhead. Just structured idle cycles guided by a lightweight algorithm.
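The idle-cycle idea described above can be sketched roughly as follows. Everything here is an assumption about the mechanism as described, not the actual Space code: `generate` stands in for any prompt-to-text model call, and the refinement prompt is illustrative.

```python
import threading

def idle_refine(generate, context, interval_s=30.0, stop_event=None):
    """Hypothetical sketch of idle-time self-prompting: while no
    user is present, periodically ask the model to rewrite its own
    working context so it stays dense and self-consistent.

    `generate` is any callable mapping a prompt string to text."""
    stop_event = stop_event or threading.Event()
    while not stop_event.is_set():
        # The model refines its own internal notes during idle time
        context["summary"] = generate(
            "Rewrite and compress the following notes, keeping them "
            "self-consistent:\n" + context["summary"]
        )
        # Sleep until the next idle cycle, waking early if stopped
        stop_event.wait(interval_s)
    return context
```

In a real app this loop would run in a background thread and be stopped (via `stop_event.set()`) the moment a user message arrives, so the refreshed `context["summary"]` is what the model sees at the start of the interaction.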

You can see the system live here: 432 A Journey Experience - a Hugging Face Space by paulolden1
The theoretical background behind the idle-time approach is outlined in this paper: Emergent Self Preservation Machine Consciousness : Paul Olden & Claude Opus 4.5 : Free Download, Borrow, and Streaming : Internet Archive

The obvious caveat: my dataset is small and constrained, so the effect may be amplified by that. But precisely because the approach is so parsimonious — no extra weights, no selector network, no GPU cost — I wonder whether it would scale, and whether it might address the miscalibration problem you identified by improving the departure point rather than correcting the output after the fact.

Happy to share more details if useful. Your work on separating generation from judgment is genuinely interesting — just coming at the same problem from the other end.
