We found that small LLMs are systematically more confident on wrong answers than right ones

We tested our Hybrid Intelligence system with Karpathy's auto-researchers on (Bio + LLM) intelligence across almost 30,000 experiments and found that small LLMs are systematically more confident on wrong answers than right ones.

| Metric | Correct | Wrong |
| --- | --- | --- |
| First-token entropy | Higher | Lower |
| Probability margin | Lower | Higher |
| t-stat | 2.28 | −3.41 |
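For readers who want to reproduce these two measures on their own model, here is a minimal sketch of how first-token entropy and the top-two probability margin can be computed from raw logits. The function name `first_token_stats` is hypothetical and this is not the code used in the experiments above, just the standard definitions.

```python
import math

def first_token_stats(logits):
    """Entropy (nats) and top-two probability margin for one
    next-token logit vector. Hypothetical helper, not the
    original experiment code."""
    # Numerically stable softmax
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Shannon entropy of the next-token distribution
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Gap between the two most likely tokens
    top2 = sorted(probs, reverse=True)[:2]
    margin = top2[0] - top2[1]
    return entropy, margin
```

Intuitively, a well-calibrated model should show low entropy and a large margin when it is right; the table above reports the opposite pattern.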

The model is more uncertain when it’s right. More confident when it’s wrong.

This is the inverse of what calibration should look like.

You can also check out our first Hybrid Intelligence model: MerlinSafety/HybridIntelligence-0.5B · Hugging Face


Interesting work, and the calibration inversion finding resonates with something I’ve observed from a completely different angle — I’m just a hobbyist, not a researcher, so take this for what it’s worth.

For a small Gradio Space built around a fixed RAG dataset (a novel with three AI characters), I found that coupling a solid base prompt with continuous algorithmic self-prompting during idle time — essentially letting the model rewrite and refine its own internal context while no user is present — produced dramatic coherence improvements. Not marginal ones.

The model arrives at each user interaction already ‘warmed up’, with a denser and more self-consistent internal state. No fine-tuning, no extra components, no GPU overhead. Just structured idle cycles guided by a lightweight algorithm.
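The idle-cycle idea described above can be sketched roughly as follows. Everything here is an assumption about the mechanism as described, not the actual Space code: `generate` stands in for any prompt-to-text model call, and the refinement prompt is illustrative.

```python
import threading

def idle_refine(generate, context, interval_s=30.0, stop_event=None):
    """Hypothetical sketch of idle-time self-prompting: while no
    user is present, periodically ask the model to rewrite its own
    working context so it stays dense and self-consistent.

    `generate` is any callable mapping a prompt string to text."""
    stop_event = stop_event or threading.Event()
    while not stop_event.is_set():
        # The model refines its own internal notes during idle time
        context["summary"] = generate(
            "Rewrite and compress the following notes, keeping them "
            "self-consistent:\n" + context["summary"]
        )
        # Sleep until the next idle cycle, waking early if stopped
        stop_event.wait(interval_s)
    return context
```

In a real app this loop would run in a background thread and be stopped (via `stop_event.set()`) the moment a user message arrives, so the refreshed `context["summary"]` is what the model sees at the start of the interaction.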

You can see the system live here: 432 A Journey Experience - a Hugging Face Space by paulolden1
The theoretical background behind the idle-time approach is outlined in this paper: Emergent Self Preservation Machine Consciousness : Paul Olden & Claude Opus 4.5 : Free Download, Borrow, and Streaming : Internet Archive

The obvious caveat: my dataset is small and constrained, so the effect may be amplified by that. But precisely because the approach is so parsimonious — no extra weights, no selector network, no GPU cost — I wonder whether it would scale, and whether it might address the miscalibration problem you identified by improving the departure point rather than correcting the output after the fact.

Happy to share more details if useful. Your work on separating generation from judgment is genuinely interesting — just coming at the same problem from the other end.
