Hidden-State Signals in Iterative LLM Repair: What Cosine Similarity at Layer 27 Actually Tells You

I’ve been running a systematic experiment trying to answer a simple question: can you detect whether an LLM is stuck in an unproductive loop by looking at its hidden states — before the output reveals it?

**The setup:** Qwen2.5-7B-Instruct (4-bit), repair loop on a hard code task (LRU Cache with 5 interdependent bugs, 7 test blocks). Forward hook at Layer 27, extracting `[:, -1, :]` every 50 tokens. Primary signal: `max_prev_similarity` — the maximum cosine similarity between the current hidden state and any prior checkpoint.
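For concreteness, here is a minimal sketch of that extraction, assuming a Hugging Face `transformers`-style model; the layer index, checkpoint interval, and function names are illustrative, not the exact code from the report:

```python
# Sketch of Layer-27 hidden-state checkpointing and max_prev_similarity.
# Assumes a transformers-style decoder; names here are illustrative.
import torch
import torch.nn.functional as F

LAYER = 27
CHECKPOINT_EVERY = 50  # tokens between hidden-state snapshots

checkpoints: list[torch.Tensor] = []  # one (hidden_dim,) vector per checkpoint

def max_prev_similarity(hidden: torch.Tensor) -> float:
    """Max cosine similarity between the current last-token hidden state
    (shape: batch, seq, hidden_dim) and all prior checkpoints."""
    h = hidden[:, -1, :].float()            # the [:, -1, :] slice from the setup
    if not checkpoints:
        return 0.0
    prev = torch.stack(checkpoints)          # (n_checkpoints, hidden_dim)
    sims = F.cosine_similarity(prev, h, dim=-1)
    return sims.max().item()

def hook(module, inputs, output):
    # Decoder layers typically return a tuple; element 0 is the hidden states.
    hs = output[0] if isinstance(output, tuple) else output
    checkpoints.append(hs[:, -1, :].squeeze(0).float().detach())

# Attach to the layer of interest (exact attribute path is model-dependent):
# handle = model.model.layers[LAYER].register_forward_hook(hook)
```

In practice the hook would only record every `CHECKPOINT_EVERY` tokens during generation; the gating logic is omitted here for brevity.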

**What we found:**

The signal is real. In multiple runs, high cosine similarity appeared *before* the output started looping, and there were clear cases where the hidden-state signal detected semantic stagnation that n-gram and code-block text detectors completely missed. We found two reproducible dissociation cases: the model was internally circling while producing superficially different-looking outputs.

**The complication:**

High coherence is ambiguous. It marks both productive convergence (the model has found a stable, correct solution) and pathological stagnation (the model is stuck). As a standalone scalar, cosine similarity can’t distinguish attractor types.

This has a concrete implication: if you build an intervention that triggers when coherence is high, you’ll interrupt both good and bad states equally. That’s what happened in our Phase 10.3 — prompt-based interventions triggered by the signal underperformed the baseline (2/8 vs 3/8 success rate).

**What would actually help:**

A second signal to disambiguate. Entropy and confidence margin (logprob-based) show modest combined signal (AUC ~0.59 for regression detection) but aren’t enough on their own either. The more tractable near-term solution turned out to be architectural: a monotonic controller that preserves best-so-far state rather than trying to predict loop states in advance.
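The monotonic idea can be sketched in a few lines. This is a hypothetical interface, not the report's actual controller: the point is only that the loop can stall but never regress, because a candidate is accepted solely when it strictly improves the best score so far (e.g., tests passed):

```python
# Sketch of a monotonic best-so-far controller (hypothetical interface).
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class BestSoFar:
    """Accept a candidate only on strict improvement; otherwise keep
    the current best. A stuck loop wastes iterations but never regresses."""
    best_candidate: Optional[str] = None
    best_score: float = field(default=float("-inf"))

    def submit(self, candidate: str, score: float) -> bool:
        if score > self.best_score:
            self.best_candidate, self.best_score = candidate, score
            return True   # accepted
        return False      # rejected; best-so-far preserved

def repair_loop(generate: Callable[[Optional[str]], str],
                score_fn: Callable[[str], float],
                max_iters: int = 8,
                target: float = 1.0) -> Optional[str]:
    ctl = BestSoFar()
    for _ in range(max_iters):
        cand = generate(ctl.best_candidate)   # condition on best so far
        ctl.submit(cand, score_fn(cand))
        if ctl.best_score >= target:
            break
    return ctl.best_candidate
```

The contrast with prediction-based intervention is the design point: no loop detector has to fire at the right moment, because the controller's invariant (score never decreases) holds regardless of whether the model is converging or circling.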

**Why this matters beyond the specific task:**

Most inference-time control approaches operate on the output side — token filtering, chain-of-thought steering, sampling interventions. The hidden state provides a different channel: it captures *process dynamics* rather than *output content*. The dissociation between the two is the interesting finding.

Full interim report (10 phases, all results including negatives): https://doi.org/10.5281/zenodo.18941566

Happy to discuss methodology, signal extraction specifics, or the monotonic controller design.
