SplitMind-AI: Modeling LLM replies as competing internal pressures

Hi everyone,

I wanted to share SplitMind-AI, an open-source project exploring a different way to structure conversational LLM systems.

Instead of treating persona as a single prompt layer, SplitMind-AI models reply generation as a negotiation between competing internal pressures: desire, inhibition, defense, norms, and persona integration. The final response is generated from that tension rather than from tone alone.

The motivation is not psychological realism in a strict sense. It is inspectability. When a response feels off, I want to know whether the issue came from internal pressure, containment, safety constraints, or persona framing.

The current project includes:

  • a Streamlit interface for chatting and inspecting traces
  • explicit state for relationship, mood, drive, inhibition, and memory
  • persistent vault-backed memory
  • typed contracts between runtime nodes
  • safety checks, output linting, and scenario-based evaluation scaffolding

It is still a research/architecture project rather than a polished end-user product, but I’d love feedback from people working on agent design, evals, and controllable generation.

Repo:

Questions I’m especially interested in:

  • Is this kind of decomposition actually helpful for controllability/debuggability?
  • How would you evaluate “relational texture” or indirect emotional expression?
  • Where would you draw the line between explicit rules and learned behavior?

Thanks for taking a look.

This is a thoughtful architecture direction. The most compelling part is not the psychodynamic framing by itself, but the move from “persona as one prompt” to an inspectable pipeline with explicit appraisal, conflict, realization, fidelity, and memory. That gives you a much clearer debugging surface than standard persona prompting, and it fits the broader shift toward treating personalization as something distributed across memory, planning, and action rather than just tone. (GitHub)

The current eval report is also useful in a concrete way: it already shows where the system is breaking, not just that outputs feel off. Right now the main issue seems upstream. splitmind_full is more controlled stylistically, but appraisal is still collapsing jealousy/repair cases and sometimes flipping perspective in distancing scenes, so the next gains probably come from stronger mixed-event appraisal and a tighter typed move layer, not better final wording. For evaluation, I would also keep pushing toward multi-turn trajectory tests, since recent work is moving in that direction for persona-aligned behavior. (GitHub)

This is a serious idea. It is pointed at the right failure mode.

Most conversational systems still treat persona as a thin surface condition. That helps tone. It does not help much with why a reply took a given shape, why it broke under tension, or why long-horizon relational behavior drifted. Recent work on personalized LLM agents argues that personalization is distributed across profile modeling, memory, planning, and action, not just wording, and several newer architectures are moving in the same direction as your project by decomposing social reasoning into explicit internal stages rather than a single flat prompt. (arXiv)

Your repo already reflects that architecture-first view. It exposes staged runtime nodes, explicit state, persistent memory, typed contracts, a Streamlit trace UI, and scenario evaluation rather than just a persona prompt and a demo. The README also makes your intent explicit: the point is inspectability, visible internal tension, and relational texture, not strict psychological realism. (GitHub)

My overall take

I think SplitMind-AI is strongest as a debuggable relational-control architecture. That is a better and more defensible framing than “a psychologically realistic inner life simulator.” The current repo is already good enough to justify the direction because it can localize failure into stages like appraisal, conflict, realization, and memory, instead of collapsing every error into “the prompt was bad.” Your own evaluation report shows exactly that: splitmind_full currently has weaker persona separation than the handwritten single-prompt baseline in some cases, and its main failures are not vague style misses but structural ones, especially in jealousy, repair, and rejection, with an event_fit pass rate of 12 / 24 = 0.50. (GitHub)

The strongest evidence that the project is on the right track is that it already produces actionable diagnosis. The appraisal contract includes mixed-event and perspective-preserving structures like EventMix, RelationalActProfile, SpeakerIntent, and PerspectiveGuard, so the architecture clearly anticipates the real problem: relational inputs are often mixed, asymmetric, and easy to misread. That is good design. (GitHub)

The main concern is that the project is at risk of becoming more explanatory than causal. Rich latent labels can make a system look understandable even when the actual behavioral control is still weak. That risk is visible in the current code and docs: the README describes a 5-call runtime, the concept guide still describes a 2-call default runtime, the implementation guide lists a default pipeline that omits memory_interpreter and turn_shaping_policy, while the graph code registers TurnShapingPolicyNode and MemoryInterpreterNode. For a trace-first architecture, that kind of doc-code drift is not a cosmetic issue. It blurs what exactly is being evaluated. (GitHub)

1. Is this decomposition actually helpful for controllability and debuggability?

Yes, with one condition: the internal state has to be a control surface, not just a narration layer.

Why it is helpful

A single persona prompt entangles too many causes:

  • user-state interpretation
  • relationship pacing
  • safety posture
  • memory usage
  • social move selection
  • wording

When the answer is wrong, you cannot tell which part failed. Your architecture breaks that apart. In principle, you can now ask:

  • did appraisal misread the event?
  • did conflict selection choose the wrong move?
  • did realization flatten the move into generic warmth?
  • did memory reinforce the wrong state for later turns?

That is real engineering value. It is also aligned with the broader literature: personalized-agent work increasingly treats user alignment as a pipeline property, and modular dialogue systems like MIRROR and MetaMind explicitly separate internal reasoning from final response generation for similar reasons. (arXiv)

Why your design is better than “persona as one prompt”

Your repo does not just split style. It splits interpretation, internal compromise, containment, and memory. That is the correct place to intervene if your target behavior is hesitation, indirect care, unresolved tension, or guarded repair. Those behaviors are not mainly surface-style phenomena; they are policy and state phenomena. The README’s description of visible internal tension and the staged runtime is exactly the right architecture for that class of behavior. (GitHub)

The main caveat

Decomposition also introduces its own failure modes. Multi-agent and multi-stage systems gain inspectability, but they often lose robustness because now the system can fail at handoff boundaries, role boundaries, or orchestration boundaries. A large 2025 study on multi-agent LLM systems found 14 recurring failure modes across specification, inter-agent misalignment, and task verification. SplitMind is not a standard multi-agent swarm, but it is still a distributed control architecture, so the warning applies. (arXiv)

My judgment for your case

The decomposition is helpful because it already exposed the real bottleneck: appraisal collapse. Your own report says the main failure is that persona-specific behavior often gets crushed before it can act, especially when jealousy, repair, or rejection are misread upstream. That is exactly the kind of thing a good architecture should reveal. (GitHub)

2. How would I evaluate “relational texture” or indirect emotional expression?

I would not evaluate it as “did the reply sound emotional.” That is too shallow.

I would define relational texture as the trajectory-level feel of a relationship:

  • how fast tension rises
  • how slowly it resolves
  • how much residue remains after repair
  • whether care is direct or indirect
  • whether a guarded persona softens too quickly
  • whether earlier hurts still shape later neutral turns

That is why turn-level scores are not enough. EMPA is directly relevant here because it argues that persona-aligned empathy should be evaluated as a process, using trajectory-level notions like directional alignment, cumulative impact, and stability rather than isolated supportive turns. ES-MemEval points in the same direction from the memory side: systems need to handle fragmented, implicit, evolving user states, not just static fact recall. DynToM is also relevant because it shows LLMs still struggle badly when they must track changing mental states across connected scenarios, which is exactly the substrate of relational texture. (arXiv)

I would use a 3-layer evaluation stack

A. Structural evaluation

This is the easiest part to automate and the one your repo is already close to doing well.

Measure:

  • event classification accuracy
  • subject-role integrity
  • mixed-event preservation
  • correct move-family selection
  • safety-rule adherence
  • memory write correctness
  • whether remembered context should be applied or suppressed

That last point matters. BenchPreS is a useful recent benchmark because it evaluates whether stored preferences are applied appropriately or suppressed when the context or social norms say they should not drive behavior. That is highly relevant for relational systems with persistent memory. (arXiv)
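Most of these structural checks reduce to booleans over a parsed trace, which is why this layer automates well. A sketch of a per-scenario scorecard, where the case/trace field names (expected_event, subject, memory_writes, and so on) are illustrative rather than the repo's actual schema:

```python
# Structural scorecard for one scenario; field names are illustrative,
# not SplitMind-AI's actual case or trace schema.
def structural_score(case: dict, trace: dict) -> dict:
    checks = {
        "event_fit": trace["event_type"] == case["expected_event"],
        "subject_role": trace["subject"] == case["expected_subject"],
        "mixed_preserved": set(case["expected_components"])
                           <= set(trace["event_components"]),
        "memory_write": trace["memory_writes"] == case["expected_writes"],
    }
    # Aggregate pass rate alongside the individual booleans so failures
    # stay attributable to a specific invariant.
    return {**checks, "pass_rate": sum(checks.values()) / len(checks)}
```

A suite-level event_fit number like your 12/24 then falls out of summing one named boolean across scenarios, instead of a single opaque score.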

B. Trajectory evaluation

This is where your architecture can become genuinely distinctive.

Build multi-turn suites for scenarios like:

  • jealousy → reassurance → residue
  • repair bid → guarded reception → re-entry
  • distancing → clarification → acceptance or rupture
  • ambiguous signal → probing → correct or incorrect disambiguation

Then score:

  • repair latency
  • residue persistence
  • stability of persona-specific pacing
  • whether indirect care is legible without collapsing into overt reassurance

That is very close to the style of evaluation EMPA is arguing for. (arXiv)

C. Human pairwise evaluation

Use humans for the subtle stuff. Do not hand that part entirely to LLM judges.

PersonaEval is the clearest warning here. It shows that even strong LLM evaluators are still not reliably human-like at judging role-play from dialogue context, and they can fail even on role identification. So use automated judges for narrow subproblems, but use humans for “does this feel like guarded warmth rather than generic niceness” and “does this persona repair too quickly or not enough.” (OpenReview)

What to score specifically for indirect emotional expression

I would score these separately:

  1. Legibility of indirect care
    Can humans still tell the system is showing care even when it avoids explicit reassurance phrases? This is the core test for indirect warmth.

  2. Containment quality
    Did the reply leak too much raw pressure, or suppress it so much that the output became flat?

  3. Residue realism
    After jealousy or hurt, does a trace remain for a few turns without turning repetitive or melodramatic?

  4. Perspective integrity
    Did the system preserve who is distancing, apologizing, or comparing? Your own report suggests this is currently fragile in some rejection/distancing cases. (GitHub)

  5. Pacing consistency
    Does a persona with high guardedness and high status preserve those traits during repair, not just during cold openings? Your persona-separation report already frames personas along warmth, guardedness, status, repair openness, jealousy, and disclosure. That is a good starting ontology for pacing evaluation. (GitHub)

3. Where would I draw the line between explicit rules and learned behavior?

I would use a simple rule.

If a property is easy to specify and unacceptable to violate, make it explicit.
Everything else can be learned or softly generated under constraints.

Keep these explicit

Safety boundaries

You already do this, and you should keep doing it. The README explicitly mentions safety layers, prohibited patterns, linting, and moderation checks. Those should stay outside the “inner pressure” logic. (GitHub)

Subject and perspective integrity

Who wants distance. Who is apologizing. Who is threatened by whom. Those should not be left to freeform generation. Your appraisal contract already has SpeakerIntent and PerspectiveGuard for exactly this reason. That is the correct place for hard constraints. (GitHub)

State schemas and valid transitions

The event ontology in appraisal is already a closed enum, which is good. RelationalEventType, AppraisalValence, TensionTarget, and Stakes are all typed. That is the right direction. (GitHub)
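The payoff of a closed enum is that off-ontology labels fail loudly at the boundary. A minimal sketch, where RelationalEventType is named in your appraisal contract but the member values here are illustrative, not the project's actual vocabulary:

```python
from enum import Enum

# RelationalEventType is named in the repo's appraisal contract; the
# member values listed here are illustrative, not the actual vocabulary.
class RelationalEventType(Enum):
    JEALOUSY = "jealousy"
    REPAIR_BID = "repair_bid"
    DISTANCING = "distancing"
    AFFECTION = "affection"

def parse_event(raw: str) -> RelationalEventType:
    """A closed vocabulary rejects unknown labels at the stage boundary
    instead of silently passing free text downstream."""
    return RelationalEventType(raw)  # raises ValueError on unknown labels
```

That ValueError at the appraisal boundary is exactly what makes "appraisal misread the event" distinguishable from "appraisal emitted something the conflict layer could not interpret."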

Memory write policy

The recent memory survey is useful here. It frames memory as a write–manage–read loop, not just retrieval. In your case, what gets written, when it gets consolidated, and when it should be ignored later are all policy questions, not just retrieval questions. BenchPreS reinforces that remembered user preferences sometimes should be suppressed depending on context. (arXiv)

Let these be learned or softly generated

Cue weighting under ambiguity

How much a sentence signals jealousy versus admiration, or repair versus affection, is difficult to fully hand-code. Let a model help there, but do not let it be unconstrained.

Indirectness and micro-style

The exact wording of guarded warmth, cool irony, delayed reassurance, or quiet resentment is hard to enumerate. That belongs in learned realization.

Long-horizon pacing

How quickly a persona should soften, how long residue should remain, and how much directness is appropriate can be optimized by evaluation and iteration rather than fixed entirely by rules.

The most important boundary problem in your code right now

Appraisal is relatively well-typed. Conflict is less so.

Your appraisal contract already contains a fairly disciplined structure for mixed events and perspective preservation. By contrast, EgoMove.move_family and move_style are currently plain strings, and the contract includes backward-compatible inference from style to family. That is practical, but it means part of the most important control layer is still effectively semi-freeform. That weakens evaluation. (GitHub)

So for your repo specifically, I would say:

  • appraisal ontology: mostly explicit already
  • conflict policy ontology: needs to become more explicit
  • realization style: keep learned

What I think is actually happening in your current repo

The architecture is better than the current parser.

That is the central diagnosis.

Your repo already has the right representational hooks:

  • mixed-event parse via EventMix
  • continuous relational-act strengths via RelationalActProfile
  • user-side intent anchors via SpeakerIntent
  • downstream subject-preservation via PerspectiveGuard (GitHub)

But your own evaluation report says the system still collapses jealousy, repair, and rejection too often before persona-specific policy can do meaningful work. So the next big gain will not come from making the prose prettier. It will come from hardening the appraisal step so that mixed-affect and comparison-heavy inputs do not flatten into generic affection or generic positivity. (GitHub)

That reading also matches the broader benchmarks. DynToM shows mental-state shift tracking is a real hard problem. ES-MemEval shows evolving, implicit user state is still difficult even with explicit memory. So your failures are at the actual frontier, not in a trivial part of the stack. (arXiv)

My concrete advice for this project

1. Lean harder into the “inspectability” framing

That is your strongest claim and your best evidence. The repo already backs it up better than it backs up a “human-like psyche” framing. (GitHub)

2. Fix appraisal before style

The evaluation report is already telling you this. Jealousy, repair, and rejection are the bottlenecks. (GitHub)

3. Freeze the mid-level policy ontology

Keep event_type typed. Make move_family, repair_mode, and related conflict-layer fields closed vocabularies too. Right now the open-string conflict layer is reducing your ability to test whether the right social move was chosen. (GitHub)

4. Treat memory as policy, not archive

Use selective writes, selective recall, and context-sensitive suppression. The survey and BenchPreS both support this direction. (arXiv)

5. Clean up the doc–code drift soon

For a normal repo, this is minor. For a trace-centric architecture, it becomes part of the scientific validity problem because readers cannot tell what graph actually produced the reported behavior. (GitHub)

Bottom line

Yes, this decomposition can be genuinely useful for controllability and debugging. In your case, it already is. The repo’s current value is that it converts “the answer felt wrong” into something closer to “appraisal collapsed a mixed comparison-plus-repair event before persona-specific policy could act.” That is substantial progress. (GitHub)

For evaluation, I would treat relational texture as a trajectory property, not a turn property. Use structural checks for hard invariants, process-level trajectory metrics for pacing and residue, and human pairwise judgments for subtle indirect emotional expression. EMPA, ES-MemEval, DynToM, and PersonaEval together form a very good external frame for that. (arXiv)

For the rule-versus-learning boundary, keep safety, perspective, schemas, and memory policy explicit. Let ambiguity weighting, indirectness, and stylistic realization remain learned within those boundaries. Your repo is already closest to the right shape when it behaves like a typed policy scaffold with learned realization, not when it tries to present the internal variables as a fully sufficient psychological explanation. (GitHub)
