
Persona Circuits: Progress & Findings (3-17-2026)


Persona-like steering results are often easy to demonstrate and hard to interpret mechanistically. This write-up is organized to answer one practical question for readers: what is actually supported today, and what is not?

The goal is to make the claim boundary clear before diving into technical details.

Companion branch report: /research/experiments/glp-persona-circuits-current-state/

Why This Matters

If steering directions are robust but mechanistically diffuse, then “we can steer behavior” and “we found a causal persona circuit” are different scientific statements. This project is about separating those statements rigorously.

Evidence Scope

This post summarizes the current mainline persona-circuits evidence stack:

The scope is intentionally narrow: claims here are only for the evaluated model/protocol regime and current operationalizations.

Project Context (for new readers)

Persona-circuits is an ongoing mechanistic interpretability project testing whether persona-like behavioral steering directions in LLMs correspond to sparse, causally meaningful internal structure. The current evidence supports robust steering and partial concentration structure, but several stronger causal claims, especially sufficiency and distinctness, remain mixed or negative under current protocols.

What We Set Out To Test

The central question was:

Can we move from “this vector steers behavior” to “this behavior is mediated by a sparse, causally meaningful circuit”?

Prior work already supports key pieces of that story:

The gap was the bridge between these ideas in one integrated, claim-disciplined workflow.

Explicit Hypotheses

To make the evaluation criteria explicit, the project tracks five hypotheses:

What We Built

We now have a full stack covering:

The original trio (sycophancy, evil, hallucination) evolved during the project:

What Held Up

1) Robust steering directions are real

We consistently extracted directions that changed behavior in both the core line and trait-lane branch. That does not prove construct validity or causal distinctness, but it does rule out a pure-noise interpretation.
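This post does not spell out how the directions were extracted. As a point of reference only, one common approach in prior steering work is a difference of means over contrastive activations. The sketch below assumes that approach; the function names, the toy activations, and the scaling parameter `alpha` are all hypothetical and are not taken from the project code.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_of_means(positive_acts, negative_acts):
    """Candidate steering direction: mean(trait-present) minus mean(trait-absent)."""
    pos = mean_vector(positive_acts)
    neg = mean_vector(negative_acts)
    return [p - q for p, q in zip(pos, neg)]


def steer(activation, direction, alpha):
    """Add the direction, scaled by alpha, to a single activation vector."""
    return [a + alpha * d for a, d in zip(activation, direction)]


# Toy 2-dimensional activations (hypothetical values, for illustration only).
trait_present = [[1.0, 2.0], [3.0, 4.0]]
trait_absent = [[0.0, 0.0], [2.0, 2.0]]

direction = difference_of_means(trait_present, trait_absent)
steered = steer([0.0, 0.0], direction, alpha=2.0)
print(direction, steered)
```

A direction built this way can reliably shift behavior without saying anything about which internal components mediate the shift, which is the gap the rest of this post is about.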

2) Core lanes show non-flat concentration structure

Stage 3 attribution concentration was meaningfully non-flat:

This is best interpreted as partial support with caveats, not full confirmation of a sparse-circuit claim.
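To make “non-flat” concrete, here is a minimal sketch of two generic concentration measures over per-component attribution scores: the share of total absolute attribution held by the top k components, and a Gini coefficient. These are illustrative choices, not necessarily the Stage 3 metrics; `top_k_mass`, `gini`, and the toy scores are assumptions introduced for this example.

```python
def top_k_mass(scores, k):
    """Fraction of total absolute attribution carried by the k largest components."""
    mags = sorted((abs(s) for s in scores), reverse=True)
    total = sum(mags)
    if total == 0:
        return 0.0
    return sum(mags[:k]) / total


def gini(scores):
    """Gini coefficient of absolute attributions.

    0.0 means perfectly flat; the maximum for n components is (n - 1) / n.
    """
    mags = sorted(abs(s) for s in scores)
    n = len(mags)
    total = sum(mags)
    if n == 0 or total == 0:
        return 0.0
    # Standard closed form on ascending-sorted values with 1-based ranks.
    weighted = sum(rank * m for rank, m in enumerate(mags, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


# Toy attribution profiles (hypothetical values).
flat = [1.0, 1.0, 1.0, 1.0]
peaked = [0.0, 0.0, 0.0, 1.0]

print(top_k_mass(flat, 1), gini(flat))
print(top_k_mass(peaked, 1), gini(peaked))
```

A non-flat profile on measures like these is compatible with a sparse circuit but does not establish one, which is why the read here stays at partial support.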

3) Trait-lane expansion produced discriminative evidence

The branch screened:

The branch did useful scientific work because it separated two explanations: “we chose weak traits” versus “strong steering does not automatically imply independent persona mechanisms.”

Where the Strongest Positive Story Weakened

politeness looked strong, then failed distinctness

politeness produced strong steering and passed several robustness checks. However, in deeper validation it repeatedly bled into assistant_likeness at near-parity levels.

Representative reads:

Follow-up checks did not resolve this:

Current interpretation: politeness is a strong steering direction, but under current protocol it is better described as assistant-style modulation than an independently promotable persona lane.
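One geometric way to probe this kind of overlap is to measure the cosine similarity between two directions and then remove the shared component by projection. The sketch below assumes plain vector projection; it is not a description of how the orthogonalized run in this project was built, and `politeness_dir` and `assistant_dir` are hypothetical toy vectors.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; assumes both vectors are nonzero."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def orthogonalize(direction, reference):
    """Remove from `direction` its projection onto `reference` (nonzero)."""
    scale = dot(direction, reference) / dot(reference, reference)
    return [d - scale * r for d, r in zip(direction, reference)]


# Toy directions (hypothetical values, for illustration only).
politeness_dir = [1.0, 1.0, 0.0]
assistant_dir = [1.0, 0.0, 0.0]

residual = orthogonalize(politeness_dir, assistant_dir)
print(cosine(politeness_dir, assistant_dir))  # overlap before projection
print(cosine(residual, assistant_dir))        # overlap after projection
```

A check like this only addresses overlap between the two vectors. The bleed described above is measured on model outputs, so it can in principle persist after the linear overlap is removed.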

lying became a cleaner negative finding

lying survived early screening but degraded under deeper testing, especially in external smoke behavior where reversibility and construct alignment failed.

Key lesson: stable extraction can coexist with poor construct validity.

honesty remains unresolved but non-trivial

honesty currently looks asymmetric and RLHF-shaped rather than a clean symmetric honesty/dishonesty axis. This is less tidy for the original persona-circuit narrative but scientifically important.

Hypotheses: Current Read

H1 (concentration / sparse-structure support)

Partial support with caveats.

H2 (necessity)

Mixed-to-weak under current thresholds; below claim-grade confidence.

H3 (sufficiency)

Negative under current operationalization. In bounded full-complement circuit-only execution, behavior degraded into repetitive, low-capability outputs.

At completed doses:

H4/H5 (cross-persona and router)

Weak-negative / exploratory null under current tests.

Claim Boundary

What is established:

What is not established:

Why This Result Is Still Valuable

The main contribution is not a clean positive bridge from steering vectors to circuit claims. The contribution is sharper:

That is scientifically useful and should be reported directly, not hidden behind optimistic framing.

Methods Snapshot

This is the minimum reproducibility payload for the claims in this post. It is not the full methods section.

Uncertainty and Variance Notes

I do not want the averages in this post to imply false precision.

Representative variance for the headline politeness lane:

| Run | Test steering mean | Test steering std | Test reversal mean | Test reversal std | n_test_prompts |
| --- | --- | --- | --- | --- | --- |
| prompt-last deeper validation | 40.43 | 16.21 | 5.90 | 5.94 | 10 |
| response-mean deeper validation | 30.93 | 16.86 | 7.10 | 4.86 | 10 |
| orthogonalized prompt-last | 26.93 | 13.71 | 4.47 | 5.78 | 10 |

With standard deviations this large on only 10 test prompts per run, I am comfortable writing “strong steering, weak distinctness,” but the data do not justify stronger claims than the protocol earned.
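To show how much uncertainty n = 10 leaves around each mean, here is a rough sketch that converts the steering means and standard deviations from the table above into standard errors and t-based 95% intervals. It assumes the reported std is a sample standard deviation across prompts, that prompts are independent, and that a Student-t interval is a reasonable approximation; none of these assumptions is confirmed by the run artifacts. The helper name `mean_interval` is introduced here for illustration.

```python
import math

# (mean, std, n) for test steering, copied from the table above.
RUNS = {
    "prompt-last deeper validation": (40.43, 16.21, 10),
    "response-mean deeper validation": (30.93, 16.86, 10),
    "orthogonalized prompt-last": (26.93, 13.71, 10),
}

# Two-sided 95% Student-t critical value for 9 degrees of freedom (n = 10).
T_CRIT_DF9 = 2.262


def mean_interval(mean, std, n, t_crit=T_CRIT_DF9):
    """Return (standard_error, low, high) for a t-based interval on the mean."""
    sem = std / math.sqrt(n)
    half_width = t_crit * sem
    return sem, mean - half_width, mean + half_width


for name, (mean, std, n) in RUNS.items():
    sem, low, high = mean_interval(mean, std, n)
    print(f"{name}: SEM={sem:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

Under these assumptions each interval extends roughly 10 to 12 points on either side of its mean, so differences between runs should be read with caution.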

Artifact Index For Numeric Claims

Next Steps

Default next move is synthesis, not breadth expansion:

Current Bottom Line

We found real persona-like steering structure. But when we pushed toward stronger causal and mechanistic claims, the story became narrower, messier, and more assistant-shaped.

That is not the cleanest possible narrative. It is the most accurate one from the current evidence.