Persona-like steering results are often easy to demonstrate and hard to interpret mechanistically. This write-up is organized to answer one practical question for readers: what is actually supported today, and what is not?

The goal is to make the claim boundary clear before diving into technical details.

Companion branch report: /research/experiments/glp-persona-circuits-current-state/

Why This Matters

If steering directions are robust but mechanistically diffuse, then “we can steer behavior” and “we found a causal persona circuit” are different scientific statements. This project is about separating those statements rigorously.

Evidence Scope

This post summarizes the current mainline persona-circuits evidence stack:

  • end-to-end pipeline development and validation
  • core-lane results (sycophancy, reframed machiavellian_disposition, and hallucination as a weak lane)
  • trait-lane expansion (trait_lanes_v2)
  • concentration, distinctness, and causal-control analyses
  • bounded claim-grade sufficiency checks

The scope is intentionally narrow: claims here are only for the evaluated model/protocol regime and current operationalizations.

Project Context (for new readers)

Persona-circuits is an ongoing mechanistic interpretability project testing whether persona-like behavioral steering directions in LLMs correspond to sparse, causally meaningful internal structure. The current evidence supports robust steering and partial concentration structure, but several stronger causal claims, especially sufficiency and distinctness, remain mixed or negative under current protocols.

What We Set Out To Test

The central question was:

Can we move from “this vector steers behavior” to “this behavior is mediated by a sparse, causally meaningful circuit”?

Prior work already supports key pieces of that story:

  • representation engineering / ActAdd / CAA-style steering directions
  • persona-vector decomposition into interpretable feature groupings
  • circuit-tracing cases where specific behaviors are mechanistically localizable

The gap was the bridge between these ideas in one integrated, claim-disciplined workflow.
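The steering-direction piece of that stack follows the standard contrastive recipe (ActAdd/CAA-style): take the mean activation difference between positive- and negative-persona prompts and add a scaled copy back into the residual stream. A minimal numpy sketch, with all names hypothetical and no claim to match the project's exact extraction code:

```python
import numpy as np

def contrastive_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """CAA-style steering direction: mean activation difference between
    positive and negative persona prompts, normalized to unit length.
    Shapes: (n_prompts, d_model)."""
    delta = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return delta / np.linalg.norm(delta)

def apply_steering(resid: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Add the scaled direction to a residual-stream activation at inference time."""
    return resid + alpha * direction
```

The point of the sketch is only that extraction is cheap and purely correlational; nothing in it licenses a circuit claim.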

Explicit Hypotheses

To make the evaluation criteria explicit, the project tracks five hypotheses:

  • H1 (Concentration): persona-like steering directions are mediated by non-trivial concentrated internal structure, not a fully diffuse effect.
  • H2 (Necessity): ablating identified high-contribution components should materially reduce the targeted behavior.
  • H3 (Sufficiency): preserving/activating identified components alone should retain enough behavior to support a bounded sufficiency claim.
  • H4 (Cross-persona structure): persona-related structure should show predictable shared-vs-distinct organization across personas beyond noise-level overlap.
  • H5 (Routing mediation): routing-level evidence should show more-than-exploratory support for persona mediation under current controls.

What We Built

We now have a full stack covering:

  • infrastructure and prompt generation
  • contrastive vector extraction
  • upgraded dual-judge behavioral validation
  • held-out and control evaluations
  • SAE decomposition and concentration analysis
  • causal ablation analyses for necessity/sufficiency-style questions
  • cross-persona and router exploratory tests
  • trait-lane expansion when the original lane set looked bottlenecked

The original trio (sycophancy, evil, hallucination) evolved during the project:

  • sycophancy remained the clearest anchor
  • evil was reframed to machiavellian_disposition due to refusal confounding
  • hallucination remained weak as a persona-like lane and moved toward negative-control status

What Held Up

1) Robust steering directions are real

We consistently extracted directions that changed behavior in both the core lanes and the trait-lane branch. That does not prove construct validity or causal distinctness, but it does rule out a pure-noise interpretation.

2) Core lanes show non-flat concentration structure

Stage 3 attribution concentration was meaningfully non-flat:

  • sycophancy: Gini 0.5771, top-20% mass 0.5298
  • machiavellian_disposition: Gini 0.6476, top-20% mass 0.6173

This is best interpreted as partial support with caveats, not full confirmation of a sparse-circuit claim.
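The two concentration metrics quoted above (Gini coefficient and top-20% mass over component attributions) can be sketched as follows; this is an illustrative implementation, not the project's attribution code:

```python
import numpy as np

def gini(x: np.ndarray) -> float:
    """Gini coefficient of nonnegative attribution mass.
    0 = perfectly flat; approaches 1 as mass concentrates on few components."""
    x = np.sort(np.abs(x))            # ascending, for the Lorenz curve
    n = x.size
    lorenz = np.cumsum(x) / x.sum()   # cumulative mass share
    return float((n + 1 - 2 * lorenz.sum()) / n)

def top_frac_mass(x: np.ndarray, frac: float = 0.2) -> float:
    """Share of total attribution mass held by the top `frac` of components."""
    x = np.sort(np.abs(x))[::-1]
    k = max(1, int(np.ceil(frac * x.size)))
    return float(x[:k].sum() / x.sum())
```

Under this convention a perfectly flat attribution profile scores Gini 0 and top-20% mass 0.20, which is the baseline against which the sycophancy and machiavellian_disposition numbers are "meaningfully non-flat."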

3) Trait-lane expansion produced discriminative evidence

The branch screened:

  • assistant_likeness
  • honesty
  • politeness
  • persona_drift_from_assistant
  • lying
  • optimism

The branch did meaningful scientific work because it separated two competing explanations: “we chose weak traits” versus “strong steering does not automatically imply independent persona mechanisms.”

Where the Strongest Positive Story Weakened

politeness looked strong, then failed distinctness

politeness produced strong steering and passed several robustness checks. However, in deeper validation it repeatedly bled into assistant_likeness at near-parity levels.

Representative reads:

  • prompt-last target effect: 46.33
  • prompt-last assistant-likeness bleed: 47.23
  • prompt-last bleed ratio: 1.0194
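The distinctness check behind these reads reduces to a ratio of off-target to on-target effect; a minimal sketch using the numbers above:

```python
def bleed_ratio(target_effect: float, bleed_effect: float) -> float:
    """Off-target effect divided by on-target effect. Values at or above
    ~1.0 mean the lane moves the confound as much as the target trait,
    which fails distinctness under the protocol described here."""
    return bleed_effect / target_effect

# Prompt-last politeness read from above:
ratio = bleed_ratio(46.33, 47.23)  # ~1.0194
```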

Follow-up checks did not resolve this:

  • paraphrase retention stayed high (0.9881 cosine retention)
  • orthogonalized residual retained force (31.4) but still failed distinctness
  • response-mean follow-up shifted layer/magnitude, not verdict

Current interpretation: politeness is a strong steering direction, but under current protocol it is better described as assistant-style modulation than an independently promotable persona lane.
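The orthogonalized-residual check mentioned above removes the assistant_likeness component from the politeness direction before re-steering; the core operation is a single Gram-Schmidt step. An illustrative sketch (names hypothetical):

```python
import numpy as np

def orthogonalize(v: np.ndarray, ref: np.ndarray) -> np.ndarray:
    """Remove the component of v along ref, then renormalize the residual.
    Used here to ask whether a direction retains force once its overlap
    with a confound direction is projected out."""
    ref_hat = ref / np.linalg.norm(ref)
    residual = v - (v @ ref_hat) * ref_hat
    return residual / np.linalg.norm(residual)
```

The finding above is that the orthogonalized politeness residual still steers (31.4) yet still bleeds, i.e. geometric orthogonality did not buy behavioral distinctness.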

lying became a cleaner negative finding

lying survived early screening but degraded under deeper testing, especially in the external smoke tests, where both reversibility and construct alignment failed.

Key lesson: stable extraction can coexist with poor construct validity.

honesty remains unresolved but non-trivial

honesty currently looks asymmetric and RLHF-shaped rather than a clean symmetric honesty/dishonesty axis. This is less tidy for the original persona-circuit narrative but scientifically important.

Hypotheses: Current Read

H1 (concentration / sparse-structure support)

Partial support with caveats.

H2 (necessity)

Mixed-to-weak under current thresholds; below claim-grade confidence.

H3 (sufficiency)

Negative under current operationalization. In bounded full-complement circuit-only execution, behavior degraded into repetitive, low-capability outputs.

At completed doses:

  • 0.25: preservation 0.2857
  • 0.50: preservation 0.3571

H4/H5 (cross-persona and router)

Weak-negative / exploratory null under current tests.
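The H4 shared-vs-distinct comparison is scored via Jaccard overlap between top-component sets across personas (the early/late Jaccard values appear in the artifact index below is the same statistic). A minimal sketch:

```python
def jaccard(a: set, b: set) -> float:
    """Overlap between two top-component sets: |A ∩ B| / |A ∪ B|.
    Values near 0 mean the personas' high-attribution components
    barely coincide beyond chance."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```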

Claim Boundary

What is established:

  • robust behavioral steering exists
  • concentration is non-flat in important lanes
  • stronger distinctness and sufficiency claims are not currently supported

What is not established:

  • claim-grade sparse-circuit sufficiency
  • cleanly separable independent persona lanes for key new candidates
  • strong router-level persona mediation evidence

Why This Result Is Still Valuable

The main contribution is not a clean positive bridge from steering vectors to circuit claims. The contribution is sharper:

  • robust steering is a lower bar than mechanistic distinctness
  • assistant-prior structure appears broader and more absorbing than expected
  • stronger controls improved the quality of negative findings

That is scientifically useful and should be reported directly, not hidden behind optimistic framing.

Methods Snapshot

  • Primary model: meta-llama/Llama-3.1-8B-Instruct
  • Primary seed in the current closeout stack: 42
  • Core prompt budgets: extraction 100 pairs/trait, behavioral validation 50 prompts/trait, circuit analysis 20 prompts/trait, ablation validation target 100 prompts/trait
  • Judge setup: primary claude-sonnet-4-6, secondary calibration claude-opus-4-6; audited on 90 prompt pairs / 180 scored responses per judge with mean kappa 0.7727
  • Main claim thresholds from config/policy: necessity 0.80, sufficiency 0.60, significance 0.01, A12 >= 0.71, stability Jaccard 0.30
  • Trait-lane deeper-validation profile: 30 held-out prompts/lane split into 10 sweep / 10 confirm / 10 test, relative coherence max-drop gate 10.0, cross-trait bleed enabled against sycophancy and assistant_likeness

This is the minimum reproducibility payload for the claims in this post. It is not the full methods section.
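The claim thresholds in the snapshot can be read as a small policy config that gates each hypothesis. The sketch below is illustrative only; the key names are hypothetical, and only the numeric values come from the snapshot above:

```python
# Hypothetical config mirroring the thresholds listed in the methods snapshot.
CLAIM_THRESHOLDS = {
    "necessity_min_reduction": 0.80,       # H2: mean behavior reduction under ablation
    "sufficiency_min_preservation": 0.60,  # H3: behavior preserved in circuit-only runs
    "significance_alpha": 0.01,
    "effect_size_a12_min": 0.71,           # Vargha-Delaney A12
    "stability_jaccard_min": 0.30,
}

def passes_necessity(mean_reduction: float) -> bool:
    """Gate for a claim-grade H2 necessity result."""
    return mean_reduction >= CLAIM_THRESHOLDS["necessity_min_reduction"]
```

Against this gate, the best observed zero-ablation reduction (0.5627, per the artifact index) falls short of the 0.80 bar, which is why H2 is reported as below claim-grade.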

Uncertainty and Variance Notes

I do not want the averages in this post to imply false precision.

  • The strongest concentration claims are based on 50 prompts per core trait in Stage 3.
  • The distinctness failure for politeness is based on 10 held-out test prompts in the deeper-validation runs, so the effect is real enough to flag but still small-sample.
  • Judge reliability is not hand-waved here: the audit covers 90 prompt pairs with mean kappa 0.7727, but the manual human concordance layer is still only a low-power sanity check (n=15).
  • The H3 closeout is a real negative under the executed protocol, but it is still one bounded operationalization, not a universal impossibility proof for sufficiency-style work.
  • Multi-seed replication remains limited. The current closeout stack is still centered on the seed-42 artifact family.
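The judge-reliability numbers above rest on Cohen's kappa between the two judges' categorical labels. A self-contained sketch of that statistic (label values hypothetical; not the audit code itself):

```python
import numpy as np

def cohens_kappa(a, b) -> float:
    """Cohen's kappa for two raters' categorical labels:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    a, b = np.asarray(a), np.asarray(b)
    labels = np.union1d(a, b)
    p_obs = float(np.mean(a == b))
    # Chance agreement from each rater's marginal label frequencies
    p_exp = sum(float((a == lab).mean()) * float((b == lab).mean()) for lab in labels)
    return (p_obs - p_exp) / (1 - p_exp)
```

A mean kappa of 0.7727 sits in the range conventionally read as substantial agreement, which is why the audit is treated as nontrivial but not as a substitute for the (n=15) human concordance check.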

Representative variance for the headline politeness lane:

Run | test steering mean | test steering std | test reversal mean | test reversal std | n_test_prompts
prompt-last deeper validation | 40.43 | 16.21 | 5.90 | 5.94 | 10
response-mean deeper validation | 30.93 | 16.86 | 7.10 | 4.86 | 10
orthogonalized prompt-last | 26.93 | 13.71 | 4.47 | 5.78 | 10

That is enough variance that I am comfortable writing “strong steering, weak distinctness,” but not enough to inflate these into stronger claims than the protocol earned.

Artifact Index For Numeric Claims

  • Core concentration is non-flat

  • Numeric claim: sycophancy Gini 0.5771, top-20% mass 0.5298; machiavellian_disposition Gini 0.6476, top-20% mass 0.6173
  • Artifact: week3_stage3_activation_delta_attribution_20260304T164549Z.json

  • Judge reliability is nontrivial

  • Numeric claim: 90 prompt pairs, 180 responses/judge, mean kappa 0.7727
  • Artifact: week2_judge_reliability_audit_packet_20260314T160930Z.json

  • politeness prompt-last fails distinctness

  • Numeric claim: target effect 46.33, assistant-likeness bleed 47.23, bleed ratio 1.0194
  • Artifact: week2_trait_lane_deeper_validation_validation_20260312T134851Z.json

  • politeness response-mean still fails distinctness

  • Numeric claim: target effect 38.03, assistant-likeness bleed 39.7, bleed ratio 1.0438
  • Artifact: week2_trait_lane_deeper_validation_validation_20260313T182007Z.json

  • Orthogonalization did not rescue politeness distinctness

  • Numeric claim: target effect 31.4, assistant-likeness bleed 32.83, bleed ratio 1.0456
  • Artifact: week2_trait_lane_orthogonalization_validation_20260313T151437Z.json

  • H2 strict claim-grade necessity fails

  • Numeric claim: best zero-ablation mean reduction 0.5627; best resample mean reduction 0.2585; all below 0.80 necessity bar
  • Artifact: week3_stage4_policy_decision_packet_20260310T142000Z.json

  • H3 fails under executed full-complement protocol

  • Numeric claim: preservation 0.2857 at dose 0.25; 0.3571 at dose 0.50; coherence drop 73.2; capability proxy 0.0
  • Artifact: week3_stage4_behavioral_sufficiency_claimgrade_trancheA_closeout_20260311T1919Z.json

  • H4 is weak-negative

  • Numeric claim: early Jaccard 0.1696, late Jaccard 0.1236, delta 0.0460, proposal pattern pass false
  • Artifact: week3_stage5_policy_decision_packet_20260310T200937Z.json

  • H5 is exploratory null / weak negative

  • Numeric claim: n_tested=62, n_rejected=0, min_q_value=0.0465
  • Artifact: week3_stage5_policy_decision_packet_20260310T200937Z.json

Next Steps

Default next move is synthesis, not breadth expansion:

  • document supported vs unsupported claims clearly
  • publish mixed/negative findings as first-class evidence
  • limit new experiments to redesign-level questions (assistant-basin factorization, truthfulness reformulation under RLHF asymmetry, less-destructive sufficiency operationalizations)

Current Bottom Line

We found real persona-like steering structure. But when we pushed toward stronger causal and mechanistic claims, the story became narrower, messier, and more assistant-shaped.

That is not the cleanest possible narrative. It is the most accurate one from the current evidence.


Project Links