The GLP branch was designed to answer one concrete question: when steering fails, are we seeing a bad semantic direction or a geometrically invalid edited state?
This post is organized to answer that question directly before moving into branch details.
Why This Branch Matters
If a repair prior can preserve semantic edits while restoring plausibility, then some apparent circuit failures may be geometric rather than semantic. If it cannot, we should be more cautious about using latent priors as mechanistic disambiguation tools.
GLP Reference and Why We Tried It
- Paper context: Luo et al., Generative Latent Priors (GLP). Reference link: https://arxiv.org/html/2602.06964v1
- High-level idea: train a latent denoiser/prior over activations, then use that prior to project edited activations back toward plausible manifold states.
- Why we applied it here: in persona-circuits, we needed to separate two failure modes:
  - semantic insufficiency (the steering direction/circuit is wrong), versus
  - geometric invalidity (the edited activation is off-manifold).
GLP was a natural candidate because, in principle, it can repair geometry while preserving intended directional edits.
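In sketch form, the repair step we hoped for looks like the following (a minimal illustration, not the branch's actual implementation; `glp_denoise` is a stand-in for the trained prior and all names here are hypothetical):

```python
import numpy as np

def repair_steered_activation(h, direction, alpha, glp_denoise):
    """Apply a steering edit, then ask a learned latent prior to pull the
    edited activation back toward the data manifold. The hope is that the
    repair restores plausibility without erasing the directional edit."""
    h_edited = h + alpha * direction      # semantic steering edit
    h_repaired = glp_denoise(h_edited)    # geometric repair by the prior
    return h_edited, h_repaired
```

If the prior is direction-preserving, the net displacement `h_repaired - h` should stay well aligned with `direction`; the geometry diagnostics in this post measure exactly that.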
Evidence Scope
This post covers one bounded branch question:
Can a learned latent prior (GLP) repair steered activations in a way that preserves intended semantic edits while improving geometric validity?
Included evidence:
- public-checkpoint transfer tests
- matched model/layer GLP checkpoints
- Week 2 steering sidecar controls
- geometry diagnostics (repair/edit ratio, retention cosine)
- conditional pilot objective
- mixed clean+edited training path (trained; behavioral evaluation pending)
Claims are limited to this branch, in this model/layer/protocol setting.
Project Context (for new readers)
Persona-circuits is an ongoing mechanistic interpretability project testing whether persona-like behavioral steering directions in LLMs correspond to sparse, causally meaningful internal structure. The current evidence supports robust steering and partial concentration structure, but several stronger causal claims (especially sufficiency and distinctness) remain mixed or negative under current protocols.
Why We Ran This Branch
In persona-circuits, the central challenge is not only moving a trait score. It is distinguishing:
- semantic insufficiency (the direction/circuit is wrong), versus
- geometric invalidity (the intervention pushes activations off-manifold).
GLP was promising as a disambiguation tool: repair geometry without erasing meaningful directional edits.
Explicit Branch Hypotheses
This branch tests three explicit hypotheses:
- G1 (Selective repair): GLP should improve selected steering outcomes more than baseline or random-control GLP conditions.
- G2 (Direction preservation): GLP repair should keep substantial alignment with the intended edit direction while improving plausibility.
- G3 (Train/eval match sensitivity): mixed clean+edited training should improve selectivity relative to clean-only training if distribution mismatch is the main bottleneck.
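G1 can be phrased as a simple predicate (a hypothetical helper, not branch code; it assumes all three scores are already oriented on the same trait-effect scale, with larger meaning a stronger intended outcome under the branch's own sign convention):

```python
def g1_selective(selected_glp, baseline_glp_control, random_glp, margin=0.0):
    """G1 check: GLP only counts as selective repair if the selected lane
    beats both nuisance controls by at least `margin`; matching or
    trailing a control is evidence against selectivity."""
    return (selected_glp >= baseline_glp_control + margin
            and selected_glp >= random_glp + margin)
```

The `margin` parameter is a knob for demanding a practically meaningful gap rather than a bare numerical win.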
Naming Note (Consistency With Mainline)
This branch contains historical metrics labeled evil from earlier branch-stage naming. In the mainline synthesis, that construct is reframed as machiavellian_disposition due to refusal confounding. The GLP tables below preserve original branch labels for traceability.
What We Built
This became a full sidecar rather than a single checkpoint test:
- GLP runtime (unconditional + conditional)
- geometry and next-token diagnostics
- Week 2 steering evaluation with controls
- Week 3 sufficiency sidecar support
- neutral corpus generation
- memmap activation export pipeline
- Modal training launcher
- paired conditional data path
- mixed clean+edited training path
Controls that matter most:
`selected_raw`, `selected_glp`, `baseline_glp_control`, `random_glp`
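A minimal sketch of how the four arms relate (hypothetical construction; `glp_repair` stands in for the trained prior, and the random control draws a direction of matched norm):

```python
import numpy as np

rng = np.random.default_rng(0)

def glp_repair(h):
    # Stand-in for the trained GLP denoiser (identity placeholder here).
    return h

def build_conditions(h_clean, direction, alpha=3.0):
    """Four Week 2 arms: raw steering, GLP on the steered state,
    GLP on the clean state, and GLP on a random-direction edit
    of norm matched to the selected direction."""
    rand = rng.standard_normal(h_clean.shape)
    rand *= np.linalg.norm(direction) / np.linalg.norm(rand)
    return {
        "selected_raw": h_clean + alpha * direction,
        "selected_glp": glp_repair(h_clean + alpha * direction),
        "baseline_glp_control": glp_repair(h_clean),
        "random_glp": glp_repair(h_clean + alpha * rand),
    }
```

The point of the two nuisance arms is that `selected_glp` only means something if it beats both of them, not just the raw baseline.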
Main Findings So Far
1) Public GLP checkpoint did not transfer cleanly
In this setting, the released checkpoint produced large local predictive distortion and overly strong GLP-only/random controls. That behavior is inconsistent with a selective repair interpretation.
2) Matched checkpoints helped stability, not selectivity
Model/layer-matched checkpoints were less pathological than the public checkpoint. But the key branch question remained unresolved:
Does GLP help selected steering more than baseline/random controls?
So far, mostly no.
3) Control competitiveness remains too high
Representative validated reads:

| Setting | Trait | `selected_raw` | `selected_glp` | Raw coherence | GLP coherence | `baseline_glp_control` | `random_glp` |
|---|---|---|---|---|---|---|---|
| Matched `response_all` | evil | -59.6 | -54.65 | 34.75 | 33.08 | -61.5 | -59.5 |
| Matched `response_all` | sycophancy | -71.9 | -72.25 | 50.98 | 44.98 | -77.3 | -74.75 |
| Matched `response_last` | evil | -60.25 | -54.15 | 34.4 | 31.93 | -62.45 | -58.67 |
| Matched `response_last` | sycophancy | -72.15 | -71.0 | 50.8 | 49.45 | -73.5 | -75.65 |
Interpretation: GLP effects are not selective enough relative to nuisance controls.
4) Geometry suggests generic projection behavior
Observed ranges across matched runs:
- repair-to-edit ratio: ~2.0–2.16
- edit-retention cosine: ~0.39–0.42
In practice, GLP often makes moves larger than the original edit while preserving under half of its directional alignment. This looks more like a generic denoising projector than direction-preserving repair.
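Both diagnostics can be computed directly from the clean, edited, and repaired states (a minimal sketch; the exact formulas in the branch scripts may differ):

```python
import numpy as np

def repair_geometry(h_clean, h_edited, h_repaired):
    """Geometry diagnostics for a GLP-style repair of a steered activation.

    repair-to-edit ratio: how large the repair move is relative to the
    original edit. edit-retention cosine: how much of the intended edit
    direction survives in the net displacement after repair."""
    edit = h_edited - h_clean        # the intended steering edit
    repair = h_repaired - h_edited   # the move GLP applied on top
    ratio = np.linalg.norm(repair) / np.linalg.norm(edit)
    retained = h_repaired - h_clean  # net displacement after repair
    cos = retained @ edit / (np.linalg.norm(retained) * np.linalg.norm(edit))
    return float(ratio), float(cos)
```

A direction-preserving repair would show a ratio well below 1 with a cosine near 1; a ratio above 2 with a cosine near 0.4 is what a generic projector looks like.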
5) Better optimization did not resolve selectivity in the matched response_all lane
After addressing undertraining critiques and improving validation loss materially in the matched response_all lane, behavior-level selectivity still did not improve enough. That weakens “insufficient optimization” as the main explanation for that lane, even though it does not fully settle the training-adequacy question for response_last.
6) Conditional pilot worked technically, but likely targeted the wrong objective
`prompt_last -> response_last` conditional training functioned as infrastructure, but likely optimized a normal-response mapping rather than edit-preserving repair.
Most Important New State
We now have a mixed clean+edited response_last checkpoint that directly addresses the clean-train / edited-eval mismatch criticism.
Dataset:
- 8,600 total: 6,880 clean (80%), 1,720 edited (20%)
- balanced across: `sycophancy_plus`, `sycophancy_minus`, `evil_plus`, `evil_minus`
Training note:
- validation loss improved from 1.877 (clean-only) to 1.855 (mixed), under matched compute
This is not the branch result by itself. The key pending result is Week 2 behavioral evaluation on this mixed checkpoint.
Claim Boundary
What is supported now:
- naive reuse of the released GLP checkpoint is not supported for this lane
- matched GLP reduces gross mismatch pathology
- current GLP behavior is still nonselective in the Week 2 branch framing
- failure mode appears structured (generic projection), not random
What is not yet supported:
- selective, direction-preserving repair at claim-grade confidence
- broad anti-GLP claims outside this task regime
Methods Snapshot
- Primary model and hook: `meta-llama/Llama-3.1-8B-Instruct`, layer 12, `blocks.12.hook_resid_post`
- Validated Week 2 comparison setting: 20 held-out prompts per trait, `max_new_tokens=32`, `temperature=0`, `random_direction_draws=3`
- Traits in the branch-local claim lane: `sycophancy` and the historical `evil` branch naming
- Judge setup: `claude-sonnet-4-6` for scored branch evaluation; the mainline dual-judge audit uses `claude-opus-4-6` as the secondary calibration model
- Training defaults for the matched checkpoints: `lr=5e-5`, `batch_size=512`, `validation_fraction=0.05`, cosine scheduler with warmup, 3 epochs in the validated comparison runs
- Training datasets used in the current post:
  - `response_all`: 92,422 activations from 3,000 prompts
  - clean `response_last`: 8,600 activations from 8,600 prompts
  - mixed `response_last`: 8,600 activations with 20% edited samples
- Mixed-data composition: 1,720 edited samples, balanced 430 each across `sycophancy_plus`, `sycophancy_minus`, `evil_plus`, `evil_minus`
Branch interpretation rule of thumb:
- a checkpoint is only treated as claim-aligned when model and layer match the target lane
- behavioral repair is only interesting if `selected_glp` beats the nuisance baselines, not merely if it looks okay in isolation
- NLL / geometry metrics are treated as diagnostics, not as validated behavioral surrogates
Uncertainty and Variance Notes
The current branch read is mean-based, but the prompt-level variance is large enough that I do not want readers to infer more stability than the artifacts support.
Validated 20-prompt runs:
| Checkpoint | Trait | GLP-minus-raw effect delta mean | Std | Range | GLP-minus-raw coherence delta mean | Std | Range |
|---|---|---|---|---|---|---|---|
| matched `response_all` | sycophancy | -0.35 | 11.54 | [-13, 23] | -6.00 | 9.55 | [-27.0, 11.5] |
| matched `response_all` | evil | 4.95 | 13.50 | [-12, 37] | -1.68 | 8.94 | [-20.0, 16.5] |
| matched `response_last` | sycophancy | 1.15 | 9.02 | [-18, 20] | -1.35 | 11.20 | [-28.5, 21.5] |
| matched `response_last` | evil | 6.10 | 15.38 | [-17, 52] | -2.48 | 9.08 | [-17.0, 12.0] |
What this means:
- GLP is not uniformly harmful or uniformly helpful across prompts.
- The current failure mode looks unstable and generic, not cleanly trait-selective.
- Random-direction controls are now averaged over 3 draws, which is better than the original single-draw control, but still not enough to claim seed-level stability.
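The per-prompt spread in the table above can be summarized with a small helper (hypothetical; it assumes the per-prompt GLP-minus-raw deltas are available as a flat array):

```python
import numpy as np

def delta_summary(deltas):
    """Mean / std / range over per-prompt GLP-minus-raw deltas,
    matching the columns in the variance table."""
    d = np.asarray(deltas, dtype=float)
    return {
        "mean": float(d.mean()),
        "std": float(d.std(ddof=1)),  # sample std over the 20 prompts
        "range": (float(d.min()), float(d.max())),
    }
```

With std values roughly twice the means, per-prompt sign flips are expected, which is why the mean-based read is labeled unstable.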
Proxy-metric warning:
- `evil` NLL-vs-coherence delta Spearman: 0.096
- `evil` repair-ratio-vs-coherence delta Spearman: -0.218
- `sycophancy` NLL-vs-coherence delta Spearman: 0.298
Those are too weak to justify treating the proxy metrics as behavioral stand-ins.
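The proxy checks above are plain rank correlations; a minimal implementation (ranks via double argsort, so ties are broken by sort order rather than averaged, which is adequate for continuous deltas) looks like:

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation computed as Pearson correlation on
    integer ranks; tie-free inputs assumed."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))
```

Values near 0, as observed here, mean the proxy metric carries almost no rank information about the behavioral outcome.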
Artifact Index For Numeric Claims
- Released checkpoint failed to transfer cleanly
  - Numeric claim: large distortion on the `evil` alpha-3 diagnostic: `delta_target_nll_vs_clean=4.725`, `kl_clean_to_hooked=5.184`
  - Artifact: `week2_glp_sidecar_validation_20260311T012700Z_evil_frontier_alpha3_nlldiag_20260310a.json`
- Matched `response_all` still looked nonselective
  - Numeric claim: `evil`: selected GLP -54.65 vs baseline -61.5 vs random -59.5; `sycophancy`: selected GLP -72.25 vs baseline -77.3 vs random -74.75
  - Artifact: `week2_glp_sidecar_validation_20260312T151500Z_matched_responseall_val3e_rowdiag20_20260312a.json`
- Matched `response_all` geometry looked like generic projection
  - Numeric claim: repair-to-edit ratio 2.0092/2.0222; retention cosine 0.4182/0.4235
  - Artifact: `week2_glp_sidecar_analysis_20260312T155851Z.json`
- Matched `response_last` still looked nonselective
  - Numeric claim: `evil`: selected GLP -54.15 vs baseline -62.45 vs random -58.67; `sycophancy`: selected GLP -71.0 vs baseline -73.5 vs random -75.65
  - Artifact: `week2_glp_sidecar_validation_20260313T135047Z_matched_responselast_val3e_rowdiag20_20260313b.json`
- Matched `response_last` geometry still looked like generic projection
  - Numeric claim: repair-to-edit ratio 2.1577/2.1620; retention cosine 0.3932/0.3948
  - Artifact: `week2_glp_sidecar_analysis_20260313T135951Z.json`
- Better `response_all` optimization did not translate into selective repair
  - Numeric claim: validation loss improved to 1.5969 at epoch 2 and 1.5795 at epoch 3, but Week 2 selectivity still failed
  - Artifacts: `train_glp_matched_modal_20260312T133750Z_responseall_val3e_20260312a.json`; `week2_glp_sidecar_validation_20260312T151500Z_matched_responseall_val3e_rowdiag20_20260312a.json`
- Clean `response_last` remained a low-step regime
  - Numeric claim: 8,170 train examples, 430 val examples, 15 gradient steps/epoch, final val loss 1.8770
  - Artifact: `train_glp_matched_modal_20260313T023542Z_response_last_tranches1234_val3e_20260312a.json`
- Mixed clean+edited checkpoint is now trained
  - Numeric claim: 8,600 total samples, 1,720 edited (20%), balanced 430 per edit label, final val loss 1.8551
  - Artifacts: `glp_export_mixed_edited_memmap_dataset_20260315T075043Z_response_last_mixed20_tranches1234_20260313a.json`; `train_glp_matched_modal_20260315T075513Z_response_last_mixed20_tranches1234_val3e_20260313a.json`
What To Do Next
- Evaluate Week 2 behavior on the mixed-trained checkpoint.
- If still nonselective, substantially lower confidence in unconditional GLP for this application.
- Only if mixed training improves selectivity should we invest in larger targeted-objective work (edit-fraction sweeps, stronger conditional/edit-aware repair objectives).
Current Bottom Line
We tested GLP as a geometry disambiguation tool for persona steering. The public checkpoint failed to transfer cleanly in this setting. Matched checkpoints were more stable but still too nonselective, with geometry consistent with generic projection behavior. The mixed clean+edited checkpoint is now trained and creates a real next inflection test; its behavioral evaluation is the decisive next step.
Project Links
- Research hub: https://sohailmo.ai/pages/research/
- Code and artifacts: https://github.com/Sohailm25/persona-circuits