Persona Circuits: Exploring GLP Application


The GLP branch was designed to answer one concrete question: when steering fails, are we seeing a bad semantic direction or a geometrically invalid edited state?

This post is organized to answer that question directly before moving into branch details.

Why This Branch Matters

If a repair prior can preserve semantic edits while restoring plausibility, then some apparent circuit failures may be geometric rather than semantic. If it cannot, we should be more cautious about using latent priors as mechanistic disambiguation tools.

GLP Reference and Why We Tried It

GLP was a natural candidate because, in principle, it can repair geometry while preserving intended directional edits.

Evidence Scope

This post covers one bounded branch question:

Can a learned latent prior (GLP) repair steered activations in a way that preserves intended semantic edits while improving geometric validity?

Included evidence:

Claims are limited to this branch, in this model/layer/protocol setting.

Project Context (for new readers)

Persona-circuits is an ongoing mechanistic interpretability project testing whether persona-like behavioral steering directions in LLMs correspond to sparse, causally meaningful internal structure. The current evidence supports robust steering and partial concentration structure, but several stronger causal claims (especially sufficiency and distinctness) remain mixed or negative under current protocols.

Why We Ran This Branch

In persona-circuits, the central challenge is not only moving a trait score. It is distinguishing:

GLP was promising as a disambiguation tool because, in principle, it could repair geometry without erasing meaningful directional edits.

Explicit Branch Hypotheses

This branch tests three explicit hypotheses:

Naming Note (Consistency With Mainline)

This branch contains historical metrics labeled evil from earlier branch-stage naming. In the mainline synthesis, that construct is reframed as machiavellian_disposition due to refusal confounding. The GLP tables below preserve original branch labels for traceability.

What We Built

This became a full sidecar rather than a single checkpoint test:

Controls that matter most:

Main Findings So Far

1) Public GLP checkpoint did not transfer cleanly

In this setting, the released checkpoint produced large local predictive distortion, and the GLP-only and random controls showed effects that were too strong. That behavior is inconsistent with a selective repair interpretation.

2) Matched checkpoints helped stability, not selectivity

Model/layer-matched checkpoints were less pathological than the public checkpoint. But the key branch question remained unresolved:

Does GLP help selected steering more than baseline/random controls?

So far, mostly no.
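
One way to make the selectivity question concrete is to compare the GLP-minus-raw change in trait score for the selected direction against the same change for a random control. The sketch below is illustrative only: the function name, the per-prompt score lists, and the margin definition are assumptions for this example, not the branch's actual evaluation code.

```python
from statistics import mean


def selectivity_margin(raw_selected, glp_selected, raw_random, glp_random):
    """Return (selected_delta, random_delta, margin) from per-prompt trait scores.

    raw_* holds scores for steering without repair; glp_* holds scores for the
    same prompts with GLP repair applied. All names here are hypothetical.
    """
    if len(raw_selected) != len(glp_selected) or len(raw_random) != len(glp_random):
        raise ValueError("paired arms must cover the same prompts")
    # Mean per-prompt change that GLP adds on top of raw steering.
    selected_delta = mean(g - r for g, r in zip(glp_selected, raw_selected))
    random_delta = mean(g - r for g, r in zip(glp_random, raw_random))
    # A selective repair should give a clearly positive margin.
    return selected_delta, random_delta, selected_delta - random_delta


# Toy numbers, made up for illustration: GLP adds 3.0 to both arms,
# so the margin is 0.0 and the repair is not selective.
print(selectivity_margin([10.0, 20.0], [14.0, 22.0], [5.0, 5.0], [8.0, 8.0]))
```

A margin near zero means GLP moves the random control about as much as the selected direction, which is the pattern described as nonselective here.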

3) Control competitiveness remains too high

Representative validated reads (compact format):

Interpretation: GLP effects are not selective enough relative to nuisance controls.

Figure: Selected GLP versus baseline and random control effects in the validated Week 2 runs.

4) Geometry suggests generic projection behavior

Observed ranges across matched runs:

In practice, GLP often makes moves larger than the original edit while preserving under half of its directional alignment. This looks more like a generic denoising projector than direction-preserving repair.
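
The two geometric quantities in that description can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the plain-list vector representation, and the exact definitions (repair move size relative to the edit, and cosine between the net post-repair displacement and the original edit) are choices made for this example and may differ from the metrics used in the branch.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _norm(a):
    return math.sqrt(_dot(a, a))


def repair_geometry(clean, steered, repaired):
    """Return (move_ratio, retained_cosine) for one activation vector.

    move_ratio: size of the repair move relative to the original edit.
    retained_cosine: cosine between the net displacement after repair
    (repaired - clean) and the original edit (steered - clean).
    """
    edit = [s - c for s, c in zip(steered, clean)]
    repair_move = [r - s for r, s in zip(repaired, steered)]
    net_edit = [r - c for r, c in zip(repaired, clean)]
    edit_norm = _norm(edit)
    if edit_norm == 0.0:
        raise ValueError("steered and clean activations are identical")
    net_norm = _norm(net_edit)
    move_ratio = _norm(repair_move) / edit_norm
    # If repair returns exactly the clean state, no edit direction survives.
    retained_cosine = _dot(net_edit, edit) / (net_norm * edit_norm) if net_norm > 0.0 else 0.0
    return move_ratio, retained_cosine


# Toy 2-D vectors, made up for illustration.
ratio, cosine = repair_geometry([0.0, 0.0], [1.0, 0.0], [0.4, -1.2])
print(ratio > 1.0, cosine < 0.5)
```

In the toy case the repair move is larger than the edit and less than half of the edit's alignment survives, which is the qualitative pattern described above. A direction-preserving repair would instead give a small move ratio and a cosine close to 1.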

5) Better optimization did not resolve selectivity in the matched response_all lane

After addressing undertraining critiques and improving validation loss materially in the matched response_all lane, behavior-level selectivity still did not improve enough. That weakens “insufficient optimization” as the main explanation for that lane, even though it does not fully settle the training-adequacy question for response_last.

6) Conditional pilot worked technically, but likely targeted the wrong objective

The prompt_last -> response_last conditional training worked as infrastructure, but it likely optimized a normal-response mapping rather than edit-preserving repair.

Most Important New State

We now have a mixed clean+edited response_last checkpoint that directly addresses the clean-train / edited-eval mismatch criticism.

Dataset:

Training note:

This is not the branch result by itself. The key pending result is Week 2 behavioral evaluation on this mixed checkpoint.
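
For readers unfamiliar with the idea, a mixed clean+edited training set can be sketched as below. Everything in this block is a hypothetical illustration: the function name, the `edit_fraction` and `alpha` parameters, and the additive steering form are assumptions, and the sketch does not describe the actual dataset behind the checkpoint.

```python
import random


def build_mixed_dataset(clean_activations, direction, alpha, edit_fraction, seed=0):
    """Return a list of (activation, is_edited) pairs.

    A fraction `edit_fraction` of the examples has `alpha * direction` added;
    the remaining examples are copied unchanged. All names are hypothetical.
    """
    if not 0.0 <= edit_fraction <= 1.0:
        raise ValueError("edit_fraction must be in [0, 1]")
    rng = random.Random(seed)
    n_total = len(clean_activations)
    n_edit = round(edit_fraction * n_total)
    edited_idx = set(rng.sample(range(n_total), n_edit))
    mixed = []
    for i, act in enumerate(clean_activations):
        if i in edited_idx:
            # Additive steering edit along the chosen direction.
            mixed.append(([a + alpha * d for a, d in zip(act, direction)], True))
        else:
            mixed.append((list(act), False))
    return mixed


# Toy data, made up for illustration: four 2-D activations, half of them edited.
toy = build_mixed_dataset([[0.0, 0.0]] * 4, [1.0, 0.0], alpha=2.0, edit_fraction=0.5)
print(sum(flag for _, flag in toy))
```

Training on such a mixture exposes the prior to edited states during training, which is the mismatch the mixed checkpoint is meant to address.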

Claim Boundary

What is supported now:

What is not yet supported:

Methods Snapshot

Branch interpretation rule of thumb:

Uncertainty and Variance Notes

The current branch read is mean-based, but the prompt-level variance is large enough that I do not want readers to infer more stability than the artifacts support.

Validated 20-prompt runs:

All deltas are GLP minus raw.

| Checkpoint | Trait | Effect delta mean | Effect delta std | Effect delta range | Coherence delta mean | Coherence delta std | Coherence delta range |
| matched response_all | sycophancy | -0.35 | 11.54 | [-13, 23] | -6.00 | 9.55 | [-27.0, 11.5] |
| matched response_all | evil | 4.95 | 13.50 | [-12, 37] | -1.68 | 8.94 | [-20.0, 16.5] |
| matched response_last | sycophancy | 1.15 | 9.02 | [-18, 20] | -1.35 | 11.20 | [-28.5, 21.5] |
| matched response_last | evil | 6.10 | 15.38 | [-17, 52] | -2.48 | 9.08 | [-17.0, 12.0] |
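
To see how much uncertainty these spreads imply for the means, the sketch below turns each effect-delta mean and standard deviation from the table into an approximate 95% interval. It assumes the reported Std is a sample standard deviation over 20 independent prompts and uses a Student's t interval; both are assumptions made for this illustration, not a description of the branch's own analysis.

```python
import math

# Two-sided 95% critical value of Student's t with 19 degrees of freedom.
T_CRIT_95_DF19 = 2.093


def mean_ci(mean, std, n=20, t_crit=T_CRIT_95_DF19):
    """Return (low, high) of an approximate 95% interval for the mean."""
    half_width = t_crit * std / math.sqrt(n)
    return mean - half_width, mean + half_width


# Effect-delta mean and std per (checkpoint, trait), copied from the table.
rows = {
    ("matched response_all", "sycophancy"): (-0.35, 11.54),
    ("matched response_all", "evil"): (4.95, 13.50),
    ("matched response_last", "sycophancy"): (1.15, 9.02),
    ("matched response_last", "evil"): (6.10, 15.38),
}

for key, (m, s) in rows.items():
    low, high = mean_ci(m, s)
    print(key, round(low, 2), round(high, 2))
```

Under these assumptions, all four effect-delta intervals include zero, so even the largest mean (6.10 for evil on matched response_last) is not clearly separated from no effect at this sample size.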

What this means:

Proxy-metric warning:

Those are too weak to justify treating the proxy metrics as behavioral stand-ins.

Artifact Index For Numeric Claims

What To Do Next

  1. Evaluate Week 2 behavior on the mixed-trained checkpoint.
  2. If still nonselective, substantially lower confidence in unconditional GLP for this application.
  3. Only if mixed training improves selectivity should we invest in larger targeted-objective work (edit-fraction sweeps, stronger conditional/edit-aware repair objectives).

Current Bottom Line

We tested GLP as a geometry disambiguation tool for persona steering. The public checkpoint failed to transfer cleanly in this setting. Matched checkpoints were more stable but still too nonselective, with geometry consistent with generic projection behavior. The mixed clean+edited checkpoint is now trained, and its Week 2 behavioral evaluation is the decisive next step.