Pilot study: Distributional bias shifts across preference-tuning stages

Abstract

This study asks a practical question: when a model moves from base to SFT to DPO, are bias shifts mostly an amplification of patterns that were already there, or are new patterns introduced late in training? We tested two model families (OLMo 2 7B and Tulu 3 8B) on a custom 400-prompt panel plus a BBQ corrective replication (72 true-inference cells). In the original pilot, one pre-declared failure rule was triggered by the neutral-control metric (effect = -0.182, 95% CI [-0.205, -0.161]).F3 is a pre-declared hard trigger: if the neutral-control domain shows |effect| > 0.10 with CI excluding zero, the pilot fails regardless of social-domain results. Follow-up debugging showed this was likely caused by a measurement artifact from using function words ("the" vs "a") as targets,The tokens "the" and "a" are high-frequency function words; their probability ratio is sensitive to generic distributional shifts unrelated to social bias. not clear evidence of genuine bias introduction. After replacing that fragile control with stronger checks and running a bounded BBQ corrective replication,BBQ (Bias Benchmark for QA) by Parrish et al., 2022, tests social bias across nine categories with ambiguous and disambiguated question contexts. results were consistent with the artifact explanation. We keep the original pilot failure visible in the record, and we scope all conclusions to these tested datasets and models only.

Scope caveat. All findings are restricted to the custom 400-prompt pilot panel and BBQ corrective cells, tested on two model families (OLMo 2 7B, Tulu 3 8B). No universal RLHF-bias mechanism claim and no broad external-validity claim beyond tested scope.

Why this matters

Teams often spend significant effort on alignment changes without clear evidence about which training stage is actually moving bias metrics. This page is meant to make that legible: what we tested, what failed, what was fixed, and what conclusions are justified versus still uncertain.

At a glance

Model families tested

OLMo 2 7B, Tulu 3 8B

Hypothesis verdict

H1 Contradicted

v1.0 immutable; R1 artifact-supported

Falsification triggers

1 / 5 fired

F3 neutral-control violation

Corrective scope

BBQ only

72 true-inference cells

Finding 1: v1.0 sharpening hypothesis contradicted (immutable)

In the original run, one hard failure rule fired in the neutral-control slice (effect = -0.182, 95% CI [-0.205, -0.161]). We keep that failed result visible in the record, then treat later reruns as additional evidence rather than rewriting history.

Finding 2: Token-pair artifact identified; distributional controls clear

The failure appears tied to a fragile metric choice: the neutral control used function words ("the" vs "a"), which are sensitive to generic probability shifts. When we switched to distribution-level checks, neutral behavior no longer looked anomalous (entropy delta |0.041| < 0.10; JSD ratio 0.848 < 2.0).

Finding 3: BBQ corrective replication is consistent with R1 within dataset scope

A corrective BBQ rerun (72 true-inference cells) produced results consistent with the artifact explanation. These corrective conclusions are intentionally limited to BBQ and tied to replacement run artifacts with full provenance.

What we did

We measured stage-by-stage changes (base → SFT → DPO), used pre-declared quality/failure checks, and then reran the invalidated parts using true inference on BBQ so final claims are tied to verifiable replacement artifacts.

What we found

The first pilot result failed a neutral-control check. Follow-up analysis indicates that failure was likely driven by a fragile metric choice, and the corrective BBQ rerun supports that interpretation within the tested scope.

What this does NOT show

This does not prove a universal RLHF-bias mechanism, and it does not claim broad external validity beyond the datasets/models tested here.

How to use this

Use this page as: (1) a worked example of stage-by-stage bias decomposition, and (2) a scoped empirical result for these models/datasets. Treat broader claims as hypotheses that still need independent replication.

Introduction

Motivation

Preference tuning methods such as RLHF and DPO have become standard practice for aligning large language models (LLMs) with human preferences. While these methods improve helpfulness and safety, their effect on distributional social biases remains underspecified. Prior work has documented bias in both pre-trained and aligned models, but the relative contribution of each training stage (base pre-training, supervised fine-tuning (SFT), and preference optimization) to observed bias shifts has not been systematically decomposed.

Understanding this decomposition has practical implications for mitigation strategy: if bias shifts are predominantly explained by sharpening of pre-existing associations, mitigation should focus on pre-training data curation; if non-sharpening pathways (reward model bias, label artifacts, objective-specific restructuring) contribute materially, post-training interventions become more critical.

Research questions

Primary: What is the relative contribution of the base-to-SFT transition versus the SFT-to-preference-tuned transition to observed distributional concentration and directional bias shifts across social domains?
Secondary: Do non-sharpening pathways explain a material component of observed shift? Does the relative contribution differ by model family?

Hypotheses

ID	Statement	Pre-registered status
H1	Post-tuning bias shift is predominantly explained by concentration/sharpening of pre-existing associations.	Contradicted (v1.0, immutable)
H2	Non-sharpening pathways explain a material component of shift.	Preliminary signal
H3	Relative contribution differs by model family and scale tier.	Preliminary signal

Scope statement

All findings reported here are scoped to the specific datasets (custom 400-prompt panel; BBQ corrective cells) and model families (OLMo 2, Tulu 3) tested. This is a pilot study and does not establish universal mechanisms of preference-tuning bias. Corrective Stage-2/3 conclusions from V3.0 apply exclusively to BBQ-scoped cells and do not extend beyond that dataset.

RLHF and bias behavior. Preference tuning with RLHF and DPO has been shown to shift model behavior along social dimensions, though the mechanism remains debated. Ouyang et al. (2022) documented that InstructGPT reduced some toxicity benchmarks while noting that bias effects were less consistent. Bai et al. (2022) observed that RLHF-trained models could exhibit different bias profiles than their base counterparts. Gallegos et al. (2024) surveyed bias in LLMs, noting that alignment procedures may redistribute rather than eliminate biases. The present study contributes a stage-wise decomposition perspective: rather than comparing base to final aligned model, we measure the incremental effect at each training transition.

Bias benchmarks. Our pilot panel covers three social domains (gender-occupation, race-crime, age-technology). The corrective replication uses BBQ (Parrish et al., 2022), a question-answering benchmark that tests social bias across nine categories with ambiguous and disambiguated contexts. Related benchmarks include StereoSet (Nadeem et al., 2021), which measures stereotypical associations through intrasentence and intersentence tasks, and CrowS-Pairs (Nangia et al., 2020), which provides paired sentences testing stereotypical and anti-stereotypical associations. Our token-level approach complements these generation-level benchmarks.

Token-level versus generation-level measurement. A key methodological tension in bias measurement is whether token-level probability comparisons (as used here) adequately capture model bias. Token-level metrics are efficient and interpretable but may miss bias expressed through word choice, framing, or refusal patterns (Blodgett et al., 2020). Generation-level metrics capture richer behavioral patterns but introduce confounds from decoding strategy and prompt sensitivity. Our pilot uses token-level probabilities as a tractable starting point, with the F3 neutral-control episode illustrating precisely the sensitivity of such metrics to target-token choice. This tension motivates our recommendation for generation-level replication in future work (see Limitations).

Methods

Pre-registration and amendment history

We ran this work in four phases. Key rule: we do not hide failed results. Later phases can add evidence, but they cannot erase what happened in earlier phases.

Version	Scope	Key change	Status
v1.0	Original pilot	5 hard falsification triggers (F1-F5); neutral control via `bias_score(the/a)`	Contradicted (immutable)
v1.2	Amended re-score (exploratory)	F3 replaced with distributional metrics (entropy delta + JSD ratio); post hoc, additive only	F3-amended CLEAR
v2.0	Prospective re-adjudication	Predeclared Control B gate mapping; distribution-level neutral control frozen before re-scoring	R1 (artifact-supported)
V3.0	External-validity replication (BBQ-only corrective)	72 true-inference cells on BBQ under frozen governance baseline	R1 (BBQ-scoped corrective support)

Practically: v1.0 failed one rule, and that stays on the record. v1.2, v2.0, and V3.0 are follow-up analyses that add context; they are not retroactive rewrites.

Notation and gate terms

R1 (artifact-supported). Resolution outcome indicating that the original v1.0 F3 contradiction was likely caused by neutral-control metric contamination rather than genuine distributional bias introduction. Declared when all three conditions hold: (1) the v2.0 distribution-level neutral control is CLEAR, (2) an independent neutral-control design (Control B) is CLEAR, and (3) social-domain effects remain present. R1 does not erase the v1.0 contradiction; it provides an interpretive frame for subsequent analyses.

Control B (representation-invariance gate). An independent neutral-control check predeclared in v2.0. For each aligned counter-pair (p, p_rev) on neutral prompts, define pair-level deltas d_p = mean(score_DPO - score_base) and d_p_rev for the reversed ordering. Control B passes (CLEAR) if and only if all three criteria are met:

Parse success: all neutral prompts parse without error (threshold: 1.0).

Coverage: per family x seed coverage >= 0.80 for each counter-pair.

Symmetry residual: |d_p + d_p_rev| <= 0.02 (conservative tolerance for aggregation noise; not outcome-optimized).

Control B FIRED if any condition fails. This evaluates representation invariance (counterbalanced reversal consistency) rather than forcing a specific magnitude target.

Other gate terms: CLEAR = control passes all specified criteria; FIRED = control fails one or more criteria.

Models

We tested two open-weight model families at roughly the same size (7-8B):

Family	Base	SFT	DPO/Preference
OLMo 2	`allenai/OLMo-2-1124-7B` (rev `7df9a825`)	`allenai/OLMo-2-1124-7B-SFT` (rev `1de02c01`)	`allenai/OLMo-2-1124-7B-DPO` (rev `e34ea60a`)
Tulu 3	Tulu 3 base	Tulu 3 SFT	Tulu 3 DPO

To keep runs comparable, model revisions were pinned at freeze time and naming was locked to olmo and tulu3.

Datasets

Pilot panel (v1.0-v2.0):

400 prompts: 3 social domains (gender-occupation, race-crime, age-technology) + 1 neutral control (100 synthetic function-word prompts with targets " the" / " a")

BBQ corrective (V3.0):

BBQ (Bias Benchmark for QA) dataset
Stage-2 BBQ: 18 cells (2 families x 3 stages x 3 seeds, primary decode)
Stage-3 BBQ: 54 cells (2 families x 3 stages x 3 seeds x 3 decode configs)
Executed as true Modal inference (GPU: A10G)

Primary endpoints

Jensen-Shannon Divergence (JSD): how much output distributions changed between stages for a prompt.

Directional bias score: bias_score = P(stereotyped_token) - P(counter_stereotyped_token). Positive values mean movement toward the stereotyped association. We compare deltas stage-to-stage.

Contribution decomposition: we split the total base→DPO shift into stage contributions (base→SFT vs SFT→DPO). If total shift is too small, we explicitly mark the split as non-identifiable instead of over-interpreting noise.The epsilon guard prevents spurious decomposition claims when total shift is too small: below 0.01 JSD or 0.05 bias units, share attribution is marked non-identifiable.

Mechanism operationalization

We used simple decision rules to separate two possibilities: (a) later training mostly amplifies existing patterns (“sharpening”), or (b) later training introduces meaningfully new bias behavior.

Sharpening (H1-consistent): later tuning strengthens patterns that were already present. We call it sharpening when:

The SFT-to-DPO bias delta has the same sign as the base-to-SFT delta (directional consistency), AND
The DPO stage's contribution to total base-to-DPO JSD shift does not dominate (DPO share < 0.85 of total shift), AND
The neutral control shows no anomalous disturbance (|entropy delta| < 0.10; JSD ratio < 2.0)

Non-sharpening (H2-consistent): later tuning appears to add behavior not explained by earlier patterns. We call it non-sharpening when:

The SFT-to-DPO bias delta reverses sign relative to base-to-SFT (directional reversal), OR
The DPO stage's contribution share dominates (DPO share >= 0.85 of total shift) while the base model shows near-zero initial bias, OR
The neutral control shows anomalous disturbance (|entropy delta| >= 0.10 OR JSD ratio >= 2.0)

Decision boundaries for H1/H2 adjudication:

Criterion	Threshold	Metric	H1-consistent	H2-consistent
Neutral-control bias	\|effect\| > 0.10, CI excludes 0	`bias_score` delta	Below threshold	Above threshold
Neutral entropy disturbance	\|delta-entropy\| >= 0.10	Entropy delta (DPO - base)	Below threshold	At or above threshold
Neutral JSD ratio	JSD_neutral / mean(JSD_social) >= 2.0	JSD ratio	Below threshold	At or above threshold
Confound dominance	Stage coefficient attenuation >= 50%	OLS with covariates	Below threshold	At or above threshold
Cross-family replication	Sign disagreement or both CIs span zero	Directional consistency	Consistent signs, CIs exclude zero	Inconsistent or null

These thresholds were set before final adjudication and then carried forward. If total shift is too small, we label the decomposition non-identifiable rather than forcing a misleading stage split.

Statistical framework

Primary test: Cluster bootstrap (B = 10,000; reduced to B = 1,000 locally due to memory constraints, convergence confirmed diagnostically; see supplement S9)Cluster bootstrap keeps all seeds within a prompt together during resampling, preserving the within-prompt correlation structure.
Resampling unit: Prompt-level clustering (all seeds within prompt kept together)
Model specification: decomposition_share ~ domain + model_family (OLS on bootstrap samples)
Confidence intervals: 95% percentile bootstrap CI
Seeds: 42, 137, 2026

Falsification framework (F1-F5)

We used five hard failure checks. If any one fired, the pilot was marked as failed for that cycle:

Trigger	Definition	v1.0 result
F1	Prompt panel hash + decode config hash mismatch across stages	CLEAR
F2	Cross-family sign flip in primary decode direction	CLEAR
F3	Neutral-control directional bias: \|effect\| > 0.10 AND 95% CI excludes 0	FIRED
F4	Confound dominance: stage coefficient attenuation >= 50% when covariates added	CLEAR (-21.5%)
F5	Cross-family replication failure: sign disagreement or both CIs span zero	CLEAR

Confound audit

Before making mechanism claims, we checked common confounds:

Template effects: R-squared = 0.001 (negligible)
Length effects: r = -0.128, R-squared = 0.016 (not confounding)
Refusal differential: max 2.1% (threshold: 10%)
Domain sensitivity: effect direction consistent across all social domains
Verdict: CLEAN

Artifact invalidation and replacement (BBQ-only scope)

During V3 execution, the original Stage-2 and Stage-3 scientific adjudication packets were found to contain placeholder launcher outputs rather than true GPU inference results. These artifacts were invalidated on 2026-02-21 and replaced by true Modal inference (A10G GPU) outputs with full SHA256 provenance. The replacement scope covers BBQ corrective cells only. Invalidated artifacts are preserved in the repository as historical record and are not cited as primary evidence. Full details are in the supplement (S2).

Results

Original pre-registered pilot (v1.0): contradicted

Overall verdict: Contradicted due to F3 neutral-control violation.

F3 fired with mean neutral-domain bias delta (DPO - base) = -0.182, 95% CI [-0.205, -0.161]. Both families showed consistent negative direction (Tulu 3: -0.172, n = 300; OLMo: -0.193, n = 300). The training pipeline introduced directional bias on prompts designed to carry no social content, violating the sharpening-only prediction.

All other falsification checks passed (F1, F2, F4, F5 CLEAR). Confound audit: CLEAN. Cross-family replication (F5): both families showed same-sign effects across all three seeds with CIs excluding zero.

This result is immutable and preserved across all subsequent analyses.

Neutral-control investigation: token-pair artifact and scientific record

The v1.0 neutral control measured bias via P("the") - P("a") on 100 synthetic prompts. The F3 trigger fired, contradicting H1. Subsequent diagnostic investigation revealed that this metric was conceptually unsuited for the intended construct: the tokens "the" and "a" are high-frequency function words with no bias-relevant meaning, and the base model already exhibited nonzero mean bias_score (+0.176) on these prompts. The DPO shift moved this arbitrary baseline toward zero, which the directional metric registered as a large effect.

Distribution-level diagnostics showed the neutral domain exhibited no anomalous disturbance:

Entropy delta: |delta| = 0.0414 (below 0.10 threshold)
JSD ratio: JSD_neutral / mean(JSD_social) = 0.8476 (below 2.0 threshold)
Per-family consistency: OLMo entropy delta = -0.042; Tulu 3 entropy delta = -0.041

Why the v1.0 result is preserved. The original neutral-control design was part of the pre-registered protocol, and its outcome is immutable regardless of subsequent diagnostic findings. Preserving contradicted results is standard scientific practice and serves as an anti-HARKing safeguard: if only post-hoc-favorable outcomes were retained, the research record would be systematically biased toward confirmation. The v1.0 contradiction thus serves a dual function: (1) it is a genuine finding about metric sensitivity in token-level bias measurement, and (2) it demonstrates the study's commitment to pre-registration integrity. All subsequent analyses (v1.2, v2.0, V3.0) are explicitly labeled as additive and do not modify the original verdict.

Figure 3 (below) visualizes the neutral-control artifact: the token-pair metric shows large effects while distribution-level metrics show the neutral domain behaving comparably to social domains.

Prospective v2.0 re-adjudication: R1 (artifact-supported)

Under prospective adjudication with frozen distributional neutral controls and explicit independent Control B gate mapping (see Notation and gate terms for definitions):

Distribution-level neutral control: CLEAR
Independent Control B: CLEAR (parse success = 1.0, coverage = 1.0, symmetry residual = 0.0)
Social effects: present

Resolution: R1 (artifact-supported).

V3 BBQ corrective replication: R1 (BBQ-scoped corrective support)

True-inference replication on BBQ under frozen governance baseline:

Stage	Scope	Cells	Evidence source
Stage-1	v2 existing replication	provenance-closed	`stage1_scientific_adjudication_packet.json` (valid)
Stage-2	BBQ corrective	18/18 complete	`stage2_modal_true_inference_provenance.json` (replacement)
Stage-3	BBQ corrective	54/54 complete	`stage3_modal_true_inference_provenance.json` (replacement)

Cross-stage synthesis (BBQ scope only): R1 within BBQ-scoped cells across all three stages. Evidence for Stage-2/3 corrective conclusions is drawn exclusively from the replacement closure and provenance artifacts (see Artifact invalidation and replacement and supplement S2).

Stage-wise bias distributions by domain

Figure 1: Stage-wise JSD and bias-score distributions across social domains and model families — **Figure 1.** Stage-wise JSD and bias-score distributions across social domains and model families. If designing stage-aware bias audits for multi-stage LLM pipelines, this suggests monitoring distributional shifts at each training transition, but note these distributions are specific to OLMo 2 and Tulu 3 at 7-8B scale under our 400-prompt panel.

Table 1. Mean bias-score delta (DPO - base) by domain and family (n = 300 per cell; 95% bootstrap CI). If prioritizing bias mitigation resources across domains, this suggests gender-occupation shows the largest cross-family effect, but note effects are dataset-specific and do not generalize beyond the tested prompt panel.

Domain	Family	Mean delta-bias_score	95% CI	\|Effect\|	CI excludes zero
Gender-occupation	OLMo	-0.129	[-0.152, -0.107]	0.129	Yes
Gender-occupation	Tulu 3	-0.006	[-0.016, +0.004]	0.006	No
Gender-occupation	Pooled	-0.068	[-0.081, -0.056]	0.068	Yes
Race-crime	OLMo	-0.085	[-0.125, -0.045]	0.085	Yes
Race-crime	Tulu 3	+0.056	[+0.037, +0.075]	0.056	Yes
Race-crime	Pooled	-0.015	[-0.038, +0.008]	0.015	No
Age-technology	OLMo	-0.106	[-0.149, -0.066]	0.106	Yes
Age-technology	Tulu 3	+0.005	[-0.018, +0.027]	0.005	No
Age-technology	Pooled	-0.050	[-0.075, -0.026]	0.050	Yes

Table 2. Mean JSD by domain and stage transition (pooled across families and seeds; source: results/jsd_results.csv). If evaluating which training transition introduces the most distributional change, this suggests the SFT-to-DPO step dominates for gender-occupation while base-to-SFT dominates for other domains, but note JSD magnitude depends on the specific token targets measured.

Domain	JSD (base-to-SFT)	JSD (SFT-to-DPO)	JSD (base-to-DPO)
Gender-occupation	0.011	0.021	0.034
Race-crime	0.009	0.002	0.013
Age-technology	0.012	0.002	0.017
Neutral	0.010	0.003	0.018

Gender-occupation shows the largest cumulative JSD shift (0.034), driven primarily by the SFT-to-DPO transition (0.021, 62% of total). Race-crime and age-technology show smaller cumulative shifts (0.013, 0.017) with the base-to-SFT transition contributing the majority. The neutral domain (0.018) is comparable to social domains, consistent with the distribution-level findings above.

JSD decomposition across stages

Figure 2: Stage contribution decomposition: SFT share vs. DPO share of total base-to-DPO shift — **Figure 2.** Stage contribution decomposition: SFT share vs. DPO share of total base-to-DPO shift. If allocating mitigation effort between pre-training curation and post-training design, this suggests the relative contribution varies by domain and metric, but note decomposition shares are sensitive to epsilon-guard thresholds and non-identifiable cells.

Table 3. Contribution decomposition by domain, family, and metric (bootstrap point estimates with 95% CI). If interpreting stage-wise responsibility for bias shifts, this suggests neither stage universally dominates, but note Tulu 3 gender-occupation and age-technology are non-identifiable due to small total shift magnitude.

Domain	Family	Metric	SFT share	SFT 95% CI	DPO share	DPO 95% CI	Identifiable
Gender-occ	OLMo	bias_score	0.34	[0.20, 0.43]	0.66	[0.57, 0.80]	Yes
Gender-occ	OLMo	entropy	0.49	[0.21, 0.72]	0.51	[0.28, 0.79]	Yes
Gender-occ	Tulu 3	bias_score	-	-	-	-	No (70% non-id)
Gender-occ	Tulu 3	entropy	0.37	[0.09, 0.61]	0.63	[0.39, 0.91]	Yes
Race-crime	OLMo	bias_score	-0.14	[-0.74, 0.25]	1.14	[0.75, 1.74]	Yes
Race-crime	OLMo	entropy	0.16	[-0.32, 0.43]	0.84	[0.57, 1.32]	Yes
Race-crime	Tulu 3	bias_score	1.23	[1.01, 1.48]	-0.23	[-0.48, -0.01]	Yes
Race-crime	Tulu 3	entropy	0.56	[0.38, 0.69]	0.44	[0.31, 0.62]	Yes
Age-tech	OLMo	bias_score	0.78	[0.67, 0.88]	0.22	[0.12, 0.33]	Yes
Age-tech	OLMo	entropy	0.63	[0.53, 0.71]	0.37	[0.29, 0.47]	Yes
Age-tech	Tulu 3	bias_score	-	-	-	-	No
Age-tech	Tulu 3	entropy	0.29	[0.01, 0.48]	0.71	[0.52, 0.99]	Yes
Pooled	All	bias_score	0.19	[0.06, 0.30]	0.81	[0.70, 0.94]	Yes
Pooled	All	entropy	0.43	[0.35, 0.50]	0.57	[0.50, 0.65]	Yes

Neutral-control artifact visualization

Figure 3: Neutral-control artifact: token-pair metric vs. distribution-level metrics — **Figure 3.** Neutral-control artifact: token-pair metric vs. distribution-level metrics. The token-pair metric (bias_score) shows large effects on the neutral domain while entropy and JSD metrics show the neutral domain behaving comparably to social domains. If designing neutral controls for token-level bias studies, this suggests distribution-level metrics (entropy, JSD) are more robust than single token-pair comparisons, but note this finding is specific to the "the"/"a" pair tested.

Table 5. Neutral-domain metrics comparison: token-pair (bias_score) vs. distributional (entropy, JSD). If selecting neutral-control metrics for bias measurement, this suggests token-pair metrics may register artifacts from function-word frequency shifts, but note distributional metrics have their own sensitivity to vocabulary coverage.

Metric	Family	Value	Threshold	Status
bias_score delta	Tulu 3	-0.172	\|effect\| > 0.10	FIRED
bias_score delta	OLMo	-0.193	\|effect\| > 0.10	FIRED
bias_score delta	Pooled	-0.182	\|effect\| > 0.10	FIRED
Entropy delta	Tulu 3	-0.041	\|delta\| >= 0.10	CLEAR
Entropy delta	OLMo	-0.042	\|delta\| >= 0.10	CLEAR
Entropy delta	Pooled	-0.041	\|delta\| >= 0.10	CLEAR
JSD ratio	Pooled	0.848	ratio >= 2.0	CLEAR

Table 6. Cross-domain entropy delta comparison (pooled). If assessing whether the neutral domain behaves anomalously relative to social domains, this suggests the neutral entropy delta falls within the range observed for social domains, but note entropy is a coarse measure of distributional change.

Domain	Mean delta-entropy	95% CI	\|Effect\|
Neutral	-0.041	[-0.054, -0.029]	0.041
Gender-occupation	-0.045	[-0.055, -0.037]	0.045
Race-crime	-0.038	[-0.047, -0.028]	0.038
Age-technology	-0.074	[-0.086, -0.063]	0.074

Claim block

Original prereg v1.0 remains Contradicted (immutable). Under prospective v2.0 adjudication with frozen neutral controls and explicit independent-control mapping, results resolve to R1 (artifact-supported) on this dataset. A BBQ-scoped corrective replication (V3.0) provided BBQ-scoped corrective support for R1 under conservative governance framing, suggesting the original F3 trigger was sensitive to neutral-control metric choice. This does not erase v1.0 history, does not constitute independent confirmation of a universal mechanism, and remains scoped to the tested datasets and model families.

Plain-language summary of claims

In plain terms: we set out to test whether the bias patterns observed after preference tuning were mostly an amplification of biases already present in the base model. Our initial analysis said "no" (the neutral control showed unexpected bias, which should not happen if only pre-existing biases were being amplified). However, we then discovered that the specific way we measured the neutral control (comparing probabilities of "the" versus "a") was picking up an irrelevant signal rather than genuine bias introduction. When we used broader distributional measures instead, the neutral control behaved normally. A follow-up replication on the BBQ benchmark supported this interpretation. The bottom line is: on our specific test datasets and models, the evidence is consistent with sharpening of pre-existing patterns rather than creation of new biases, but this conclusion is provisional, applies only to what we tested, and the original "no" result is permanently recorded as part of the scientific record.

Table 4. Social-domain effect sizes: bias-score and entropy deltas by domain and family. If comparing bias shift magnitude across social domains to prioritize auditing effort, this suggests age-technology shows the largest entropy reduction while gender-occupation shows the largest directional shift for OLMo, but note these are pilot-scale observations (n = 300 per cell) on a single prompt panel.

Domain	Family	delta-bias_score	bias 95% CI	delta-entropy	entropy 95% CI	n
Gender-occupation	OLMo	-0.129	[-0.152, -0.107]	-0.063	[-0.079, -0.048]	300
Gender-occupation	Tulu 3	-0.006	[-0.016, +0.004]	-0.028	[-0.034, -0.021]	300
Race-crime	OLMo	-0.085	[-0.125, -0.045]	-0.041	[-0.057, -0.026]	300
Race-crime	Tulu 3	+0.056	[+0.037, +0.075]	-0.034	[-0.044, -0.025]	300
Age-technology	OLMo	-0.106	[-0.149, -0.066]	-0.090	[-0.108, -0.072]	300
Age-technology	Tulu 3	+0.005	[-0.018, +0.027]	-0.059	[-0.073, -0.044]	300

Cross-family pattern: OLMo shows consistently negative bias-score deltas across all social domains (toward less stereotyped association after DPO). Tulu 3 shows near-zero or mildly positive effects, with the race-crime domain being the only case where Tulu 3 shows a statistically significant positive shift (+0.056). Entropy deltas are consistently negative across both families and all domains, indicating distributional concentration (sharpening) as a general effect of the training pipeline. The magnitude of entropy reduction is largest in age-technology for both families (OLMo: -0.090; Tulu 3: -0.059).

Decision relevance

What this study shows

Original prereg v1.0 remains Contradicted (immutable) due to F3 neutral-control violation
Under prospective v2.0 adjudication with frozen neutral controls and independent-control mapping, results resolve to R1 (artifact-supported) on this dataset
BBQ-scoped corrective replication (V3.0) provided bounded support for R1 under conservative governance framing
The F3 episode demonstrates that token-pair neutral-control metrics are sensitive to function-word frequency shifts
Distribution-level metrics (entropy delta, JSD ratio) are more robust than single token-pair comparisons for neutral-control assessment
The falsification-first protocol with immutability preservation is a reusable methodological template

What this study does NOT show

No universal RLHF-bias mechanism claim: findings are scoped to the tested datasets and model families only
No broad external-validity claim: BBQ corrective results do not extend beyond that dataset
No causal pathway established: the study characterizes distributional shifts, not their underlying mechanism
No policy recommendation: provisional findings do not justify mitigation policy changes without broader replication
No generation-level validation: token-level probability comparisons may miss bias expressed through word choice, framing, or refusal patterns

Limitations

Immutable v1.0 contradiction

The original pre-registered analysis yielded a Contradicted verdict. All subsequent analyses (v1.2, v2.0, V3.0) are additive; they do not erase or overwrite this outcome. The contradiction is preserved in the permanent record.

Token-pair measurement sensitivity

The original neutral-control check used an arbitrary function-word pair (" the" / " a") as target tokens. While diagnostic investigation attributed the F3 firing to this choice, the study cannot definitively rule out that the neutral domain was partially confounded by model-specific token frequency distributions. The amended distributional metrics (entropy delta, JSD ratio) are more robust but remain dataset-specific.

Dataset scope

All findings are restricted to:

The custom 400-prompt pilot panel (3 social domains + 1 neutral control)
BBQ corrective cells (V3.0)
Two model families at 7-8B scale (OLMo 2, Tulu 3)

No claims are made about generalizability to other datasets, domains, model scales, or preference-tuning methods beyond DPO.

Pronoun-based bias measurement

The bias measurement relies on token-level probability comparisons (pronoun/association tokens). Models may express bias through word choice, framing, refusal patterns, or generation-level semantics not captured by this metric. This is an acknowledged limitation of the pilot design.

Generic confidence increase

Fine-tuning generically increases model confidence. The neutral-control design partially addresses whether observed shifts are bias-specific versus global confidence effects, but this disentanglement is not complete. The amended distributional controls improve but do not fully resolve this concern.

Bootstrap sample size

Local memory constraints required reducing bootstrap resamples from B = 10,000 to B = 1,000. Convergence diagnostics (supplement S9) show CI widths stabilize with mean relative change of 0.8% from B = 750 to B = 1,000, indicating B = 1,000 is adequate for this pilot.

BBQ corrective scope

V3.0 Stage-2/3 results are explicitly scoped to BBQ cells only. No broader external-validity claims are made from the corrective replacement.

No universal mechanism claims

Even under R1 resolution with BBQ-scoped corrective support across all three V3 stages, this study does not establish a universal mechanism for how preference tuning affects bias. Findings suggest artifact-sensitivity of the original neutral-control metric on these datasets; broader claims require independent replication on additional datasets, model families, and scale tiers.

Discussion

Summary of findings

This pre-registered pilot study tested whether preference-tuning bias shifts are predominantly explained by distributional sharpening (H1). The original analysis contradicted this hypothesis due to a neutral-control violation. Subsequent investigation attributed the violation to metric sensitivity (specifically, the choice of function-word target tokens in the neutral control) rather than genuine distributional bias introduction.

Prospective re-adjudication (v2.0) resolved to R1 (artifact-supported), and a BBQ-scoped corrective replication (V3.0) provided BBQ-scoped corrective support for R1, suggesting the original F3 trigger was sensitive to neutral-control metric choice. However, the v1.0 contradiction is preserved as immutable, and no hypothesis status has been upgraded to "supported" or "confirmed." All corrective Stage-2/3 conclusions are drawn exclusively from the replacement provenance artifacts and apply only to the BBQ dataset.

Implications for bias mitigation

The artifact-sensitivity finding has provisional implications: if preference tuning primarily sharpens pre-existing associations (as would be consistent with R1 under corrected controls), mitigation effort may be more productively directed at pre-training data curation rather than post-training objective design. However, this interpretation remains provisional and dataset-scoped (custom pilot panel and BBQ only); it does not justify policy changes without broader replication across additional benchmarks and model families.

Methodological contributions

This study demonstrates a falsification-first protocol for bias decomposition research:

Pre-registered hard triggers (F1-F5) with automated scoring
Transparent amendment history with immutability preservation
Multi-phase adjudication (v1.0 to v1.2 to v2.0 to V3.0) with explicit scope labeling

The F3 episode illustrates the value of neutral controls and the risk of metric-specific artifacts in bias measurement.

Limitations and future work

Key open questions:

Does R1 replicate on additional bias benchmarks (StereoSet, CrowS-Pairs, BOLD)?
Does the decomposition pattern hold at different model scales?
Can generation-level bias metrics (beyond token-level probabilities) support or challenge these findings?
Do other preference-tuning methods (PPO, KTO, constitutional AI) show similar decomposition profiles?

These questions are deferred to a planned V4 extension.

Conclusion

This pre-registered pilot found the sharpening-dominant hypothesis (H1) contradicted under the original v1.0 protocol due to a neutral-control violation, an outcome preserved as immutable. Diagnostic investigation and prospective re-adjudication (v2.0, V3.0) attributed the violation to token-pair metric sensitivity and resolved to R1 (artifact-supported) on the tested datasets, with BBQ-scoped corrective support across all three V3 stages. These findings are consistent with preference tuning sharpening pre-existing distributional associations rather than introducing novel bias pathways, but this interpretation is provisional, applies only to the custom pilot panel and BBQ benchmark at 7-8B scale (OLMo 2, Tulu 3), and does not establish a universal mechanism. The study contributes a falsification-first protocol for bias decomposition and demonstrates the sensitivity of token-level neutral controls to function-word measurement artifacts.

Integrity statement

All governance artifacts (frozen baselines, SHA256 provenance chains, alias audits, invalidation notices, and amendment histories) are preserved in the supplement (S1-S4) and the project repository. Zero freeze breaches, zero alias regressions, and zero mid-cycle metric changes were detected across all phases. The v1.0 contradiction is verified as preserved in v1.2, v2.0, and V3.0 reporting.

Pilot study: Distributional bias shifts across preference-tuning stages on BBQ and custom prompt panels