Pilot study: Distributional bias shifts across preference-tuning stages on BBQ and custom prompt panels

Pre-registered pilot with falsification-first protocol for RLHF bias decomposition

Abstract

This study asks a practical question: when a model moves from base to SFT to DPO, are bias shifts mostly an amplification of patterns that were already there, or are new patterns introduced late in training? We tested two model families (OLMo 2 7B and Tulu 3 8B) on a custom 400-prompt panel plus a BBQ corrective replication (72 true-inference cells). In the original pilot, one pre-declared failure rule was triggered by the neutral-control metric (effect = -0.182, 95% CI [-0.205, -0.161]). F3 is a pre-declared hard trigger: if the neutral-control domain shows |effect| > 0.10 with a CI excluding zero, the pilot fails regardless of social-domain results. Follow-up debugging showed the firing was likely caused by a measurement artifact from using function words ("the" vs "a") as targets: these are high-frequency function words whose probability ratio is sensitive to generic distributional shifts unrelated to social bias, so the result is not clear evidence of genuine bias introduction. After replacing that fragile control with stronger checks and running a bounded corrective replication on BBQ (the Bias Benchmark for QA; Parrish et al., 2022), which tests social bias across nine categories with ambiguous and disambiguated question contexts, results were consistent with the artifact explanation. We keep the original pilot failure visible in the record, and we scope all conclusions to these tested datasets and models only.

Scope caveat. All findings are restricted to the custom 400-prompt pilot panel and BBQ corrective cells, tested on two model families (OLMo 2 7B, Tulu 3 8B). No universal RLHF-bias mechanism claim and no broad external-validity claim beyond tested scope.

Why this matters

Teams often spend significant effort on alignment changes without clear evidence about which training stage is actually moving bias metrics. This page is meant to make that legible: what we tested, what failed, what was fixed, and what conclusions are justified versus still uncertain.

At a glance

Model families tested: 2 (OLMo 2 7B, Tulu 3 8B)
Hypothesis verdict: H1 contradicted (v1.0 immutable; R1 artifact-supported)
Falsification triggers: 1 of 5 fired (F3 neutral-control violation)
Corrective scope: BBQ only (72 true-inference cells)
Finding 1: v1.0 sharpening hypothesis contradicted (immutable)

In the original run, one hard failure rule fired in the neutral-control slice (effect = -0.182, 95% CI [-0.205, -0.161]). We keep that failed result visible in the record, then treat later reruns as additional evidence rather than rewriting history.

Finding 2: Token-pair artifact identified; distributional controls clear

The failure appears tied to a fragile metric choice: the neutral control used function words ("the" vs "a"), which are sensitive to generic probability shifts. When we switched to distribution-level checks, neutral behavior no longer looked anomalous (entropy delta |0.041| < 0.10; JSD ratio 0.848 < 2.0).

Finding 3: BBQ corrective replication is consistent with R1 within dataset scope

A corrective BBQ rerun (72 true-inference cells) produced results consistent with the artifact explanation. These corrective conclusions are intentionally limited to BBQ and tied to replacement run artifacts with full provenance.

What we did

We measured stage-by-stage changes (base → SFT → DPO), used pre-declared quality/failure checks, and then reran the invalidated parts using true inference on BBQ so final claims are tied to verifiable replacement artifacts.

What we found

The first pilot result failed a neutral-control check. Follow-up analysis indicates that failure was likely driven by a fragile metric choice, and the corrective BBQ rerun supports that interpretation within the tested scope.

What this does NOT show

This does not prove a universal RLHF-bias mechanism, and it does not claim broad external validity beyond the datasets/models tested here.

How to use this

Use this page as: (1) a worked example of stage-by-stage bias decomposition, and (2) a scoped empirical result for these models/datasets. Treat broader claims as hypotheses that still need independent replication.

Introduction

Motivation

Preference tuning methods such as RLHF and DPO have become standard practice for aligning large language models (LLMs) with human preferences. While these methods improve helpfulness and safety, their effect on distributional social biases remains underspecified. Prior work has documented bias in both pre-trained and aligned models, but the relative contribution of each training stage (base pre-training, supervised fine-tuning (SFT), and preference optimization) to observed bias shifts has not been systematically decomposed.

Understanding this decomposition has practical implications for mitigation strategy: if bias shifts are predominantly explained by sharpening of pre-existing associations, mitigation should focus on pre-training data curation; if non-sharpening pathways (reward model bias, label artifacts, objective-specific restructuring) contribute materially, post-training interventions become more critical.

Research questions

  1. Primary: What is the relative contribution of the base-to-SFT transition versus the SFT-to-preference-tuned transition to observed distributional concentration and directional bias shifts across social domains?
  2. Secondary: Do non-sharpening pathways explain a material component of observed shift? Does the relative contribution differ by model family?

Hypotheses

| ID | Statement | Pre-registered status |
| --- | --- | --- |
| H1 | Post-tuning bias shift is predominantly explained by concentration/sharpening of pre-existing associations. | Contradicted (v1.0, immutable) |
| H2 | Non-sharpening pathways explain a material component of shift. | Preliminary signal |
| H3 | Relative contribution differs by model family and scale tier. | Preliminary signal |

Scope statement

All findings reported here are scoped to the specific datasets (custom 400-prompt panel; BBQ corrective cells) and model families (OLMo 2, Tulu 3) tested. This is a pilot study and does not establish universal mechanisms of preference-tuning bias. Corrective Stage-2/3 conclusions from V3.0 apply exclusively to BBQ-scoped cells and do not extend beyond that dataset.

Related work

RLHF and bias behavior. Preference tuning with RLHF and DPO has been shown to shift model behavior along social dimensions, though the mechanism remains debated. Ouyang et al. (2022) documented that InstructGPT reduced some toxicity benchmarks while noting that bias effects were less consistent. Bai et al. (2022) observed that RLHF-trained models could exhibit different bias profiles than their base counterparts. Gallegos et al. (2024) surveyed bias in LLMs, noting that alignment procedures may redistribute rather than eliminate biases. The present study contributes a stage-wise decomposition perspective: rather than comparing base to final aligned model, we measure the incremental effect at each training transition.

Bias benchmarks. Our pilot panel covers three social domains (gender-occupation, race-crime, age-technology). The corrective replication uses BBQ (Parrish et al., 2022), a question-answering benchmark that tests social bias across nine categories with ambiguous and disambiguated contexts. Related benchmarks include StereoSet (Nadeem et al., 2021), which measures stereotypical associations through intrasentence and intersentence tasks, and CrowS-Pairs (Nangia et al., 2020), which provides paired sentences testing stereotypical and anti-stereotypical associations. Our token-level approach complements these generation-level benchmarks.

Token-level versus generation-level measurement. A key methodological tension in bias measurement is whether token-level probability comparisons (as used here) adequately capture model bias. Token-level metrics are efficient and interpretable but may miss bias expressed through word choice, framing, or refusal patterns (Blodgett et al., 2020). Generation-level metrics capture richer behavioral patterns but introduce confounds from decoding strategy and prompt sensitivity. Our pilot uses token-level probabilities as a tractable starting point, with the F3 neutral-control episode illustrating precisely the sensitivity of such metrics to target-token choice. This tension motivates our recommendation for generation-level replication in future work (see Limitations).

Methods

Pre-registration and amendment history

We ran this work in four phases. Key rule: we do not hide failed results. Later phases can add evidence, but they cannot erase what happened in earlier phases.

| Version | Scope | Key change | Status |
| --- | --- | --- | --- |
| v1.0 | Original pilot | 5 hard falsification triggers (F1-F5); neutral control via bias_score(the/a) | Contradicted (immutable) |
| v1.2 | Amended re-score (exploratory) | F3 replaced with distributional metrics (entropy delta + JSD ratio); post hoc, additive only | F3-amended CLEAR |
| v2.0 | Prospective re-adjudication | Predeclared Control B gate mapping; distribution-level neutral control frozen before re-scoring | R1 (artifact-supported) |
| V3.0 | External-validity replication (BBQ-only corrective) | 72 true-inference cells on BBQ under frozen governance baseline | R1 (BBQ-scoped corrective support) |

Practically: v1.0 failed one rule, and that stays on the record. v1.2, v2.0, and V3.0 are follow-up analyses that add context; they are not retroactive rewrites.

Notation and gate terms

R1 (artifact-supported). Resolution outcome indicating that the original v1.0 F3 contradiction was likely caused by neutral-control metric contamination rather than genuine distributional bias introduction. Declared when all three conditions hold: (1) the v2.0 distribution-level neutral control is CLEAR, (2) an independent neutral-control design (Control B) is CLEAR, and (3) social-domain effects remain present. R1 does not erase the v1.0 contradiction; it provides an interpretive frame for subsequent analyses.

Control B (representation-invariance gate). An independent neutral-control check predeclared in v2.0. For each aligned counter-pair (p, p_rev) on neutral prompts, define pair-level deltas d_p = mean(score_DPO - score_base) and d_p_rev for the reversed ordering. Control B passes (CLEAR) if and only if all three criteria are met:

  1. Parse success: all neutral prompts parse without error (threshold: 1.0).
  2. Coverage: per family x seed coverage >= 0.80 for each counter-pair.
  3. Symmetry residual: |d_p + d_p_rev| <= 0.02 (conservative tolerance for aggregation noise; not outcome-optimized).

Control B FIRED if any condition fails. This evaluates representation invariance (counterbalanced reversal consistency) rather than forcing a specific magnitude target.
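The gate logic is simple enough to state as code. A minimal sketch of the three Control B criteria as described above (function and variable names are ours, not the project's):

```python
def control_b_gate(pairs, parse_ok_rate, coverage_by_cell,
                   tol=0.02, min_coverage=0.80):
    """Evaluate the Control B representation-invariance gate.

    pairs: iterable of (d_p, d_p_rev) pair-level deltas on neutral prompts,
    where each delta is mean(score_DPO - score_base) for one ordering.
    coverage_by_cell: per family x seed coverage fractions per counter-pair.
    Returns 'CLEAR' only if all three predeclared criteria hold.
    """
    if parse_ok_rate < 1.0:                                  # criterion 1: parse success
        return 'FIRED'
    if any(c < min_coverage for c in coverage_by_cell):      # criterion 2: coverage >= 0.80
        return 'FIRED'
    if any(abs(d_p + d_rev) > tol for d_p, d_rev in pairs):  # criterion 3: symmetry residual
        return 'FIRED'
    return 'CLEAR'
```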

Other gate terms: CLEAR = control passes all specified criteria; FIRED = control fails one or more criteria.

Models

We tested two open-weight model families at roughly the same size (7-8B):

| Family | Base | SFT | DPO/Preference |
| --- | --- | --- | --- |
| OLMo 2 | allenai/OLMo-2-1124-7B (rev 7df9a825) | allenai/OLMo-2-1124-7B-SFT (rev 1de02c01) | allenai/OLMo-2-1124-7B-DPO (rev e34ea60a) |
| Tulu 3 | Tulu 3 base | Tulu 3 SFT | Tulu 3 DPO |

To keep runs comparable, model revisions were pinned at freeze time and naming was locked to olmo and tulu3.

Datasets

Pilot panel (v1.0-v2.0): custom 400-prompt panel spanning three social domains (gender-occupation, race-crime, age-technology) plus a neutral-control set of 100 synthetic prompts designed to carry no social content.

BBQ corrective (V3.0): 72 true-inference cells drawn from BBQ (Parrish et al., 2022), split across Stage-2 (18 cells) and Stage-3 (54 cells).

Primary endpoints

Jensen-Shannon Divergence (JSD): how much the output distribution for a prompt changed between two training stages.

Directional bias score: bias_score = P(stereotyped_token) - P(counter_stereotyped_token). Positive values mean movement toward the stereotyped association. We compare deltas stage-to-stage.

Contribution decomposition: we split the total base→DPO shift into stage contributions (base→SFT vs SFT→DPO). If the total shift is too small, we explicitly mark the split as non-identifiable instead of over-interpreting noise. An epsilon guard enforces this: below 0.01 JSD or 0.05 bias-score units of total shift, share attribution is marked non-identifiable.
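To make the endpoints concrete, here is a minimal sketch of the three quantities under the definitions above (illustrative code, not the project's implementation; the epsilon values follow the pre-declared guard):

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base-2) between two next-token distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log2(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def bias_score(p_stereo, p_counter):
    """Directional bias: positive = movement toward the stereotyped token."""
    return p_stereo - p_counter

def stage_shares(delta_base_sft, delta_sft_dpo, eps_total=0.05):
    """Split the total base->DPO shift into stage contributions.

    eps_total is the pre-declared guard (0.05 for bias-score units,
    0.01 for JSD). Returns None (non-identifiable) for tiny totals.
    """
    total = delta_base_sft + delta_sft_dpo
    if abs(total) < eps_total:
        return None
    return delta_base_sft / total, delta_sft_dpo / total
```

Note that stage shares can fall outside [0, 1] when the two stages move in opposite directions, as in the race-crime rows of Table 3 below.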

Mechanism operationalization

We used simple decision rules to separate two possibilities: (a) later training mostly amplifies existing patterns ("sharpening"), or (b) later training introduces meaningfully new bias behavior.

Sharpening (H1-consistent): later tuning strengthens patterns that were already present. We call it sharpening when every criterion in the decision-boundary table below falls in the H1-consistent column (neutral controls clear, no confound dominance, consistent cross-family direction).

Non-sharpening (H2-consistent): later tuning appears to add behavior not explained by earlier patterns. We call it non-sharpening when any criterion falls in the H2-consistent column.

Decision boundaries for H1/H2 adjudication:

| Criterion | Threshold | Metric | H1-consistent | H2-consistent |
| --- | --- | --- | --- | --- |
| Neutral-control bias | \|effect\| > 0.10, CI excludes 0 | bias_score delta | Below threshold | Above threshold |
| Neutral entropy disturbance | \|delta-entropy\| >= 0.10 | Entropy delta (DPO - base) | Below threshold | At or above threshold |
| Neutral JSD ratio | JSD_neutral / mean(JSD_social) >= 2.0 | JSD ratio | Below threshold | At or above threshold |
| Confound dominance | Stage coefficient attenuation >= 50% | OLS with covariates | Below threshold | At or above threshold |
| Cross-family replication | Sign disagreement or both CIs span zero | Directional consistency | Consistent signs, CIs exclude zero | Inconsistent or null |

These thresholds were set before final adjudication and then carried forward. If total shift is too small, we label the decomposition non-identifiable rather than forcing a misleading stage split.
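A sketch of the adjudication rules as pure threshold checks (names are illustrative; the confound and cross-family criteria are reduced here to values computed upstream):

```python
def adjudicate(neutral_bias, neutral_bias_ci, neutral_entropy_delta,
               jsd_ratio, attenuation, families_consistent):
    """Return the criteria that land in the H2-consistent column (empty => H1)."""
    ci_lo, ci_hi = neutral_bias_ci
    fired = []
    if abs(neutral_bias) > 0.10 and (ci_lo > 0 or ci_hi < 0):  # CI excludes zero
        fired.append('neutral-control bias')
    if abs(neutral_entropy_delta) >= 0.10:
        fired.append('neutral entropy disturbance')
    if jsd_ratio >= 2.0:
        fired.append('neutral JSD ratio')
    if attenuation >= 0.50:
        fired.append('confound dominance')
    if not families_consistent:
        fired.append('cross-family replication')
    return fired
```

On the v1.0 numbers (neutral bias -0.182, CI [-0.205, -0.161]; entropy delta -0.041; JSD ratio 0.848; attenuation -21.5%; consistent families), only the neutral-control criterion fires, matching the reported F3 outcome.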

Statistical framework

All effect estimates are reported with 95% bootstrap confidence intervals (B = 1,000 resamples; see Limitations and supplement S9 for convergence diagnostics).

Falsification framework (F1-F5)

We used five hard failure checks. If any one fired, the pilot was marked as failed for that cycle:

| Trigger | Definition | v1.0 result |
| --- | --- | --- |
| F1 | Prompt panel hash + decode config hash mismatch across stages | CLEAR |
| F2 | Cross-family sign flip in primary decode direction | CLEAR |
| F3 | Neutral-control directional bias: \|effect\| > 0.10 AND 95% CI excludes 0 | FIRED |
| F4 | Confound dominance: stage coefficient attenuation >= 50% when covariates added | CLEAR (-21.5%) |
| F5 | Cross-family replication failure: sign disagreement or both CIs span zero | CLEAR |

Confound audit

Before making mechanism claims, we checked whether confounds dominated the stage effect: trigger F4 fires if adding covariates to the stage regression attenuates the stage coefficient by 50% or more. The v1.0 audit returned CLEAN (attenuation -21.5%).

Artifact invalidation and replacement (BBQ-only scope)

During V3 execution, the original Stage-2 and Stage-3 scientific adjudication packets were found to contain placeholder launcher outputs rather than true GPU inference results. These artifacts were invalidated on 2026-02-21 and replaced by true Modal inference (A10G GPU) outputs with full SHA256 provenance. The replacement scope covers BBQ corrective cells only. Invalidated artifacts are preserved in the repository as historical record and are not cited as primary evidence. Full details are in the supplement (S2).
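Spot-checking a SHA256 provenance chain needs only the standard library. A minimal sketch, assuming a simple path-to-digest manifest format (the project's actual manifest layout may differ):

```python
import hashlib
import json

def verify_manifest(manifest_path):
    """Check each artifact's SHA256 against its recorded digest."""
    with open(manifest_path) as f:
        manifest = json.load(f)  # assumed format: {"relative/path": "sha256-hex"}
    for path, expected in manifest.items():
        with open(path, 'rb') as artifact:
            digest = hashlib.sha256(artifact.read()).hexdigest()
        print(('OK      ' if digest == expected else 'MISMATCH'), path)
```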

Results

Original pre-registered pilot (v1.0): contradicted

Overall verdict: Contradicted due to F3 neutral-control violation.

F3 fired with mean neutral-domain bias delta (DPO - base) = -0.182, 95% CI [-0.205, -0.161]. Both families showed consistent negative direction (Tulu 3: -0.172, n = 300; OLMo: -0.193, n = 300). The training pipeline introduced directional bias on prompts designed to carry no social content, violating the sharpening-only prediction.

All other falsification checks passed (F1, F2, F4, F5 CLEAR). Confound audit: CLEAN. Cross-family replication (F5): both families showed same-sign effects across all three seeds with CIs excluding zero.

This result is immutable and preserved across all subsequent analyses.

Neutral-control investigation: token-pair artifact and scientific record

The v1.0 neutral control measured bias via P("the") - P("a") on 100 synthetic prompts. The F3 trigger fired, contradicting H1. Subsequent diagnostic investigation revealed that this metric was conceptually unsuited for the intended construct: the tokens "the" and "a" are high-frequency function words with no bias-relevant meaning, and the base model already exhibited nonzero mean bias_score (+0.176) on these prompts. The DPO shift moved this arbitrary baseline toward zero, which the directional metric registered as a large effect.
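For illustration, the fragile v1.0 metric can be reproduced against any Hugging Face causal LM. A sketch assuming the score is taken at the prompt's final next-token position (prompt handling and token spellings are placeholders; the project's harness may differ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def the_vs_a_bias(model_name, prompt):
    """v1.0-style neutral control: P(" the") - P(" a") at the next token."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]        # logits at the final position
    probs = torch.softmax(logits, dim=-1)
    tid = lambda s: tok.encode(s, add_special_tokens=False)[0]
    return (probs[tid(" the")] - probs[tid(" a")]).item()
```

Because both targets are high-frequency function words, any generic reshaping of the next-token distribution moves this difference, which is exactly the failure mode diagnosed here.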

Distribution-level diagnostics showed the neutral domain exhibited no anomalous disturbance: pooled entropy delta -0.041 (below the 0.10 threshold) and JSD ratio 0.848 (below the 2.0 threshold), both CLEAR (Table 5).

Why the v1.0 result is preserved. The original neutral-control design was part of the pre-registered protocol, and its outcome is immutable regardless of subsequent diagnostic findings. Preserving contradicted results is standard scientific practice and serves as an anti-HARKing safeguard: if only post-hoc-favorable outcomes were retained, the research record would be systematically biased toward confirmation. The v1.0 contradiction thus serves a dual function: (1) it is a genuine finding about metric sensitivity in token-level bias measurement, and (2) it demonstrates the study's commitment to pre-registration integrity. All subsequent analyses (v1.2, v2.0, V3.0) are explicitly labeled as additive and do not modify the original verdict.

Figure 3 (below) visualizes the neutral-control artifact: the token-pair metric shows large effects while distribution-level metrics show the neutral domain behaving comparably to social domains.

Prospective v2.0 re-adjudication: R1 (artifact-supported)

Under prospective adjudication with frozen distributional neutral controls and explicit independent Control B gate mapping (see Notation and gate terms for definitions), all three R1 conditions held: the distribution-level neutral control was CLEAR, Control B was CLEAR, and social-domain effects remained present.

Resolution: R1 (artifact-supported).

V3 BBQ corrective replication: R1 (BBQ-scoped corrective support)

True-inference replication on BBQ under frozen governance baseline:

| Stage | Scope | Cells | Evidence source |
| --- | --- | --- | --- |
| Stage-1 | v2 existing replication | provenance-closed | stage1_scientific_adjudication_packet.json (valid) |
| Stage-2 | BBQ corrective | 18/18 complete | stage2_modal_true_inference_provenance.json (replacement) |
| Stage-3 | BBQ corrective | 54/54 complete | stage3_modal_true_inference_provenance.json (replacement) |

Cross-stage synthesis (BBQ scope only): R1 within BBQ-scoped cells across all three stages. Evidence for Stage-2/3 corrective conclusions is drawn exclusively from the replacement closure and provenance artifacts (see Artifact invalidation and replacement and supplement S2).

Stage-wise bias distributions by domain

Figure 1. Stage-wise JSD and bias-score distributions across social domains and model families. If designing stage-aware bias audits for multi-stage LLM pipelines, this suggests monitoring distributional shifts at each training transition, but note these distributions are specific to OLMo 2 and Tulu 3 at 7-8B scale under our 400-prompt panel.

Table 1. Mean bias-score delta (DPO - base) by domain and family (n = 300 per cell; 95% bootstrap CI). If prioritizing bias mitigation resources across domains, this suggests gender-occupation shows the largest cross-family effect, but note effects are dataset-specific and do not generalize beyond the tested prompt panel.

| Domain | Family | Mean delta-bias_score | 95% CI | \|Effect\| | CI excludes zero |
| --- | --- | --- | --- | --- | --- |
| Gender-occupation | OLMo | -0.129 | [-0.152, -0.107] | 0.129 | Yes |
| Gender-occupation | Tulu 3 | -0.006 | [-0.016, +0.004] | 0.006 | No |
| Gender-occupation | Pooled | -0.068 | [-0.081, -0.056] | 0.068 | Yes |
| Race-crime | OLMo | -0.085 | [-0.125, -0.045] | 0.085 | Yes |
| Race-crime | Tulu 3 | +0.056 | [+0.037, +0.075] | 0.056 | Yes |
| Race-crime | Pooled | -0.015 | [-0.038, +0.008] | 0.015 | No |
| Age-technology | OLMo | -0.106 | [-0.149, -0.066] | 0.106 | Yes |
| Age-technology | Tulu 3 | +0.005 | [-0.018, +0.027] | 0.005 | No |
| Age-technology | Pooled | -0.050 | [-0.075, -0.026] | 0.050 | Yes |

Table 2. Mean JSD by domain and stage transition (pooled across families and seeds; source: results/jsd_results.csv). If evaluating which training transition introduces the most distributional change, this suggests the SFT-to-DPO step dominates for gender-occupation while base-to-SFT dominates for other domains, but note JSD magnitude depends on the specific token targets measured.

| Domain | JSD (base-to-SFT) | JSD (SFT-to-DPO) | JSD (base-to-DPO) |
| --- | --- | --- | --- |
| Gender-occupation | 0.011 | 0.021 | 0.034 |
| Race-crime | 0.009 | 0.002 | 0.013 |
| Age-technology | 0.012 | 0.002 | 0.017 |
| Neutral | 0.010 | 0.003 | 0.018 |

Gender-occupation shows the largest cumulative JSD shift (0.034), driven primarily by the SFT-to-DPO transition (0.021, 62% of total). Race-crime and age-technology show smaller cumulative shifts (0.013, 0.017) with the base-to-SFT transition contributing the majority. The neutral domain (0.018) is comparable to social domains, consistent with the distribution-level findings above.

JSD decomposition across stages

Figure 2. Stage contribution decomposition: SFT share vs. DPO share of total base-to-DPO shift. If allocating mitigation effort between pre-training curation and post-training design, this suggests the relative contribution varies by domain and metric, but note decomposition shares are sensitive to epsilon-guard thresholds and non-identifiable cells.

Table 3. Contribution decomposition by domain, family, and metric (bootstrap point estimates with 95% CI). If interpreting stage-wise responsibility for bias shifts, this suggests neither stage universally dominates, but note Tulu 3 gender-occupation and age-technology are non-identifiable due to small total shift magnitude.

| Domain | Family | Metric | SFT share | SFT 95% CI | DPO share | DPO 95% CI | Identifiable |
| --- | --- | --- | --- | --- | --- | --- |
| Gender-occ | OLMo | bias_score | 0.34 | [0.20, 0.43] | 0.66 | [0.57, 0.80] | Yes |
| Gender-occ | OLMo | entropy | 0.49 | [0.21, 0.72] | 0.51 | [0.28, 0.79] | Yes |
| Gender-occ | Tulu 3 | bias_score | - | - | - | - | No (70% non-id) |
| Gender-occ | Tulu 3 | entropy | 0.37 | [0.09, 0.61] | 0.63 | [0.39, 0.91] | Yes |
| Race-crime | OLMo | bias_score | -0.14 | [-0.74, 0.25] | 1.14 | [0.75, 1.74] | Yes |
| Race-crime | OLMo | entropy | 0.16 | [-0.32, 0.43] | 0.84 | [0.57, 1.32] | Yes |
| Race-crime | Tulu 3 | bias_score | 1.23 | [1.01, 1.48] | -0.23 | [-0.48, -0.01] | Yes |
| Race-crime | Tulu 3 | entropy | 0.56 | [0.38, 0.69] | 0.44 | [0.31, 0.62] | Yes |
| Age-tech | OLMo | bias_score | 0.78 | [0.67, 0.88] | 0.22 | [0.12, 0.33] | Yes |
| Age-tech | OLMo | entropy | 0.63 | [0.53, 0.71] | 0.37 | [0.29, 0.47] | Yes |
| Age-tech | Tulu 3 | bias_score | - | - | - | - | No |
| Age-tech | Tulu 3 | entropy | 0.29 | [0.01, 0.48] | 0.71 | [0.52, 0.99] | Yes |
| Pooled | All | bias_score | 0.19 | [0.06, 0.30] | 0.81 | [0.70, 0.94] | Yes |
| Pooled | All | entropy | 0.43 | [0.35, 0.50] | 0.57 | [0.50, 0.65] | Yes |

Neutral-control artifact visualization

Figure 3. Neutral-control artifact: token-pair metric vs. distribution-level metrics. The token-pair metric (bias_score) shows large effects on the neutral domain while entropy and JSD metrics show the neutral domain behaving comparably to social domains. If designing neutral controls for token-level bias studies, this suggests distribution-level metrics (entropy, JSD) are more robust than single token-pair comparisons, but note this finding is specific to the "the"/"a" pair tested.

Table 5. Neutral-domain metrics comparison: token-pair (bias_score) vs. distributional (entropy, JSD). If selecting neutral-control metrics for bias measurement, this suggests token-pair metrics may register artifacts from function-word frequency shifts, but note distributional metrics have their own sensitivity to vocabulary coverage.

| Metric | Family | Value | Threshold | Status |
| --- | --- | --- | --- | --- |
| bias_score delta | Tulu 3 | -0.172 | \|effect\| > 0.10 | FIRED |
| bias_score delta | OLMo | -0.193 | \|effect\| > 0.10 | FIRED |
| bias_score delta | Pooled | -0.182 | \|effect\| > 0.10 | FIRED |
| Entropy delta | Tulu 3 | -0.041 | \|delta\| >= 0.10 | CLEAR |
| Entropy delta | OLMo | -0.042 | \|delta\| >= 0.10 | CLEAR |
| Entropy delta | Pooled | -0.041 | \|delta\| >= 0.10 | CLEAR |
| JSD ratio | Pooled | 0.848 | ratio >= 2.0 | CLEAR |
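The two distributional gates in Table 5 reduce to simple per-prompt quantities; a minimal sketch (naming ours):

```python
import numpy as np

def entropy_bits(p, eps=1e-12):
    """Shannon entropy (bits) of a next-token distribution."""
    p = np.asarray(p, dtype=float) + eps
    p = p / p.sum()
    return float(-np.sum(p * np.log2(p)))

def entropy_delta(p_dpo, p_base):
    """Gate input: delta-entropy = H(DPO) - H(base); CLEAR if |delta| < 0.10."""
    return entropy_bits(p_dpo) - entropy_bits(p_base)

def jsd_ratio(jsd_neutral, jsd_social):
    """Gate input: JSD_neutral / mean(JSD_social); CLEAR if ratio < 2.0."""
    return jsd_neutral / float(np.mean(jsd_social))
```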

Table 6. Cross-domain entropy delta comparison (pooled). If assessing whether the neutral domain behaves anomalously relative to social domains, this suggests the neutral entropy delta falls within the range observed for social domains, but note entropy is a coarse measure of distributional change.

| Domain | Mean delta-entropy | 95% CI | \|Effect\| |
| --- | --- | --- | --- |
| Neutral | -0.041 | [-0.054, -0.029] | 0.041 |
| Gender-occupation | -0.045 | [-0.055, -0.037] | 0.045 |
| Race-crime | -0.038 | [-0.047, -0.028] | 0.038 |
| Age-technology | -0.074 | [-0.086, -0.063] | 0.074 |

Claim block

Original prereg v1.0 remains Contradicted (immutable). Under prospective v2.0 adjudication with frozen neutral controls and explicit independent-control mapping, results resolve to R1 (artifact-supported) on this dataset. A BBQ-scoped corrective replication (V3.0) provided BBQ-scoped corrective support for R1 under conservative governance framing, suggesting the original F3 trigger was sensitive to neutral-control metric choice. This does not erase v1.0 history, does not constitute independent confirmation of a universal mechanism, and remains scoped to the tested datasets and model families.

Plain-language summary of claims

In plain terms: we set out to test whether the bias patterns observed after preference tuning were mostly an amplification of biases already present in the base model. Our initial analysis said "no" (the neutral control showed unexpected bias, which should not happen if only pre-existing biases were being amplified). However, we then discovered that the specific way we measured the neutral control (comparing probabilities of "the" versus "a") was picking up an irrelevant signal rather than genuine bias introduction. When we used broader distributional measures instead, the neutral control behaved normally. A follow-up replication on the BBQ benchmark supported this interpretation. The bottom line is: on our specific test datasets and models, the evidence is consistent with sharpening of pre-existing patterns rather than creation of new biases, but this conclusion is provisional, applies only to what we tested, and the original "no" result is permanently recorded as part of the scientific record.

Social-domain effect sizes

Table 4. Social-domain effect sizes: bias-score and entropy deltas by domain and family. If comparing bias shift magnitude across social domains to prioritize auditing effort, this suggests age-technology shows the largest entropy reduction while gender-occupation shows the largest directional shift for OLMo, but note these are pilot-scale observations (n = 300 per cell) on a single prompt panel.

| Domain | Family | delta-bias_score | bias 95% CI | delta-entropy | entropy 95% CI | n |
| --- | --- | --- | --- | --- | --- | --- |
| Gender-occupation | OLMo | -0.129 | [-0.152, -0.107] | -0.063 | [-0.079, -0.048] | 300 |
| Gender-occupation | Tulu 3 | -0.006 | [-0.016, +0.004] | -0.028 | [-0.034, -0.021] | 300 |
| Race-crime | OLMo | -0.085 | [-0.125, -0.045] | -0.041 | [-0.057, -0.026] | 300 |
| Race-crime | Tulu 3 | +0.056 | [+0.037, +0.075] | -0.034 | [-0.044, -0.025] | 300 |
| Age-technology | OLMo | -0.106 | [-0.149, -0.066] | -0.090 | [-0.108, -0.072] | 300 |
| Age-technology | Tulu 3 | +0.005 | [-0.018, +0.027] | -0.059 | [-0.073, -0.044] | 300 |

Cross-family pattern: OLMo shows consistently negative bias-score deltas across all social domains (toward less stereotyped association after DPO). Tulu 3 shows near-zero or mildly positive effects, with the race-crime domain being the only case where Tulu 3 shows a statistically significant positive shift (+0.056). Entropy deltas are consistently negative across both families and all domains, indicating distributional concentration (sharpening) as a general effect of the training pipeline. The magnitude of entropy reduction is largest in age-technology for both families (OLMo: -0.090; Tulu 3: -0.059).

Decision relevance

What this study shows

Within the tested datasets and models: stage-by-stage bias decomposition is feasible; the original neutral-control failure is best explained by a token-pair measurement artifact; and the BBQ corrective rerun is consistent with that interpretation.

What this study does NOT show

A universal RLHF-bias mechanism, or external validity beyond the tested datasets and model families.

Limitations

Immutable v1.0 contradiction

The original pre-registered analysis yielded a Contradicted verdict. All subsequent analyses (v1.2, v2.0, V3.0) are additive; they do not erase or overwrite this outcome. The contradiction is preserved in the permanent record.

Token-pair measurement sensitivity

The original neutral-control check used an arbitrary function-word pair (" the" / " a") as target tokens. While diagnostic investigation attributed the F3 firing to this choice, the study cannot definitively rule out that the neutral domain was partially confounded by model-specific token frequency distributions. The amended distributional metrics (entropy delta, JSD ratio) are more robust but remain dataset-specific.

Dataset scope

All findings are restricted to:

  1. The custom 400-prompt pilot panel (three social domains plus the neutral control).
  2. The BBQ corrective cells (72 true-inference cells).
  3. The OLMo 2 7B and Tulu 3 8B model families at 7-8B scale.

No claims are made about generalizability to other datasets, domains, model scales, or preference-tuning methods beyond DPO.

Pronoun-based bias measurement

The bias measurement relies on token-level probability comparisons (pronoun/association tokens). Models may express bias through word choice, framing, refusal patterns, or generation-level semantics not captured by this metric. This is an acknowledged limitation of the pilot design.

Generic confidence increase

Fine-tuning generically increases model confidence. The neutral-control design partially addresses whether observed shifts are bias-specific versus global confidence effects, but this disentanglement is not complete. The amended distributional controls improve but do not fully resolve this concern.

Bootstrap sample size

Local memory constraints required reducing bootstrap resamples from B = 10,000 to B = 1,000. Convergence diagnostics (supplement S9) show CI widths stabilize with mean relative change of 0.8% from B = 750 to B = 1,000, indicating B = 1,000 is adequate for this pilot.
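A sketch of this convergence check, comparing percentile-CI widths at increasing resample counts (synthetic data stands in for the per-prompt deltas; S9's exact statistic may differ):

```python
import numpy as np

def ci_width(x, n_boot, rng, alpha=0.05):
    """Width of the percentile bootstrap CI of the mean."""
    means = np.array([rng.choice(x, size=len(x), replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return hi - lo

rng = np.random.default_rng(0)
x = rng.normal(size=300)          # synthetic stand-in for per-prompt deltas
w750 = ci_width(x, 750, rng)
w1000 = ci_width(x, 1000, rng)
print(f"relative CI-width change: {abs(w1000 - w750) / w750:.1%}")
```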

BBQ corrective scope

V3.0 Stage-2/3 results are explicitly scoped to BBQ cells only. No broader external-validity claims are made from the corrective replacement.

No universal mechanism claims

Even under R1 resolution with BBQ-scoped corrective support across all three V3 stages, this study does not establish a universal mechanism for how preference tuning affects bias. Findings suggest artifact-sensitivity of the original neutral-control metric on these datasets; broader claims require independent replication on additional datasets, model families, and scale tiers.

Discussion

Summary of findings

This pre-registered pilot study tested whether preference-tuning bias shifts are predominantly explained by distributional sharpening (H1). The original analysis contradicted this hypothesis due to a neutral-control violation. Subsequent investigation attributed the violation to metric sensitivity (specifically, the choice of function-word target tokens in the neutral control) rather than genuine distributional bias introduction.

Prospective re-adjudication (v2.0) resolved to R1 (artifact-supported), and a BBQ-scoped corrective replication (V3.0) provided BBQ-scoped corrective support for R1, suggesting the original F3 trigger was sensitive to neutral-control metric choice. However, the v1.0 contradiction is preserved as immutable, and no hypothesis status has been upgraded to "supported" or "confirmed." All corrective Stage-2/3 conclusions are drawn exclusively from the replacement provenance artifacts and apply only to the BBQ dataset.

Implications for bias mitigation

The artifact-sensitivity finding has provisional implications: if preference tuning primarily sharpens pre-existing associations (as would be consistent with R1 under corrected controls), mitigation effort may be more productively directed at pre-training data curation rather than post-training objective design. However, this interpretation remains provisional and dataset-scoped (custom pilot panel and BBQ only); it does not justify policy changes without broader replication across additional benchmarks and model families.

Methodological contributions

This study demonstrates a falsification-first protocol for bias decomposition research:

  1. Pre-declared hard falsification triggers (F1-F5) adjudicated before interpretation.
  2. Immutable verdicts: contradicted results are preserved, and amendments are additive only.
  3. Controls frozen before re-scoring (v2.0) and replication scope declared in advance (V3.0).
  4. Full SHA256 provenance for all replacement artifacts.

The F3 episode illustrates the value of neutral controls and the risk of metric-specific artifacts in bias measurement.

Limitations and future work

Key open questions:

  1. Does R1 replicate on additional bias benchmarks (StereoSet, CrowS-Pairs, BOLD)?
  2. Does the decomposition pattern hold at different model scales?
  3. Can generation-level bias metrics (beyond token-level probabilities) support or challenge these findings?
  4. Do other preference-tuning methods (PPO, KTO, constitutional AI) show similar decomposition profiles?

These questions are deferred to a planned V4 extension.

Conclusion

This pre-registered pilot found the sharpening-dominant hypothesis (H1) contradicted under the original v1.0 protocol due to a neutral-control violation, an outcome preserved as immutable. Diagnostic investigation and prospective re-adjudication (v2.0, V3.0) attributed the violation to token-pair metric sensitivity and resolved to R1 (artifact-supported) on the tested datasets, with BBQ-scoped corrective support across all three V3 stages. These findings are consistent with preference tuning sharpening pre-existing distributional associations rather than introducing novel bias pathways, but this interpretation is provisional, applies only to the custom pilot panel and BBQ benchmark at 7-8B scale (OLMo 2, Tulu 3), and does not establish a universal mechanism. The study contributes a falsification-first protocol for bias decomposition and demonstrates the sensitivity of token-level neutral controls to function-word measurement artifacts.

Integrity statement

All governance artifacts (frozen baselines, SHA256 provenance chains, alias audits, invalidation notices, and amendment histories) are preserved in the supplement (S1-S4) and the project repository. Zero freeze breaches, zero alias regressions, and zero mid-cycle metric changes were detected across all phases. The v1.0 contradiction is verified as preserved in v1.2, v2.0, and V3.0 reporting.

Acknowledgments

This work was conducted as an independent research study. V3 BBQ corrective inference was executed on Modal A10G GPUs. We thank the open-source teams behind OLMo, Tulu, and BBQ.

Reproducibility

Code, frozen artifacts, and configuration files are available at: https://github.com/Sohailm25/rlhf-entropy-pilot
Direct paths: paper/, figures/, results/

Reproduction details:

References

Bai, Y., et al. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.

Blodgett, S. L., Barocas, S., Daumé III, H., & Wallach, H. (2020). Language (technology) is power: A critical survey of bias in NLP. Proceedings of ACL.

Gallegos, I. O., et al. (2024). Bias and fairness in large language models: A survey. Computational Linguistics, 50(4).

Nadeem, M., Bethke, A., & Reddy, S. (2021). StereoSet: Measuring stereotypical bias in pretrained language models. Proceedings of ACL.

Nangia, N., Vania, C., Bhalerao, R., & Bowman, S. R. (2020). CrowS-Pairs: A challenge dataset for measuring social biases in masked language models. Proceedings of EMNLP.

Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35.

Parrish, A., et al. (2022). BBQ: A hand-built bias benchmark for question answering. Findings of ACL.