Pre-registered pilot with falsification-first protocol for RLHF bias decomposition
This study asks a practical question: when a model moves from base to SFT to DPO, are bias shifts mostly an amplification of patterns that were already there, or are new patterns introduced late in training? We tested two model families (OLMo 2 7B and Tulu 3 8B) on a custom 400-prompt panel plus a BBQ corrective replication (72 true-inference cells). In the original pilot, one pre-declared failure rule was triggered by the neutral-control metric (effect = -0.182, 95% CI [-0.205, -0.161]).F3 is a pre-declared hard trigger: if the neutral-control domain shows |effect| > 0.10 with CI excluding zero, the pilot fails regardless of social-domain results. Follow-up debugging showed this was likely caused by a measurement artifact from using function words ("the" vs "a") as targets,The tokens "the" and "a" are high-frequency function words; their probability ratio is sensitive to generic distributional shifts unrelated to social bias. not clear evidence of genuine bias introduction. After replacing that fragile control with stronger checks and running a bounded BBQ corrective replication,BBQ (Bias Benchmark for QA) by Parrish et al., 2022, tests social bias across nine categories with ambiguous and disambiguated question contexts. results were consistent with the artifact explanation. We keep the original pilot failure visible in the record, and we scope all conclusions to these tested datasets and models only.
Teams often spend significant effort on alignment changes without clear evidence about which training stage is actually moving bias metrics. This page is meant to make that legible: what we tested, what failed, what was fixed, and what conclusions are justified versus still uncertain.
We measured stage-by-stage changes (base → SFT → DPO), used pre-declared quality/failure checks, and then reran the invalidated parts using true inference on BBQ so final claims are tied to verifiable replacement artifacts.
The first pilot result failed a neutral-control check. Follow-up analysis indicates that failure was likely driven by a fragile metric choice, and the corrective BBQ rerun supports that interpretation within the tested scope.
This does not prove a universal RLHF-bias mechanism, and it does not claim broad external validity beyond the datasets/models tested here.
Use this page as: (1) a worked example of stage-by-stage bias decomposition, and (2) a scoped empirical result for these models/datasets. Treat broader claims as hypotheses that still need independent replication.
Preference tuning methods such as RLHF and DPO have become standard practice for aligning large language models (LLMs) with human preferences. While these methods improve helpfulness and safety, their effect on distributional social biases remains underspecified. Prior work has documented bias in both pre-trained and aligned models, but the relative contribution of each training stage (base pre-training, supervised fine-tuning (SFT), and preference optimization) to observed bias shifts has not been systematically decomposed.
Understanding this decomposition has practical implications for mitigation strategy: if bias shifts are predominantly explained by sharpening of pre-existing associations, mitigation should focus on pre-training data curation; if non-sharpening pathways (reward model bias, label artifacts, objective-specific restructuring) contribute materially, post-training interventions become more critical.
| ID | Statement | Pre-registered status |
|---|---|---|
| H1 | Post-tuning bias shift is predominantly explained by concentration/sharpening of pre-existing associations. | Contradicted (v1.0, immutable) |
| H2 | Non-sharpening pathways explain a material component of shift. | Preliminary signal |
| H3 | Relative contribution differs by model family and scale tier. | Preliminary signal |
All findings reported here are scoped to the specific datasets (custom 400-prompt panel; BBQ corrective cells) and model families (OLMo 2, Tulu 3) tested. This is a pilot study and does not establish universal mechanisms of preference-tuning bias. Corrective Stage-2/3 conclusions from V3.0 apply exclusively to BBQ-scoped cells and do not extend beyond that dataset.
RLHF and bias behavior. Preference tuning with RLHF and DPO has been shown to shift model behavior along social dimensions, though the mechanism remains debated. Ouyang et al. (2022) documented that InstructGPT reduced some toxicity benchmarks while noting that bias effects were less consistent. Bai et al. (2022) observed that RLHF-trained models could exhibit different bias profiles than their base counterparts. Gallegos et al. (2024) surveyed bias in LLMs, noting that alignment procedures may redistribute rather than eliminate biases. The present study contributes a stage-wise decomposition perspective: rather than comparing base to final aligned model, we measure the incremental effect at each training transition.
Bias benchmarks. Our pilot panel covers three social domains (gender-occupation, race-crime, age-technology). The corrective replication uses BBQ (Parrish et al., 2022), a question-answering benchmark that tests social bias across nine categories with ambiguous and disambiguated contexts. Related benchmarks include StereoSet (Nadeem et al., 2021), which measures stereotypical associations through intrasentence and intersentence tasks, and CrowS-Pairs (Nangia et al., 2020), which provides paired sentences testing stereotypical and anti-stereotypical associations. Our token-level approach complements these generation-level benchmarks.
Token-level versus generation-level measurement. A key methodological tension in bias measurement is whether token-level probability comparisons (as used here) adequately capture model bias. Token-level metrics are efficient and interpretable but may miss bias expressed through word choice, framing, or refusal patterns (Blodgett et al., 2020). Generation-level metrics capture richer behavioral patterns but introduce confounds from decoding strategy and prompt sensitivity. Our pilot uses token-level probabilities as a tractable starting point, with the F3 neutral-control episode illustrating precisely the sensitivity of such metrics to target-token choice. This tension motivates our recommendation for generation-level replication in future work (see Limitations).
We ran this work in four phases. Key rule: we do not hide failed results. Later phases can add evidence, but they cannot erase what happened in earlier phases.
| Version | Scope | Key change | Status |
|---|---|---|---|
| v1.0 | Original pilot | 5 hard falsification triggers (F1-F5); neutral control via bias_score(the/a) |
Contradicted (immutable) |
| v1.2 | Amended re-score (exploratory) | F3 replaced with distributional metrics (entropy delta + JSD ratio); post hoc, additive only | F3-amended CLEAR |
| v2.0 | Prospective re-adjudication | Predeclared Control B gate mapping; distribution-level neutral control frozen before re-scoring | R1 (artifact-supported) |
| V3.0 | External-validity replication (BBQ-only corrective) | 72 true-inference cells on BBQ under frozen governance baseline | R1 (BBQ-scoped corrective support) |
Practically: v1.0 failed one rule, and that stays on the record. v1.2, v2.0, and V3.0 are follow-up analyses that add context; they are not retroactive rewrites.
R1 (artifact-supported). Resolution outcome indicating that the original v1.0 F3 contradiction was likely caused by neutral-control metric contamination rather than genuine distributional bias introduction. Declared when all three conditions hold: (1) the v2.0 distribution-level neutral control is CLEAR, (2) an independent neutral-control design (Control B) is CLEAR, and (3) social-domain effects remain present. R1 does not erase the v1.0 contradiction; it provides an interpretive frame for subsequent analyses.
Control B (representation-invariance gate). An independent neutral-control check predeclared in v2.0. For each aligned counter-pair (p, p_rev) on neutral prompts, define pair-level deltas d_p = mean(score_DPO - score_base) and d_p_rev for the reversed ordering. Control B passes (CLEAR) if and only if all three criteria are met:
- Parse success: all neutral prompts parse without error (threshold: 1.0).
- Coverage: per family x seed coverage >= 0.80 for each counter-pair.
- Symmetry residual: |d_p + d_p_rev| <= 0.02 (conservative tolerance for aggregation noise; not outcome-optimized).
Control B FIRED if any condition fails. This evaluates representation invariance (counterbalanced reversal consistency) rather than forcing a specific magnitude target.
Other gate terms: CLEAR = control passes all specified criteria; FIRED = control fails one or more criteria.
We tested two open-weight model families at roughly the same size (7-8B):
| Family | Base | SFT | DPO/Preference |
|---|---|---|---|
| OLMo 2 | allenai/OLMo-2-1124-7B (rev 7df9a825) |
allenai/OLMo-2-1124-7B-SFT (rev 1de02c01) |
allenai/OLMo-2-1124-7B-DPO (rev e34ea60a) |
| Tulu 3 | Tulu 3 base | Tulu 3 SFT | Tulu 3 DPO |
To keep runs comparable, model revisions were pinned at freeze time and naming was locked to olmo and tulu3.
Pilot panel (v1.0-v2.0):
" the" / " a")BBQ corrective (V3.0):
Jensen-Shannon Divergence (JSD): how much output distributions changed between stages for a prompt.
Directional bias score: bias_score = P(stereotyped_token) - P(counter_stereotyped_token). Positive values mean movement toward the stereotyped association. We compare deltas stage-to-stage.
Contribution decomposition: we split the total base→DPO shift into stage contributions (base→SFT vs SFT→DPO). If total shift is too small, we explicitly mark the split as non-identifiable instead of over-interpreting noise.The epsilon guard prevents spurious decomposition claims when total shift is too small: below 0.01 JSD or 0.05 bias units, share attribution is marked non-identifiable.
We used simple decision rules to separate two possibilities: (a) later training mostly amplifies existing patterns (“sharpening”), or (b) later training introduces meaningfully new bias behavior.
Sharpening (H1-consistent): later tuning strengthens patterns that were already present. We call it sharpening when:
Non-sharpening (H2-consistent): later tuning appears to add behavior not explained by earlier patterns. We call it non-sharpening when:
Decision boundaries for H1/H2 adjudication:
| Criterion | Threshold | Metric | H1-consistent | H2-consistent |
|---|---|---|---|---|
| Neutral-control bias | |effect| > 0.10, CI excludes 0 | bias_score delta |
Below threshold | Above threshold |
| Neutral entropy disturbance | |delta-entropy| >= 0.10 | Entropy delta (DPO - base) | Below threshold | At or above threshold |
| Neutral JSD ratio | JSD_neutral / mean(JSD_social) >= 2.0 | JSD ratio | Below threshold | At or above threshold |
| Confound dominance | Stage coefficient attenuation >= 50% | OLS with covariates | Below threshold | At or above threshold |
| Cross-family replication | Sign disagreement or both CIs span zero | Directional consistency | Consistent signs, CIs exclude zero | Inconsistent or null |
These thresholds were set before final adjudication and then carried forward. If total shift is too small, we label the decomposition non-identifiable rather than forcing a misleading stage split.
decomposition_share ~ domain + model_family (OLS on bootstrap samples)We used five hard failure checks. If any one fired, the pilot was marked as failed for that cycle:
| Trigger | Definition | v1.0 result |
|---|---|---|
| F1 | Prompt panel hash + decode config hash mismatch across stages | CLEAR |
| F2 | Cross-family sign flip in primary decode direction | CLEAR |
| F3 | Neutral-control directional bias: |effect| > 0.10 AND 95% CI excludes 0 | FIRED |
| F4 | Confound dominance: stage coefficient attenuation >= 50% when covariates added | CLEAR (-21.5%) |
| F5 | Cross-family replication failure: sign disagreement or both CIs span zero | CLEAR |
Before making mechanism claims, we checked common confounds:
During V3 execution, the original Stage-2 and Stage-3 scientific adjudication packets were found to contain placeholder launcher outputs rather than true GPU inference results. These artifacts were invalidated on 2026-02-21 and replaced by true Modal inference (A10G GPU) outputs with full SHA256 provenance. The replacement scope covers BBQ corrective cells only. Invalidated artifacts are preserved in the repository as historical record and are not cited as primary evidence. Full details are in the supplement (S2).
Overall verdict: Contradicted due to F3 neutral-control violation.
F3 fired with mean neutral-domain bias delta (DPO - base) = -0.182, 95% CI [-0.205, -0.161]. Both families showed consistent negative direction (Tulu 3: -0.172, n = 300; OLMo: -0.193, n = 300). The training pipeline introduced directional bias on prompts designed to carry no social content, violating the sharpening-only prediction.
All other falsification checks passed (F1, F2, F4, F5 CLEAR). Confound audit: CLEAN. Cross-family replication (F5): both families showed same-sign effects across all three seeds with CIs excluding zero.
This result is immutable and preserved across all subsequent analyses.
The v1.0 neutral control measured bias via P("the") - P("a") on 100 synthetic prompts. The F3 trigger fired, contradicting H1. Subsequent diagnostic investigation revealed that this metric was conceptually unsuited for the intended construct: the tokens "the" and "a" are high-frequency function words with no bias-relevant meaning, and the base model already exhibited nonzero mean bias_score (+0.176) on these prompts. The DPO shift moved this arbitrary baseline toward zero, which the directional metric registered as a large effect.
Distribution-level diagnostics showed the neutral domain exhibited no anomalous disturbance:
Why the v1.0 result is preserved. The original neutral-control design was part of the pre-registered protocol, and its outcome is immutable regardless of subsequent diagnostic findings. Preserving contradicted results is standard scientific practice and serves as an anti-HARKing safeguard: if only post-hoc-favorable outcomes were retained, the research record would be systematically biased toward confirmation. The v1.0 contradiction thus serves a dual function: (1) it is a genuine finding about metric sensitivity in token-level bias measurement, and (2) it demonstrates the study's commitment to pre-registration integrity. All subsequent analyses (v1.2, v2.0, V3.0) are explicitly labeled as additive and do not modify the original verdict.
Figure 3 (below) visualizes the neutral-control artifact: the token-pair metric shows large effects while distribution-level metrics show the neutral domain behaving comparably to social domains.
Under prospective adjudication with frozen distributional neutral controls and explicit independent Control B gate mapping (see Notation and gate terms for definitions):
Resolution: R1 (artifact-supported).
True-inference replication on BBQ under frozen governance baseline:
| Stage | Scope | Cells | Evidence source |
|---|---|---|---|
| Stage-1 | v2 existing replication | provenance-closed | stage1_scientific_adjudication_packet.json (valid) |
| Stage-2 | BBQ corrective | 18/18 complete | stage2_modal_true_inference_provenance.json (replacement) |
| Stage-3 | BBQ corrective | 54/54 complete | stage3_modal_true_inference_provenance.json (replacement) |
Cross-stage synthesis (BBQ scope only): R1 within BBQ-scoped cells across all three stages. Evidence for Stage-2/3 corrective conclusions is drawn exclusively from the replacement closure and provenance artifacts (see Artifact invalidation and replacement and supplement S2).
Table 1. Mean bias-score delta (DPO - base) by domain and family (n = 300 per cell; 95% bootstrap CI). If prioritizing bias mitigation resources across domains, this suggests gender-occupation shows the largest cross-family effect, but note effects are dataset-specific and do not generalize beyond the tested prompt panel.
| Domain | Family | Mean delta-bias_score | 95% CI | |Effect| | CI excludes zero |
|---|---|---|---|---|---|
| Gender-occupation | OLMo | -0.129 | [-0.152, -0.107] | 0.129 | Yes |
| Gender-occupation | Tulu 3 | -0.006 | [-0.016, +0.004] | 0.006 | No |
| Gender-occupation | Pooled | -0.068 | [-0.081, -0.056] | 0.068 | Yes |
| Race-crime | OLMo | -0.085 | [-0.125, -0.045] | 0.085 | Yes |
| Race-crime | Tulu 3 | +0.056 | [+0.037, +0.075] | 0.056 | Yes |
| Race-crime | Pooled | -0.015 | [-0.038, +0.008] | 0.015 | No |
| Age-technology | OLMo | -0.106 | [-0.149, -0.066] | 0.106 | Yes |
| Age-technology | Tulu 3 | +0.005 | [-0.018, +0.027] | 0.005 | No |
| Age-technology | Pooled | -0.050 | [-0.075, -0.026] | 0.050 | Yes |
Table 2. Mean JSD by domain and stage transition (pooled across families and seeds; source: results/jsd_results.csv). If evaluating which training transition introduces the most distributional change, this suggests the SFT-to-DPO step dominates for gender-occupation while base-to-SFT dominates for other domains, but note JSD magnitude depends on the specific token targets measured.
| Domain | JSD (base-to-SFT) | JSD (SFT-to-DPO) | JSD (base-to-DPO) |
|---|---|---|---|
| Gender-occupation | 0.011 | 0.021 | 0.034 |
| Race-crime | 0.009 | 0.002 | 0.013 |
| Age-technology | 0.012 | 0.002 | 0.017 |
| Neutral | 0.010 | 0.003 | 0.018 |
Gender-occupation shows the largest cumulative JSD shift (0.034), driven primarily by the SFT-to-DPO transition (0.021, 62% of total). Race-crime and age-technology show smaller cumulative shifts (0.013, 0.017) with the base-to-SFT transition contributing the majority. The neutral domain (0.018) is comparable to social domains, consistent with the distribution-level findings above.
Table 3. Contribution decomposition by domain, family, and metric (bootstrap point estimates with 95% CI). If interpreting stage-wise responsibility for bias shifts, this suggests neither stage universally dominates, but note Tulu 3 gender-occupation and age-technology are non-identifiable due to small total shift magnitude.
| Domain | Family | Metric | SFT share | SFT 95% CI | DPO share | DPO 95% CI | Identifiable |
|---|---|---|---|---|---|---|---|
| Gender-occ | OLMo | bias_score | 0.34 | [0.20, 0.43] | 0.66 | [0.57, 0.80] | Yes |
| Gender-occ | OLMo | entropy | 0.49 | [0.21, 0.72] | 0.51 | [0.28, 0.79] | Yes |
| Gender-occ | Tulu 3 | bias_score | - | - | - | - | No (70% non-id) |
| Gender-occ | Tulu 3 | entropy | 0.37 | [0.09, 0.61] | 0.63 | [0.39, 0.91] | Yes |
| Race-crime | OLMo | bias_score | -0.14 | [-0.74, 0.25] | 1.14 | [0.75, 1.74] | Yes |
| Race-crime | OLMo | entropy | 0.16 | [-0.32, 0.43] | 0.84 | [0.57, 1.32] | Yes |
| Race-crime | Tulu 3 | bias_score | 1.23 | [1.01, 1.48] | -0.23 | [-0.48, -0.01] | Yes |
| Race-crime | Tulu 3 | entropy | 0.56 | [0.38, 0.69] | 0.44 | [0.31, 0.62] | Yes |
| Age-tech | OLMo | bias_score | 0.78 | [0.67, 0.88] | 0.22 | [0.12, 0.33] | Yes |
| Age-tech | OLMo | entropy | 0.63 | [0.53, 0.71] | 0.37 | [0.29, 0.47] | Yes |
| Age-tech | Tulu 3 | bias_score | - | - | - | - | No |
| Age-tech | Tulu 3 | entropy | 0.29 | [0.01, 0.48] | 0.71 | [0.52, 0.99] | Yes |
| Pooled | All | bias_score | 0.19 | [0.06, 0.30] | 0.81 | [0.70, 0.94] | Yes |
| Pooled | All | entropy | 0.43 | [0.35, 0.50] | 0.57 | [0.50, 0.65] | Yes |
Table 5. Neutral-domain metrics comparison: token-pair (bias_score) vs. distributional (entropy, JSD). If selecting neutral-control metrics for bias measurement, this suggests token-pair metrics may register artifacts from function-word frequency shifts, but note distributional metrics have their own sensitivity to vocabulary coverage.
| Metric | Family | Value | Threshold | Status |
|---|---|---|---|---|
| bias_score delta | Tulu 3 | -0.172 | |effect| > 0.10 | FIRED |
| bias_score delta | OLMo | -0.193 | |effect| > 0.10 | FIRED |
| bias_score delta | Pooled | -0.182 | |effect| > 0.10 | FIRED |
| Entropy delta | Tulu 3 | -0.041 | |delta| >= 0.10 | CLEAR |
| Entropy delta | OLMo | -0.042 | |delta| >= 0.10 | CLEAR |
| Entropy delta | Pooled | -0.041 | |delta| >= 0.10 | CLEAR |
| JSD ratio | Pooled | 0.848 | ratio >= 2.0 | CLEAR |
Table 6. Cross-domain entropy delta comparison (pooled). If assessing whether the neutral domain behaves anomalously relative to social domains, this suggests the neutral entropy delta falls within the range observed for social domains, but note entropy is a coarse measure of distributional change.
| Domain | Mean delta-entropy | 95% CI | |Effect| |
|---|---|---|---|
| Neutral | -0.041 | [-0.054, -0.029] | 0.041 |
| Gender-occupation | -0.045 | [-0.055, -0.037] | 0.045 |
| Race-crime | -0.038 | [-0.047, -0.028] | 0.038 |
| Age-technology | -0.074 | [-0.086, -0.063] | 0.074 |
Original prereg v1.0 remains Contradicted (immutable). Under prospective v2.0 adjudication with frozen neutral controls and explicit independent-control mapping, results resolve to R1 (artifact-supported) on this dataset. A BBQ-scoped corrective replication (V3.0) provided BBQ-scoped corrective support for R1 under conservative governance framing, suggesting the original F3 trigger was sensitive to neutral-control metric choice. This does not erase v1.0 history, does not constitute independent confirmation of a universal mechanism, and remains scoped to the tested datasets and model families.
In plain terms: we set out to test whether the bias patterns observed after preference tuning were mostly an amplification of biases already present in the base model. Our initial analysis said "no" (the neutral control showed unexpected bias, which should not happen if only pre-existing biases were being amplified). However, we then discovered that the specific way we measured the neutral control (comparing probabilities of "the" versus "a") was picking up an irrelevant signal rather than genuine bias introduction. When we used broader distributional measures instead, the neutral control behaved normally. A follow-up replication on the BBQ benchmark supported this interpretation. The bottom line is: on our specific test datasets and models, the evidence is consistent with sharpening of pre-existing patterns rather than creation of new biases, but this conclusion is provisional, applies only to what we tested, and the original "no" result is permanently recorded as part of the scientific record.
Table 4. Social-domain effect sizes: bias-score and entropy deltas by domain and family. If comparing bias shift magnitude across social domains to prioritize auditing effort, this suggests age-technology shows the largest entropy reduction while gender-occupation shows the largest directional shift for OLMo, but note these are pilot-scale observations (n = 300 per cell) on a single prompt panel.
| Domain | Family | delta-bias_score | bias 95% CI | delta-entropy | entropy 95% CI | n |
|---|---|---|---|---|---|---|
| Gender-occupation | OLMo | -0.129 | [-0.152, -0.107] | -0.063 | [-0.079, -0.048] | 300 |
| Gender-occupation | Tulu 3 | -0.006 | [-0.016, +0.004] | -0.028 | [-0.034, -0.021] | 300 |
| Race-crime | OLMo | -0.085 | [-0.125, -0.045] | -0.041 | [-0.057, -0.026] | 300 |
| Race-crime | Tulu 3 | +0.056 | [+0.037, +0.075] | -0.034 | [-0.044, -0.025] | 300 |
| Age-technology | OLMo | -0.106 | [-0.149, -0.066] | -0.090 | [-0.108, -0.072] | 300 |
| Age-technology | Tulu 3 | +0.005 | [-0.018, +0.027] | -0.059 | [-0.073, -0.044] | 300 |
Cross-family pattern: OLMo shows consistently negative bias-score deltas across all social domains (toward less stereotyped association after DPO). Tulu 3 shows near-zero or mildly positive effects, with the race-crime domain being the only case where Tulu 3 shows a statistically significant positive shift (+0.056). Entropy deltas are consistently negative across both families and all domains, indicating distributional concentration (sharpening) as a general effect of the training pipeline. The magnitude of entropy reduction is largest in age-technology for both families (OLMo: -0.090; Tulu 3: -0.059).
The original pre-registered analysis yielded a Contradicted verdict. All subsequent analyses (v1.2, v2.0, V3.0) are additive; they do not erase or overwrite this outcome. The contradiction is preserved in the permanent record.
The original neutral-control check used an arbitrary function-word pair (" the" / " a") as target tokens. While diagnostic investigation attributed the F3 firing to this choice, the study cannot definitively rule out that the neutral domain was partially confounded by model-specific token frequency distributions. The amended distributional metrics (entropy delta, JSD ratio) are more robust but remain dataset-specific.
All findings are restricted to:
No claims are made about generalizability to other datasets, domains, model scales, or preference-tuning methods beyond DPO.
The bias measurement relies on token-level probability comparisons (pronoun/association tokens). Models may express bias through word choice, framing, refusal patterns, or generation-level semantics not captured by this metric. This is an acknowledged limitation of the pilot design.
Fine-tuning generically increases model confidence. The neutral-control design partially addresses whether observed shifts are bias-specific versus global confidence effects, but this disentanglement is not complete. The amended distributional controls improve but do not fully resolve this concern.
Local memory constraints required reducing bootstrap resamples from B = 10,000 to B = 1,000. Convergence diagnostics (supplement S9) show CI widths stabilize with mean relative change of 0.8% from B = 750 to B = 1,000, indicating B = 1,000 is adequate for this pilot.
V3.0 Stage-2/3 results are explicitly scoped to BBQ cells only. No broader external-validity claims are made from the corrective replacement.
Even under R1 resolution with BBQ-scoped corrective support across all three V3 stages, this study does not establish a universal mechanism for how preference tuning affects bias. Findings suggest artifact-sensitivity of the original neutral-control metric on these datasets; broader claims require independent replication on additional datasets, model families, and scale tiers.
This pre-registered pilot study tested whether preference-tuning bias shifts are predominantly explained by distributional sharpening (H1). The original analysis contradicted this hypothesis due to a neutral-control violation. Subsequent investigation attributed the violation to metric sensitivity (specifically, the choice of function-word target tokens in the neutral control) rather than genuine distributional bias introduction.
Prospective re-adjudication (v2.0) resolved to R1 (artifact-supported), and a BBQ-scoped corrective replication (V3.0) provided BBQ-scoped corrective support for R1, suggesting the original F3 trigger was sensitive to neutral-control metric choice. However, the v1.0 contradiction is preserved as immutable, and no hypothesis status has been upgraded to "supported" or "confirmed." All corrective Stage-2/3 conclusions are drawn exclusively from the replacement provenance artifacts and apply only to the BBQ dataset.
The artifact-sensitivity finding has provisional implications: if preference tuning primarily sharpens pre-existing associations (as would be consistent with R1 under corrected controls), mitigation effort may be more productively directed at pre-training data curation rather than post-training objective design. However, this interpretation remains provisional and dataset-scoped (custom pilot panel and BBQ only); it does not justify policy changes without broader replication across additional benchmarks and model families.
This study demonstrates a falsification-first protocol for bias decomposition research:
The F3 episode illustrates the value of neutral controls and the risk of metric-specific artifacts in bias measurement.
Key open questions:
These questions are deferred to a planned V4 extension.
This pre-registered pilot found the sharpening-dominant hypothesis (H1) contradicted under the original v1.0 protocol due to a neutral-control violation, an outcome preserved as immutable. Diagnostic investigation and prospective re-adjudication (v2.0, V3.0) attributed the violation to token-pair metric sensitivity and resolved to R1 (artifact-supported) on the tested datasets, with BBQ-scoped corrective support across all three V3 stages. These findings are consistent with preference tuning sharpening pre-existing distributional associations rather than introducing novel bias pathways, but this interpretation is provisional, applies only to the custom pilot panel and BBQ benchmark at 7-8B scale (OLMo 2, Tulu 3), and does not establish a universal mechanism. The study contributes a falsification-first protocol for bias decomposition and demonstrates the sensitivity of token-level neutral controls to function-word measurement artifacts.
All governance artifacts (frozen baselines, SHA256 provenance chains, alias audits, invalidation notices, and amendment histories) are preserved in the supplement (S1-S4) and the project repository. Zero freeze breaches, zero alias regressions, and zero mid-cycle metric changes were detected across all phases. The v1.0 contradiction is verified as preserved in v1.2, v2.0, and V3.0 reporting.
This work was conducted as an independent research study. V3 BBQ corrective inference was executed on Modal A10G GPUs. We thank the open-source teams behind OLMo, Tulu, and BBQ.
Code, frozen artifacts, and configuration files are available at: https://github.com/Sohailm25/rlhf-entropy-pilot
Direct paths: paper/, figures/, results/
Reproduction details:
42, 137, 20265afdbd9b53380dcf; primary decode config 79cb2354784327c8; V3 governance baseline 8fef67d71896249bcf0fe2c4f61ee5eb728adfcd6e57521911e3442b017324cdB=1000 with convergence diagnostics in supplement §S9results/v3/reports/PROGRAM_CLOSURE_PACKET_2026-02-21.mdBai, Y., et al. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
Blodgett, S. L., Barocas, S., Daume III, H., & Wallach, H. (2020). Language (technology) is power: A critical survey of bias in NLP. Proceedings of ACL.
Gallegos, I. O., et al. (2024). Bias and fairness in large language models: A survey. Computational Linguistics, 50(4).
Nadeem, M., Bethke, A., & Reddy, S. (2021). StereoSet: Measuring stereotypical bias in pretrained language models. Proceedings of ACL.
Nangia, N., Vania, C., Bhalerao, R., & Bowman, S. R. (2020). CrowS-Pairs: A challenge dataset for measuring social biases in masked language models. Proceedings of EMNLP.
Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. Advances in NeurIPS, 35.
Parrish, A., et al. (2022). BBQ: A hand-built bias benchmark for question answering. Findings of ACL.