Condition-Dependent Collapse Dynamics in Multi-Turn LLM Self-Play

When chat models talk to themselves for 40 turns, some collapse into repetition. This study measures how much, and where the measurement breaks down.

Abstract

When two chat models converse for 40 turns, do they keep producing novel responses, or collapse into repetitive loops? We measured this across 720 conversations spanning four model-pairing conditions (three homogeneous self-play setups and one heterogeneous rotation). All 720 conversations ran to completion under a fixed protocol. Collapse rates varied substantially by condition, but our automated collapse detector did not pass its preregistered reliability check (Cohen's κ = 0.566 vs. threshold 0.80). Unlike raw accuracy, Cohen's κ accounts for chance agreement: our 92.8% raw agreement yields only κ = 0.566 because collapse base rates are skewed, which inflates chance agreement. Because of this, all findings are descriptive: we report observed patterns, not validated detector claims or causal explanations.

Reliability disclosure. The detector reliability prereg gate was NOT MET (κ=0.566). This page mirrors that limitation explicitly and does not reinterpret the gate as passed. This is a reproducible baseline dataset and protocol outcome, not a validated detector claim.

Primary contributions: (1) a locked, reproducible measurement protocol for multi-turn collapse with frozen artifacts; (2) a complete 720-trajectory descriptive benchmark; and (3) a transparent account of a preregistered reliability gate failure and its consequences for claim scope. We believe this is independently informative for researchers designing LLM evaluation pipelines.

Why this matters

Multi-turn conversations are where most real LLM usage happens, and where quality is hardest to measure. If models quietly degrade into repetitive loops after enough turns, that matters for anyone building chat products, agents, or evaluation pipelines. But measuring this degradation reliably turns out to be harder than measuring it at all.

This study contributes a locked protocol and complete dataset for studying the problem. It also contributes a transparent account of a reliability gate failure: our automated detector didn't pass its own preregistered agreement check, which we think is independently useful for researchers designing LLM evaluation systems.

At a glance

Conversations completed: 720 / 720 (40 turns each, zero protocol attrition)
Detector reliability: κ = 0.566 (below the prereg threshold of 0.80)
Conditions tested: 4 (3 homogeneous + 1 heterogeneous rotation)
Trajectory depth: 40 turns per conversation under locked criteria
Finding 1 — Condition effects are visible in the baseline data

Across 720 completed trajectories, collapse behavior varies materially by condition, with heterogeneous rotation showing lower collapse than the highest-collapse homogeneous setup in this protocol.

Finding 2 — Reliability gate failure constrains claim strength

The detector reached high raw agreement but only κ = 0.566, below the preregistered 0.80 threshold, so conclusions are descriptive and comparative rather than confirmatory detector-validity claims.

Finding 3 — Main contribution is protocol closure and transparency

The study delivers a fully executed 720/720 benchmark with locked collapse rules, intact protocol integrity, and explicit disclosure of where interpretation must stop.

Use this as a reproducible baseline for future collapse-measurement work, especially reliability-improvement and human-rater validation studies.

How collapse is defined (locked criteria)

We do not label collapse by vibe. A turn is marked as collapsed only when both a repetition condition and a low-drift condition are met under fixed thresholds; the exact locked rules are frozen with the artifacts in the companion repository.
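
As a rough sketch of that two-condition structure, a detector might look like the following. The trigram choice, Jaccard similarity, and the 0.8 / 0.2 thresholds here are illustrative assumptions, not the study's locked criteria:

```python
# Hypothetical sketch of a repetition + low-drift collapse check.
# The n=3 n-gram size, Jaccard metric, and 0.8 / 0.2 thresholds are
# illustrative assumptions, NOT the study's locked criteria.
def ngrams(text, n=3):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def is_collapsed(turn, history, rep_threshold=0.8, drift_threshold=0.2):
    """Flag a turn only when BOTH conditions hold:
    (1) repetition: high n-gram overlap with some earlier turn, and
    (2) low drift: little new n-gram content vs. the previous turn."""
    if not history:
        return False
    cur = ngrams(turn)
    repetition = any(jaccard(cur, ngrams(prev)) >= rep_threshold for prev in history)
    novelty = len(cur - ngrams(history[-1])) / max(len(cur), 1)
    return repetition and novelty <= drift_threshold
```

A verbatim-repeated turn (high overlap, zero novelty) trips both conditions, while a topically related but reworded turn trips neither.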

Study setup

Seed prompts

The 36 seed prompts (SEEDS_V2) were drawn from six topical buckets: factual Q&A, instruction following, creative rewrite, argumentative, conversational, and practical task, with six prompts per bucket. Examples: "Explain why eclipses do not happen every month" (factual), "Give me exactly 3 bullet points on how to prepare for a technical interview" (instruction), "Argue both for and against using open-weight models in production safety-critical systems" (argumentative). The full seed set is available in the companion repository.

Context window handling

The full conversation history is passed to the model at each turn without truncation. At 40 turns with up to 256 tokens each, conversations can reach ~10,000 tokens. The models have context windows of 8,192 (Llama, Mistral) to 32,768 (Qwen) tokens. In practice, collapsed turns produce short outputs, keeping most conversations within limits. We did not implement explicit truncation; if context exceeded the model's window, internal left-truncation would apply. This is a limitation: later-turn behavior in longer conversations may be partially confounded by context overflow.
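
The arithmetic above can be sanity-checked in a few lines. The per-turn token cap and window sizes come from the text; the worst-case estimate deliberately ignores seed-prompt and chat-template tokens, so it understates true usage:

```python
# Back-of-envelope context budget check for the 40-turn protocol.
# Ignores seed-prompt and chat-template tokens, so it UNDERESTIMATES usage.
MAX_TOKENS_PER_TURN = 256
TURNS = 40
CONTEXT_WINDOWS = {"llama": 8_192, "mistral": 8_192, "qwen": 32_768}

worst_case = TURNS * MAX_TOKENS_PER_TURN  # 10,240 tokens
for model, window in CONTEXT_WINDOWS.items():
    print(f"{model}: {worst_case} vs {window} -> overflow possible: {worst_case > window}")
```

This is why only the 8,192-token models are at risk of internal left-truncation in the worst case, while Qwen's window comfortably covers the full protocol.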

Condition mapping (model identity)

Condition     Model setup
HOMO_A        Llama-3.1-8B-Instruct self-play (same model on both sides)
HOMO_B        Qwen2.5-7B-Instruct self-play (same model on both sides)
HOMO_C        Mistral-7B-Instruct-v0.3 self-play (same model on both sides)
HETERO_ROT    Round-robin heterogeneous rotation across the three model families
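
For concreteness, HETERO_ROT's round-robin assignment can be sketched as below; the rotation order and starting model shown here are assumptions, and the study's exact schedule may differ in its frozen configuration:

```python
from itertools import cycle

# Sketch of round-robin speaker assignment for HETERO_ROT. Model names come
# from the condition table; the order and starting offset are assumptions.
MODELS = ["Llama-3.1-8B-Instruct", "Qwen2.5-7B-Instruct", "Mistral-7B-Instruct-v0.3"]

def hetero_rot_speakers(num_turns):
    """Assign one model per turn, cycling through the three families."""
    rot = cycle(MODELS)
    return [next(rot) for _ in range(num_turns)]

print(hetero_rot_speakers(4))  # cycles back to the first model on turn 4
```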

What collapse looks like in plain language

The core question is simple: when two chat models keep talking for 40 turns, do they keep producing meaningfully new content, or do they fall into repetitive loops?

Below are representative excerpts selected from raw turn logs after scanning all 180 cached trajectories. All examples are descriptive illustrations rather than causal evidence.

Example A: strong repetition loop (HOMO_B: Qwen self-play)

Turn 18: "And there you have it! The function absolutely nails it, delivering the exact value we were hoping for ..."
Turn 24: "And there you have it! The function absolutely nails it, delivering the exact value we were hoping for ..."
Turn 39: "Got it! Here's your comment: And there you have it! The function absolutely nails it ..."

Interpretation: near-identical template reuse over long spans, which is exactly the collapse behavior this benchmark is designed to capture.

Example B: low-collapse trajectory (HOMO_C: Mistral self-play)

Turn 18: "Bias mitigation ... fairness-aware machine learning algorithms ..."
Turn 24: "Explainability and transparency are also important for regulatory compliance ..."
Turn 39: "Thank you ... don't hesitate to reach out if you have other questions ..."

Interpretation: still topically related, but content continues to move and reframe rather than locking into one repeated response shell.

Example C: intermediate pattern (HETERO_ROT: heterogeneous rotation)

Turn 18: "Develop and implement policies and regulations to reduce greenhouse gas emissions ..."
Turn 30: "By working together, we can make a significant difference ..."
Turn 39: "I hope this document provided a useful framework ..."

Interpretation: some thematic repetition, but less rigid template-locking than high-collapse homogeneous runs.

Key baseline results

Figure 1. Mean collapse rate by condition. HOMO_C (Mistral) collapses most frequently; HETERO_ROT (heterogeneous rotation) is lowest. Use this to compare baseline susceptibility across model families, not to rank models for deployment. κ = 0.566 means these rates carry label uncertainty.

Figure 2. Distribution of collapse rates across tuples. The spread within each condition shows seed-level variance. Wide distributions mean individual conversations vary substantially even under identical setup. Do not use narrow means from Figure 1 without considering this spread.

Figure 3. Distribution of first-collapse turn index. Earlier peaks indicate faster deterioration. If building multi-turn systems, this suggests where to place quality checkpoints, but note this reflects our 40-turn protocol with specific seed prompts, not arbitrary deployment conditions.

Interpretation scope

Claim type                                Status                    Scope
Baseline execution closure                Supported                 720/720 tuple completion with protocol integrity checks.
Condition-comparative collapse patterns   Supported (descriptive)   Reported under locked detector outputs.
Detector reliability validation           Not supported             Preregistered κ gate not met (0.566 < 0.80).
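
To make the chance-correction behind the κ gate concrete, the counts below are hypothetical, chosen only to mirror the headline numbers (92.8% raw agreement at a skewed collapse base rate); they are not the study's actual confusion matrix:

```python
# Illustrative Cohen's kappa computation. The confusion-matrix counts are
# HYPOTHETICAL, chosen to mirror the headline numbers, not the study's data.
def cohens_kappa(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    po = (tp + tn) / n        # observed (raw) agreement
    det_pos = (tp + fp) / n   # detector's positive rate
    ref_pos = (tp + fn) / n   # reference rater's positive rate
    # chance agreement: both say "collapse" or both say "no collapse" at random
    pe = det_pos * ref_pos + (1 - det_pos) * (1 - ref_pos)
    return (po - pe) / (1 - pe)

# 1,000 turns, ~9% collapse base rate, 928 agreements:
kappa = cohens_kappa(tp=54, fp=36, fn=36, tn=874)
print(f"raw agreement 92.8%, kappa = {kappa:.2f}")  # kappa lands near 0.56
```

With 91% of turns non-collapsed, two raters agreeing "no collapse" most of the time is largely expected by chance, which is why 92.8% agreement converts to a κ well below 0.80.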

Decision relevance

What this study shows

All 720 conversations completed 40 turns under the locked protocol with intact integrity checks, and collapse rates differ materially by condition in the baseline data, with heterogeneous rotation lowest and HOMO_C (Mistral self-play) highest.

What this study does NOT show

That the detector is validated: the preregistered κ gate was not met (0.566 < 0.80). It also does not establish causal explanations for the condition differences, and it does not rank models for deployment.

Highest-value next step: independent human-rater reliability audit to test whether condition-level patterns persist under stronger label quality.

Limitations

Three limitations bound interpretation. First, detector reliability fell below the preregistered κ threshold (0.566 < 0.80), so every reported collapse rate carries label uncertainty. Second, context overflow was not explicitly handled; later-turn behavior in longer conversations may be partially confounded by internal left-truncation. Third, results reflect a 40-turn protocol with a specific 36-prompt seed set and three open-weight 7–8B model families, not arbitrary deployment conditions.

Acknowledgments

This work was conducted as an independent research study. We thank the open-source model and evaluation communities whose tooling supported this benchmark cycle.

Reproducibility

Code, frozen artifacts, and configuration files are available at: https://github.com/Sohailm25/escape-velocity
Direct paths: README.md, ARTIFACT_INDEX.md, REPRODUCIBILITY.md