Chain-of-Thought Reasoning Solves Action Collapse in Low-Cardinality RL
We trained language models to trade binary prediction markets using reinforcement learning. Initial attempts with standard PPO led to severe action collapse: agents converged to never holding positions (0% HOLD rate by batch 30), producing degenerate policies despite reasonable profit. Adding chain-of-thought (CoT) reasoning before action selection eliminated the observed collapse in this training setup: CoT agents maintained 15-30% HOLD rates and ~0.95 policy entropy throughout training while achieving comparable performance (+$0.060 vs +$0.063 for always-buy-no baseline). This is consistent with the idea that in low-cardinality action spaces where multiple actions can yield similar rewards, explicit reasoning steps can reduce premature convergence by maintaining exploration of the strategic landscape.
This supports a testable design heuristic: in low-cardinality RL with sparse delayed payoffs, evaluate explicit reasoning before action selection as one anti-collapse intervention.
Teaching language models to make financial decisions is challenging. The action space is discrete and low-cardinality (BUY_YES, BUY_NO, HOLD), rewards are noisy and delayed, and multiple strategies can yield similar outcomes. Standard reinforcement learning easily falls into the trap of action collapse—converging to a single action regardless of context.
We approached this problem through prediction markets: binary yes/no questions with evolving probabilities and trading opportunities. Our agent observes market state (current probability, question text, resolution criteria) and chooses whether to buy yes, buy no, or hold. The reward is realized profit when the market resolves.
Key findings:
We used binary prediction markets from Manifold Markets, filtering for:
Final dataset: 400 markets for training, 40 for evaluation. Each market provides question text, description, resolution criteria, and a time series of probability updates representing trading opportunities.
Example market:
We initially tried Group Relative Policy Optimization (GRPO), sampling multiple actions per state and using relative ranking for rewards. This failed catastrophically:
GRPO works well for response ranking in chat models but breaks down when reward variance is high and actions have delayed, stochastic outcomes.
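For context, the core of GRPO's credit assignment can be sketched in a few lines: sample a group of actions from the same state and z-score their rewards within the group. This is an assumed minimal form (the full algorithm also involves a KL penalty and token-level terms):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Z-score each sampled action's reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]
```

When outcome noise dominates, i.e. the same action can win or lose depending on how the market happens to resolve, these within-group z-scores rank luck rather than decision quality, which matches the failure mode described above.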
PPO v2: Standard PPO with terminal profit reward only. Achieved +$0.008 mean P&L but collapsed to 0% HOLD by batch 30.
PPO v3: Added shaped intermediate rewards:
Result: identical collapse. PPO v3 reached +$0.012 mean P&L (slightly better) but still fell to a 0% HOLD rate and ~0.60 policy entropy.
We modified the architecture to require explicit reasoning:
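One simple way to realize "reasoning before action" is to require a free-form rationale terminated by an `ACTION:` line, then parse that line into the discrete action. The prompt format and function name below are assumptions for illustration, not the repo's exact interface:

```python
import re

def parse_action(completion: str) -> str:
    """Extract the action the model commits to after its reasoning tokens."""
    match = re.search(r"ACTION:\s*(BUY_YES|BUY_NO|HOLD)", completion)
    return match.group(1) if match else "HOLD"   # fall back to HOLD on malformed output
```

Defaulting to HOLD on a parse failure is one design choice; rejecting the completion and resampling is another.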
The model must generate reasoning tokens before selecting an action. In our runs, this change prevented the observed collapse pattern:
| Agent | Mean P&L | Win Rate | HOLD % | Entropy |
|---|---|---|---|---|
| v1 Numeric PPO | +$0.0228 (+2.28%) | 90% | N/A | N/A |
| always_buy_no | +$0.063 | 60% | N/A | N/A |
| CoT PPO B50 | +$0.060 | 32.5% | 5.0% | ~0.95 |
| PPO v3 B30 | +$0.012 | 72.5% | 0% | ~0.60 |
| PPO v2 B30 | +$0.008 | N/A | 0% | ~0.60 |
| CoT SFT | -$0.029 | 27.5% | 5.0% | N/A |
| Random | -$0.1456 | N/A | N/A | N/A |
Key observations:
Why does HOLD disappear in non-CoT agents?
In prediction markets, HOLD is often the correct action:
But HOLD provides no immediate feedback—you only learn whether holding was correct when the market resolves, hundreds of timesteps later. The RL signal for HOLD is sparse and delayed.
In contrast, BUY_YES and BUY_NO provide immediate state changes (you now have a position). Even if these actions are suboptimal, they produce clearer gradients. The agent learns "buying does something" faster than it learns "holding is often optimal."
With a 3-action space and noisy rewards, standard PPO gravitates toward actions with stronger gradients, even if those actions aren't globally optimal. By batch 30, the agent has effectively reduced to a 2-action policy.
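The entropy figures quoted throughout can be read as the Shannon entropy (in nats) of each batch's empirical action distribution. A minimal version of that diagnostic:

```python
import math

def policy_entropy(action_counts: dict[str, int]) -> float:
    """Shannon entropy (nats) of the empirical action distribution in a batch."""
    total = sum(action_counts.values())
    probs = [c / total for c in action_counts.values() if c > 0]
    return -sum(p * math.log(p) for p in probs)
```

A uniform 3-action policy scores ln 3 ≈ 1.10; a 60/40/0 split scores ≈ 0.67, in the neighborhood of the ~0.60 reported for the collapsed agents; the CoT agent's ~0.95 indicates all three actions remain in play.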
Chain-of-thought reasoning introduces several mechanisms that prevent collapse:
The reasoning tokens create an intermediate latent space where the model explores why to take an action before committing to it. This slows down convergence and forces consideration of multiple paths.
With CoT, the model learns "reasoning patterns that lead to good actions" rather than just "good actions." This compositional structure provides more learning signal from sparse rewards.
Generating 50-100 reasoning tokens adds stochasticity to the forward pass. Even with the same state input, different reasoning paths can lead to different actions. This acts as a form of entropy regularization.
The model was pretrained on human text that includes deliberation before decisions. CoT likely activates this prior, making "think then act" a more natural behavior than "act immediately."
We ran these experiments on two model bases:
The v1 Numeric PPO agent (+2.28%, 90% win rate) was trained on Prime with only numeric features (current price, historical prices, position value). No text, no reasoning, just numbers.
This creates an interesting comparison:
| Agent | Model | Input | Mean P&L | Win Rate |
|---|---|---|---|---|
| v1 Numeric PPO | Prime (7B) | Numbers only | +$0.0228 (+2.28%) | 90% |
| CoT PPO | Tinker (0.5B) | Text + CoT | +$0.060 | 32.5% |
| PPO v2/v3 | Tinker (0.5B) | Text, no CoT | +$0.008–0.012 | 72.5% (v3) |
Why does numeric-only Prime outperform text-based Tinker?
The fact that CoT on Tinker achieves 60% of Prime's performance with 1/14th the parameters suggests that reasoning helps bridge the capability gap, but doesn't fully close it.
In low-cardinality action spaces with sparse, delayed rewards, standard RL can collapse to degenerate policies even when achieving positive returns. Chain-of-thought reasoning prevents this collapse by introducing compositional structure and implicit exploration.
The key insight: reasoning tokens act as a form of structured regularization that preserves policy diversity during training. This suggests that for complex decision-making tasks, we may want to explicitly build in deliberation steps rather than hoping end-to-end optimization will learn them.
Future work should explore:
This work was conducted as an independent research study. We thank the open-source RL and prediction-market communities whose tools and datasets enabled this experiment.
Code and experiment assets are available at: https://github.com/Sohailm25/prime-v-tinker-trader
Direct paths: README.md, ARTIFACT_INDEX.md, REPRODUCIBILITY.md