Teaching an LLM to Trade Prediction Markets

Chain-of-Thought Reasoning Solves Action Collapse in Low-Cardinality RL

Abstract

We trained language models to trade binary prediction markets using reinforcement learning. Initial attempts with standard PPO led to severe action collapse: agents converged to never holding positions (0% HOLD rate by batch 30), producing degenerate policies despite reasonable profit. Adding chain-of-thought (CoT) reasoning before action selection eliminated the observed collapse in this training setup: CoT agents maintained 15-30% HOLD rates and ~0.95 policy entropy throughout training while achieving comparable performance (+$0.060 vs +$0.063 for always-buy-no baseline). This is consistent with the idea that in low-cardinality action spaces where multiple actions can yield similar rewards, explicit reasoning steps can reduce premature convergence by maintaining exploration of the strategic landscape.

At a glance

Training / eval markets: 400 / 40 resolved binary Manifold markets
CoT PPO return: +$0.060, vs +$0.008 to +$0.012 for non-CoT PPO
Policy entropy: ~0.95 maintained by CoT throughout training; non-CoT fell to ~0.60
HOLD rate: collapsed to 0% by batch 30 without CoT; CoT kept 15-30% in training
Finding 1 — In this setup, CoT removed the observed collapse pattern
In this 3-action setup, standard PPO converges to action collapse, while requiring reasoning tokens preserves HOLD behavior and higher policy diversity throughout training.
Finding 2 — Better behavioral quality without sacrificing competitiveness
CoT PPO remains near the strongest simple baseline on mean P&L while avoiding the degenerate low-entropy policy profile seen in non-CoT variants.
Finding 3 — Action-space structure matters as much as reward shaping
Shaped rewards alone did not prevent collapse, suggesting that explicit deliberation can provide missing credit-assignment structure where delayed rewards under-inform HOLD.

This supports a testable design heuristic: in low-cardinality RL with sparse delayed payoffs, evaluate explicit reasoning before action selection as one anti-collapse intervention.

1. Introduction

Teaching language models to make financial decisions is challenging. The action space is discrete and low-cardinality (BUY_YES, BUY_NO, HOLD), rewards are noisy and delayed, and multiple strategies can yield similar outcomes. Standard reinforcement learning easily falls into the trap of action collapse—converging to a single action regardless of context.

We approached this problem through prediction markets: binary yes/no questions with evolving probabilities and trading opportunities. Our agent observes market state (current probability, question text, resolution criteria) and chooses whether to buy yes, buy no, or hold. The reward is realized profit when the market resolves.

Our key findings are the three summarized above: CoT removed the observed HOLD collapse, kept P&L competitive with the strongest simple baseline, and indicated that action-space structure matters as much as reward shaping. The remainder of the paper details how we arrived at them.

2. Dataset

We used prediction markets from Manifold Markets, filtering for resolved binary markets with recorded probability histories.

Final dataset: 400 markets for training, 40 for evaluation. Each market provides question text, description, resolution criteria, and a time series of probability updates representing trading opportunities.

Example market:

Question: Will the new iPhone be released by September 2024?
Resolution: YES
Probability history: 0.45 → 0.52 → 0.67 → 0.78 → 0.89 (resolves YES)
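Under this setup, the reward for one trajectory through the example market can be sketched as follows (a minimal sketch: the `Market` container and the 1-share position sizing are illustrative assumptions, not our exact implementation):

```python
from dataclasses import dataclass

@dataclass
class Market:
    question: str
    prob_history: list[float]  # probability at each trading opportunity
    resolved_yes: bool         # final resolution

def episode_reward(market: Market, actions: list[str]) -> float:
    """Realized profit for one market: each BUY opens a 1-share position at
    the current probability and pays out at resolution. HOLD costs nothing."""
    payout_yes = 1.0 if market.resolved_yes else 0.0
    profit = 0.0
    for prob, action in zip(market.prob_history, actions):
        if action == "BUY_YES":
            profit += payout_yes - prob                  # pay prob, receive payout
        elif action == "BUY_NO":
            profit += (1.0 - payout_yes) - (1.0 - prob)  # pay 1 - prob
    return profit

m = Market("Will the new iPhone be released by September 2024?",
           [0.45, 0.52, 0.67, 0.78, 0.89], resolved_yes=True)
print(round(episode_reward(m, ["BUY_YES", "HOLD", "HOLD", "HOLD", "HOLD"]), 2))  # 0.55
```

Buying YES early (at 0.45) earns more per share than buying late (at 0.89), which is what makes the timing of trades, and therefore HOLD, strategically meaningful.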

3. Method: From GRPO to PPO to CoT

3.1 Phase 1: GRPO (Failed)

We initially tried Group Relative Policy Optimization (GRPO), sampling multiple actions per state and using relative ranking for rewards. This failed catastrophically.

GRPO works well for response ranking in chat models but breaks down when reward variance is high and actions have delayed, stochastic outcomes.

3.2 Phase 2: PPO (Partial Success, Action Collapse)

PPO v2: Standard PPO with terminal profit reward only. Achieved +$0.008 mean P&L but collapsed to 0% HOLD by batch 30.

PPO v3: Standard PPO plus shaped intermediate rewards layered on the terminal profit signal.

Result: identical collapse. PPO v3 reached +$0.012 (slightly better) but still fell to 0% HOLD and ~0.60 entropy.
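One way to see why shaping alone may not change the learned policy: a per-step mark-to-market term telescopes back to the same terminal profit. The sketch below assumes a single 1-share YES position opened at the first price; the exact shaping terms used in our runs are not shown here, so everything in this block is an illustrative assumption:

```python
def shaped_rewards(prob_history: list[float], resolved_yes: bool) -> list[float]:
    """Hypothetical shaping for one 1-share YES position opened at the first
    price: per-step mark-to-market change, plus the realized payout at
    resolution. (An illustrative assumption, not the exact training terms.)"""
    step = [prob_history[t] - prob_history[t - 1] for t in range(1, len(prob_history))]
    terminal = (1.0 if resolved_yes else 0.0) - prob_history[-1]
    return step + [terminal]

# The shaped rewards telescope to the same total as the terminal-only reward:
r = shaped_rewards([0.45, 0.52, 0.67, 0.89], resolved_yes=True)
print(round(sum(r), 2))  # 0.55 == 1 - 0.45, the realized profit
```

Because the cumulative reward is unchanged, this kind of shaping densifies the signal for open positions without adding any signal for HOLD, consistent with v3 collapsing the same way v2 did.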

3.3 Phase 3: CoT PPO (Success)

We modified the architecture to require explicit reasoning:

Prompt template:
You are trading a prediction market. Think step by step.

Question: {question}
Current probability: {prob}
Your position: {position}

Reasoning: [model generates analysis]
Action: [BUY_YES | BUY_NO | HOLD]

The model must generate reasoning tokens before selecting an action. In our runs, this change prevented the observed collapse pattern.
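The template above can be rendered and the completion parsed along these lines (a sketch; the regex and the HOLD fallback are our assumptions, not the exact implementation):

```python
import re

TEMPLATE = """You are trading a prediction market. Think step by step.

Question: {question}
Current probability: {prob}
Your position: {position}

Reasoning:"""

def build_prompt(question: str, prob: float, position: str) -> str:
    """Render the CoT trading prompt for one market state."""
    return TEMPLATE.format(question=question, prob=prob, position=position)

def parse_action(completion: str) -> str:
    """Take the action named after the final 'Action:' tag in the generated
    text; fall back to HOLD if nothing parseable appears (an assumption)."""
    hits = re.findall(r"Action:\s*(BUY_YES|BUY_NO|HOLD)", completion)
    return hits[-1] if hits else "HOLD"

out = "The price looks low relative to the evidence.\nAction: BUY_YES"
print(parse_action(out))  # BUY_YES
```

Defaulting unparseable output to HOLD is a conservative choice: a malformed generation then costs nothing rather than opening an unintended position.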

4. Results

Figure 1: HOLD action distribution during PPO training. Non-CoT methods (red/orange) collapse to 0% while CoT (green) maintains 15-30%.

Figure 2: Policy entropy during training. CoT maintains ~0.95 normalized entropy while non-CoT crashes to ~0.60.

Figure 3: Mean reward during training across all three PPO variants.

Figure 4: Mean P&L by agent. CoT PPO achieves competitive performance while maintaining policy diversity.

4.1 Quantitative Results

Agent          | Mean P&L          | Win Rate | HOLD % | Entropy
v1 Numeric PPO | +$0.0228 (+2.28%) | 90%      | N/A    | N/A
always_buy_no  | +$0.063           | 60%      | N/A    | N/A
CoT PPO B50    | +$0.060           | 32.5%    | 5.0%   | ~0.95
PPO v3 B30     | +$0.012           | 72.5%    | 0%     | ~0.60
PPO v2 B30     | +$0.008           | N/A      | 0%     | ~0.60
CoT SFT        | -$0.029           | 27.5%    | 5.0%   | N/A
Random         | -$0.1456          | N/A      | N/A    | N/A

Key observations: CoT PPO B50 (+$0.060) lands within rounding of the strongest simple baseline, always_buy_no (+$0.063), while being the only PPO variant to retain HOLD usage (5.0% at evaluation) and high entropy (~0.95). Both non-CoT PPO variants end at 0% HOLD and ~0.60 entropy, and every trained agent beats Random (-$0.1456), though CoT SFT still loses money (-$0.029).

5. The HOLD Collapse Problem

Why does HOLD disappear in non-CoT agents?

In prediction markets, HOLD is often the correct action: when the current probability already reflects the available information, trading in either direction has negative expected value.

But HOLD provides no immediate feedback—you only learn whether holding was correct when the market resolves, hundreds of timesteps later. The RL signal for HOLD is sparse and delayed.

In contrast, BUY_YES and BUY_NO provide immediate state changes (you now have a position). Even if these actions are suboptimal, they produce clearer gradients. The agent learns "buying does something" faster than it learns "holding is often optimal."

With a 3-action space and noisy rewards, standard PPO gravitates toward actions with stronger gradients, even if those actions aren't globally optimal. By batch 30, the agent has effectively reduced to a 2-action policy.
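The entropy numbers reported throughout are normalized Shannon entropy over the 3-action distribution. The distributions below are illustrative, but they reproduce the ~0.60 vs ~0.95 pattern: dropping HOLD from the policy is by itself enough to pull normalized entropy down to the collapsed regime.

```python
import math

def normalized_entropy(probs: tuple) -> float:
    """Shannon entropy of an action distribution, normalized by log(n) so
    that a uniform policy scores 1.0 and a deterministic one scores 0.0."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

# Illustrative distributions over (BUY_YES, BUY_NO, HOLD):
collapsed = (0.45, 0.55, 0.0)   # HOLD never chosen, as in non-CoT PPO
diverse   = (0.40, 0.37, 0.23)  # HOLD retained, as with CoT

print(round(normalized_entropy(collapsed), 2))  # 0.63
print(round(normalized_entropy(diverse), 2))    # 0.98
```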

6. Why CoT Works

Chain-of-thought reasoning introduces several mechanisms that prevent collapse:

6.1 Explicit Exploration Space

The reasoning tokens create an intermediate latent space where the model explores why to take an action before committing to it. This slows down convergence and forces consideration of multiple paths.

6.2 Compositional Credit Assignment

With CoT, the model learns "reasoning patterns that lead to good actions" rather than just "good actions." This compositional structure provides more learning signal from sparse rewards.
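A REINFORCE-style sketch makes the difference in signal volume concrete: the episode reward scales the log-probability of every generated token, so reasoning tokens share in the credit rather than the action token carrying it alone. The log-probabilities below are made-up numbers, and PPO adds clipping and advantage estimation on top of this basic rule:

```python
def policy_gradient_terms(token_logprobs: list[float], reward: float) -> list[float]:
    """REINFORCE-style credit assignment: every generated token (reasoning
    and action alike) gets a gradient term scaled by the episode reward."""
    return [reward * lp for lp in token_logprobs]

# Without CoT, only the single action token carries signal:
no_cot = policy_gradient_terms([-0.7], reward=0.5)
# With CoT, the 50-100 reasoning tokens share the same reward:
with_cot = policy_gradient_terms([-1.2, -0.4, -0.9, -0.7], reward=0.5)
print(len(no_cot), len(with_cot))  # 1 4
```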

6.3 Implicit Regularization

Generating 50-100 reasoning tokens adds stochasticity to the forward pass. Even with the same state input, different reasoning paths can lead to different actions. This acts as a form of entropy regularization.
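The stochasticity argument can be made concrete: sampling reasoning paths at nonzero temperature induces a spread over actions even for a fixed state. In the toy sketch below, Gaussian noise stands in for the variability contributed by different sampled reasoning paths; the whole block is purely illustrative:

```python
import random
from collections import Counter

def sample_action(base_probs: list[float], temperature: float,
                  rng: random.Random) -> str:
    """Stand-in for 'sample a reasoning path, then pick an action': noise
    from the sampled path perturbs which action wins. Purely illustrative."""
    noisy = [p + rng.gauss(0.0, temperature) for p in base_probs]
    return ("BUY_YES", "BUY_NO", "HOLD")[noisy.index(max(noisy))]

# Same state (same base_probs) every call, yet actions stay spread out:
rng = random.Random(0)
base = [0.40, 0.35, 0.25]
counts = Counter(sample_action(base, temperature=0.3, rng=rng) for _ in range(1000))
print(sorted(counts))  # all three actions appear
```

With temperature at zero the argmax is deterministic and the spread vanishes, which is the sense in which reasoning-path sampling behaves like entropy regularization.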

6.4 Human-Like Decision Process

The model was pretrained on human text that includes deliberation before decisions. CoT likely activates this prior, making "think then act" a more natural behavior than "act immediately."

7. Tinker vs Prime: The Importance of Architecture

We ran these experiments on two model bases: Prime (7B) and Tinker (0.5B).

The v1 Numeric PPO agent (+2.28%, 90% win rate) was trained on Prime with only numeric features (current price, historical prices, position value). No text, no reasoning, just numbers.

This creates an interesting comparison:

Agent          | Model         | Input        | Mean P&L      | Win Rate
v1 Numeric PPO | Prime (7B)    | Numbers only | +2.28%        | 90%
CoT PPO        | Tinker (0.5B) | Text + CoT   | +$0.060       | 32.5%
PPO v2/v3      | Tinker (0.5B) | Text, no CoT | +$0.008–0.012 | 72.5%

Why does numeric-only Prime outperform text-based Tinker?

The fact that CoT on Tinker achieves 60% of Prime's performance with 1/14th the parameters suggests that reasoning helps bridge the capability gap, but doesn't fully close it.

8. Limitations

Our results come from a single training setup: one dataset (400 training / 40 evaluation resolved binary Manifold markets), one small model per condition, and a 3-action space. The evaluation set is small, the CoT agent's evaluation-time HOLD rate (5.0%) sits well below its training-time range (15-30%), and we have not tested whether the anti-collapse effect transfers to larger action spaces, other reward structures, or larger models.

9. Conclusion

In low-cardinality action spaces with sparse, delayed rewards, standard RL can collapse to degenerate policies even while achieving positive returns. In our experiments, chain-of-thought reasoning prevented this collapse, plausibly by introducing compositional structure and implicit exploration.

The key insight: reasoning tokens act as a form of structured regularization that preserves policy diversity during training. This suggests that for complex decision-making tasks, we may want to explicitly build in deliberation steps rather than hoping end-to-end optimization will learn them.

Future work should explore whether the anti-collapse effect holds at larger model scales, how reasoning-token length trades off against it, how it compares to explicit entropy-bonus regularization, and whether it transfers to other low-cardinality decision domains.

Acknowledgments

This work was conducted as an independent research study. We thank the open-source RL and prediction-market communities whose tools and datasets enabled this experiment.

Reproducibility

Code and experiment assets are available at: https://github.com/Sohailm25/prime-v-tinker-trader
Direct paths: README.md, ARTIFACT_INDEX.md, REPRODUCIBILITY.md