- Reinforcement Learning
- Reasoning
- LLM
Trading-R1: Financial Trading with LLM Reasoning via Reinforcement Learning
The Problem
Developing structured reasoning on par with professional financial analysts remains a central challenge in AI for finance, where markets demand interpretability and trust.
Existing approaches fall short in two different ways:
- Traditional time-series models predict prices but lack explainability.
- General-purpose LLMs can write natural-language analysis, but struggle to turn it into disciplined, executable trades.
Trading-R1 is a financially aware reasoning model designed to close this gap. It combines strategic planning for composing a comprehensive thesis, fact-grounded analysis, and volatility-adjusted decision making. Its reasoning is aligned with trading principles through supervised fine-tuning (SFT) and reinforcement learning (RL) under a three-stage, easy-to-hard curriculum.
Contributions
We make four main contributions:
- Tauric-TR1-DB, a large-scale financial reasoning corpus. It contains 100k filtered, high-quality samples spanning 18 months (January 2024 – May 2025) across 14 major tickers, and integrates the heterogeneous data used by real traders: technical market data, fundamentals, news, insider sentiment, and macroeconomic indicators.
- Reverse chain-of-thought distillation. Proprietary reasoning models return conclusions without their intermediate steps. We reconstruct plausible reasoning traces from those black-box outputs and use them as supervision, so the model learns to write concise, interpretable investment theses.
- Reinforcement learning for execution-grade decisions. Trade recommendations are cast as an RL problem over the standard five-tier scale (Strong Buy, Buy, Hold, Sell, Strong Sell), with volatility-adjusted labels used as rewards.
- Trading-R1, a financial reasoning LLM for trading. Trained across diverse assets and market regimes (bull and bear), it produces both high-quality analyses and actionable trade recommendations.
Method
Because language models are conditional autoregressive generators, decision quality depends on two coupled priors:
- The external prior: the input context.
- The internal prior: the model's own intermediate reasoning.
Trading-R1 controls both. It curates high signal-to-noise, finance-grounded inputs, and it scaffolds how the model reasons toward a decision.
Volatility-driven labels
Rather than predict noisy exact prices, we discretize the output space into five actions. Labels come from a multi-horizon, volatility-aware procedure:
- Compute forward returns over 3-, 7-, and 15-day horizons.
- Normalize each by rolling 20-period volatility into Sharpe-like signals.
- Combine the three horizons into one signal with weights 0.3, 0.5, and 0.2, respectively.
- Map the result to actions via asymmetric percentile thresholds that preserve the market's long-term upward drift.
These proxy labels supply both SFT targets and RL rewards, without expensive manual annotation.
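The labeling steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the use of daily close prices, dividing each forward return by the sample standard deviation of the previous 20 daily returns, and the percentile cut points in `CUTS` are all hypothetical choices made for the example. Only the horizons, the weights, and the 20-period window come from the text.

```python
import random
from statistics import stdev

ACTIONS = ["Strong Sell", "Sell", "Hold", "Buy", "Strong Buy"]
HORIZONS = (3, 7, 15)        # forward-return horizons in days (from the text)
WEIGHTS = (0.3, 0.5, 0.2)    # horizon weights (from the text)
VOL_WINDOW = 20              # rolling volatility window (from the text)
CUTS = (0.03, 0.15, 0.53, 0.85)  # ASSUMPTION: illustrative asymmetric percentiles


def composite_signals(prices):
    """Map each usable day t to a weighted, volatility-normalized forward return."""
    # daily[k] is the simple return from day k to day k + 1
    daily = [prices[i] / prices[i - 1] - 1.0 for i in range(1, len(prices))]
    signals = {}
    for t in range(VOL_WINDOW, len(prices) - max(HORIZONS)):
        # Volatility uses only returns that end on or before day t
        vol = stdev(daily[t - VOL_WINDOW:t])
        if vol == 0:
            continue  # skip flat windows to avoid division by zero
        signals[t] = sum(
            w * (prices[t + h] / prices[t] - 1.0) / vol
            for h, w in zip(HORIZONS, WEIGHTS)
        )
    return signals


def percentile(sorted_vals, q):
    """Linearly interpolated quantile of a pre-sorted list, q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def label_actions(signals, cuts=CUTS):
    """Assign one of five actions by counting how many thresholds a signal exceeds."""
    ordered = sorted(signals.values())
    thresholds = [percentile(ordered, q) for q in cuts]
    return {t: ACTIONS[sum(s > th for th in thresholds)] for t, s in signals.items()}


# Usage on a synthetic random-walk price series (not real market data)
random.seed(0)
prices = [100.0]
for _ in range(300):
    prices.append(prices[-1] * (1 + random.gauss(0.0005, 0.01)))
signals = composite_signals(prices)
labels = label_actions(signals)
```

Because the cut points sit below the median on the sell side and close to it on the buy side, more samples fall into Buy and Strong Buy than into Sell and Strong Sell, which is one way to encode an upward drift. Note that the thresholds here are computed over the whole sample, which is acceptable for building training labels but would leak future information if used at inference time.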
Three-stage curriculum
SFT and reinforcement fine-tuning (RFT) are interleaved across three stages. Each stage is warm-started with SFT to set structural priors, then refined with RFT via task-specific rewards:
- Structure: reward professional thesis structure and consistent, XML-tagged formatting.
- Claims: reward claims supported by direct citations and quotations from the input, reducing hallucination.
- Decision: reward market-aligned recommendations using the volatility-aware labels.
This progression stabilizes intermediate reasoning and mitigates error compounding. The model first learns the form of professional analysis, then grounds it in evidence, and finally makes market-driven decisions.
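To make the three reward types concrete, here is a hedged sketch of what one reward function per stage might look like. Everything in it is an assumption for illustration: the tag names in `REQUIRED_TAGS`, the idea of treating double-quoted spans as citations, and the linear decay of the decision reward with tier distance are not taken from the source, which does not specify the exact reward definitions.

```python
import re

ACTIONS = ["Strong Sell", "Sell", "Hold", "Buy", "Strong Buy"]
REQUIRED_TAGS = ("analysis", "evidence", "decision")  # ASSUMPTION: hypothetical tags


def structure_reward(thesis: str) -> float:
    """Stage 1: fraction of required XML sections that appear exactly once."""
    hits = 0
    for tag in REQUIRED_TAGS:
        matches = re.findall(rf"<{tag}>.*?</{tag}>", thesis, flags=re.DOTALL)
        hits += len(matches) == 1
    return hits / len(REQUIRED_TAGS)


def claims_reward(thesis: str, context: str) -> float:
    """Stage 2: fraction of quoted spans that occur verbatim in the input context."""
    quotes = re.findall(r'"([^"]+)"', thesis)
    if not quotes:
        return 0.0  # no citations at all earns nothing
    return sum(q in context for q in quotes) / len(quotes)


def decision_reward(thesis: str, label: str) -> float:
    """Stage 3: 1.0 for the labeled tier, decaying linearly with tier distance."""
    m = re.search(r"<decision>\s*(.*?)\s*</decision>", thesis, flags=re.DOTALL)
    if not m or m.group(1) not in ACTIONS:
        return -1.0  # unparseable decisions are penalized below any valid answer
    dist = abs(ACTIONS.index(m.group(1)) - ACTIONS.index(label))
    return 1.0 - dist / (len(ACTIONS) - 1)


# Usage with a toy context and thesis (invented text, not model output)
context = "Q2 revenue rose 12% year over year. Insiders sold 40,000 shares in May."
thesis = (
    '<analysis>Growth is solid: "revenue rose 12% year over year".</analysis>'
    '<evidence>"Insiders sold 40,000 shares" tempers the view.</evidence>'
    "<decision>Buy</decision>"
)
rewards = (
    structure_reward(thesis),
    claims_reward(thesis, context),
    decision_reward(thesis, "Strong Buy"),
)
```

Grading the decision by tier distance, rather than by exact match only, gives partial credit for a near miss such as Buy against a Strong Buy label, which tends to produce a smoother learning signal on an ordinal scale.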
Experiments
Setup
- Inputs and outputs: Trading-R1 processes multi-dimensional inputs (20–30k tokens) and generates full investment theses (6–8k tokens).
- Training: SFT uses LoRA. The pipeline was trained on 8×H100 (SFT) and 8×H200 (RL) servers.
- Evaluation: a strictly causal, out-of-sample backtest (June–August 2024, held out from training).
- Metrics: cumulative return (CR), Sharpe ratio (SR), hit rate (HR), and maximum drawdown (MDD).
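For readers who want to reproduce the four metrics on their own backtest output, the sketch below computes them from a list of per-period strategy returns. The definitions are common conventions chosen for this example and may differ from the ones used in the paper: it assumes a zero risk-free rate, annualizes the Sharpe ratio with 252 periods per year, and counts the hit rate over periods with a non-zero return only. The function name `backtest_metrics` is hypothetical.

```python
import math


def backtest_metrics(returns, periods_per_year=252):
    """Compute CR, SR, HR, and MDD from per-period returns given as fractions."""
    n = len(returns)
    equity, peak, mdd = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r                       # compound the equity curve
        peak = max(peak, equity)                # running high-water mark
        mdd = max(mdd, (peak - equity) / peak)  # deepest peak-to-trough loss

    mean = sum(returns) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in returns) / (n - 1))
    # ASSUMPTION: zero risk-free rate, annualized by sqrt(periods_per_year)
    sharpe = math.sqrt(periods_per_year) * mean / std if std > 0 else float("nan")

    # ASSUMPTION: flat periods (no position, zero return) are excluded from HR
    active = [r for r in returns if r != 0.0]
    hit = sum(r > 0 for r in active) / len(active) if active else float("nan")

    return {
        "CR%": (equity - 1.0) * 100,
        "SR": sharpe,
        "HR%": hit * 100,
        "MDD%": mdd * 100,
    }


# Usage on a toy return series (invented numbers, not results from the paper)
metrics = backtest_metrics([0.01, -0.02, 0.03, 0.0, 0.01])
```

The function expects at least two returns, since the sample standard deviation is undefined for a single observation.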
Baselines
We compare against three families of models, plus ablations:
- Small models: Qwen-4B, GPT-4.1-nano/mini.
- Larger LLMs: GPT-4.1, LLaMA-3.3, LLaMA-Scout, Qwen3-32B.
- Reasoning models: DeepSeek, o3-mini, o4-mini.
- Ablations: SFT-only and RL-only variants.
Results
Across the evaluated equities and ETFs, Trading-R1 delivers the best balance of risk-adjusted return and drawdown, consistently ranking among the top models. Highlights:
- It outperforms GPT-4.1 on AAPL (Sharpe 1.80 vs 1.24) while holding a lower drawdown.
- It attains leading hit rates (70.0% on NVDA, 64.0% on SPY).
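The overall hierarchy is small language models (SLM) < reasoning language models (RLM) < large language models (LLM) < Trading-SFT ≈ Trading-RFT < Trading-R1, where Trading-SFT and Trading-RFT are the SFT-only and RL-only ablations. This underscores that both scale and specialized reasoning matter for trading.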
Trading-R1 (full model), reported metrics:
| Asset | CR% ↑ | SR ↑ | HR% ↑ | MDD% ↓ |
|---|---|---|---|---|
| NVDA | 8.08 | 2.72 | 70.0 | 3.80 |
| AAPL | 5.82 | 1.80 | 63.6 | 3.68 |
| MSFT | 2.38 | 0.87 | 60.4 | 1.90 |
| AMZN | 5.39 | 1.72 | 63.0 | 3.20 |
| META | 5.12 | 0.86 | 50.0 | 4.65 |
| SPY | 3.34 | 1.60 | 64.0 | 1.52 |
One finding stands out: off-the-shelf reasoning models often underperform general LLMs on trading. Their unguided, lengthy reasoning drifts away from market-relevant evidence. Trading-R1 deliberately prioritizes readability, interpretability, and structured argumentation over marginal metric gains, a trade-off we consider essential for real-world decision support.
Positioning
Trading-R1 is best suited for research support and structured analysis generation for financial professionals. It augments human decision-making where structured reasoning and interpretability are valued, rather than replacing it.
Future work targets real-time deployment, more sample-efficient offline RL, and additional data modalities. The Trading-R1 Terminal is released on GitHub.
Citation
@article{xiao2025trading,
  title={Trading-R1: Financial Trading with LLM Reasoning via Reinforcement Learning},
  author={Xiao, Yijia and Sun, Edward and Chen, Tong and Wu, Fang and Luo, Di and Wang, Wei},
  journal={arXiv preprint arXiv:2509.11420},
  year={2025}
}