Trading-R1: Financial Trading with LLM Reasoning via Reinforcement Learning

Tauric Research

The Problem

Developing structured reasoning on par with professional financial analysts remains a central challenge in AI for finance, where markets demand interpretability and trust.

Existing approaches fall short in two different ways:

  • Traditional time-series models predict prices but lack explainability.
  • General-purpose LLMs can write natural-language analysis, but struggle to turn it into disciplined, executable trades.

Trading-R1 is a financially aware reasoning model that closes this gap. It combines strategic thinking and planning for comprehensive thesis composition, fact-grounded analysis, and volatility-adjusted decision making. Its reasoning is aligned with trading principles through supervised fine-tuning (SFT) and reinforcement learning (RL) under a three-stage, easy-to-hard curriculum.

Contributions

We make four main contributions:

  1. Tauric-TR1-DB, a large-scale financial reasoning corpus. 100k filtered, high-quality samples spanning 18 months (January 2024 – May 2025) across 14 major tickers. The corpus integrates the heterogeneous data used by real traders: technical market data, fundamentals, news, insider sentiment, and macroeconomic indicators.
  2. Reverse chain-of-thought distillation. Proprietary reasoning models return conclusions without their intermediate steps. We reconstruct plausible reasoning traces from those black-box outputs and use them as supervision, so the model learns to write concise, interpretable investment theses.
  3. Reinforcement learning for execution-grade decisions. Trade recommendations are cast as an RL problem over the standard five-tier scale (Strong Buy, Buy, Hold, Sell, Strong Sell), with volatility-adjusted labels used as rewards.
  4. Trading-R1, a financial reasoning LLM for trading. Trained across diverse assets and market regimes (bull and bear), it produces both high-quality analyses and actionable trade recommendations.

Method

Because language models are conditional autoregressive generators, decision quality depends on two coupled priors:

  • The external prior: the input context.
  • The internal prior: the model's own intermediate reasoning.

Trading-R1 controls both. It curates high signal-to-noise, finance-grounded inputs, and it scaffolds how the model reasons toward a decision.

Volatility-driven labels

Rather than predict noisy exact prices, we discretize the output space into five actions. Labels come from a multi-horizon, volatility-aware procedure:

  1. Compute forward returns over 3-, 7-, and 15-day horizons.
  2. Normalize each by rolling 20-period volatility into Sharpe-like signals.
  3. Combine the horizons with weights 0.3, 0.5, and 0.2.
  4. Map the result to actions via asymmetric percentile thresholds that preserve the market's long-term upward drift.

These proxy labels supply both SFT targets and RL rewards, without expensive manual annotation.
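
The four labeling steps can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the horizons, weights, and 20-period window come from the text, while the percentile cut points in `CUTS`, the use of daily-return volatility for every horizon, and the function names are assumptions made for the example.

```python
from bisect import bisect_right
from statistics import pstdev

HORIZONS = (3, 7, 15)      # forward-return horizons in days (from the text)
WEIGHTS = (0.3, 0.5, 0.2)  # horizon weights (from the text)
VOL_WINDOW = 20            # rolling volatility window (from the text)
# ASSUMPTION: illustrative cut points; the text only says they are asymmetric.
CUTS = (0.03, 0.15, 0.53, 0.85)
ACTIONS = ("Strong Sell", "Sell", "Hold", "Buy", "Strong Buy")


def composite_signals(prices):
    """Return {day index: weighted, volatility-normalized forward return}."""
    daily = [prices[i] / prices[i - 1] - 1.0 for i in range(1, len(prices))]
    signals = {}
    for t in range(VOL_WINDOW, len(prices) - max(HORIZONS)):
        # Volatility of the 20 daily returns ending at day t (no look-ahead).
        vol = pstdev(daily[t - VOL_WINDOW:t])
        if vol == 0:
            continue  # skip days where the signal is undefined
        score = 0.0
        for horizon, weight in zip(HORIZONS, WEIGHTS):
            forward = prices[t + horizon] / prices[t] - 1.0
            score += weight * forward / vol  # Sharpe-like signal per horizon
        signals[t] = score
    return signals


def assign_labels(signals):
    """Map each signal to an action by its percentile rank within the sample."""
    ordered = sorted(signals.values())
    labels = {}
    for t, score in signals.items():
        rank = bisect_right(ordered, score) / len(ordered)  # in (0, 1]
        tier = sum(1 for cut in CUTS if rank > cut)
        labels[t] = ACTIONS[tier]
    return labels
```

With these assumed cut points, roughly 47% of samples land in Buy or Strong Buy and 15% in Sell or Strong Sell, which is one way to encode an upward drift in the label distribution.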

Three-stage curriculum

SFT and reinforcement fine-tuning (RFT) are interleaved across three stages. Each stage is warm-started with SFT to set structural priors, then refined with RFT via task-specific rewards:

  1. Structure: reward professional thesis structure and consistent, XML-tagged formatting.
  2. Claims: reward claims supported by direct citations and quotations from the input, reducing hallucination.
  3. Decision: reward market-aligned recommendations using the volatility-aware labels.

This progression stabilizes intermediate reasoning and mitigates error compounding. The model first learns the form of professional analysis, then grounds it in evidence, and finally makes market-driven decisions.
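
To make the three reward signals concrete, here is a toy sketch of one scoring function per stage. Everything specific in it is an assumption for illustration: the tag names in `REQUIRED_TAGS`, the use of double-quoted spans as "citations", and the linear distance penalty in `decision_reward` are not taken from the paper, which does not spell out its reward definitions here.

```python
import re

# ASSUMPTION: hypothetical tag names; the text only says output is XML-tagged.
REQUIRED_TAGS = ("analysis", "evidence", "decision")
ACTIONS = ("Strong Sell", "Sell", "Hold", "Buy", "Strong Buy")


def structure_reward(thesis):
    """Stage 1: fraction of required sections present as open/close tag pairs."""
    found = sum(
        1 for tag in REQUIRED_TAGS
        if re.search(rf"<{tag}>.*?</{tag}>", thesis, flags=re.DOTALL)
    )
    return found / len(REQUIRED_TAGS)


def claims_reward(thesis, context):
    """Stage 2: fraction of quoted passages that appear verbatim in the input."""
    quotes = re.findall(r'"([^"]+)"', thesis)
    if not quotes:
        return 0.0  # no quotations means no grounded claims to reward
    return sum(1 for quote in quotes if quote in context) / len(quotes)


def decision_reward(thesis, label):
    """Stage 3: 1.0 for matching the label, decreasing with tier distance."""
    match = re.search(r"<decision>\s*(.*?)\s*</decision>", thesis, flags=re.DOTALL)
    if not match or match.group(1) not in ACTIONS:
        return 0.0  # missing or unparseable decision earns nothing
    distance = abs(ACTIONS.index(match.group(1)) - ACTIONS.index(label))
    return 1.0 - distance / (len(ACTIONS) - 1)
```

In a curriculum like the one described, each function would be used only in its own stage, so the model is never asked to optimize a decision reward before it can reliably produce a parseable, grounded thesis.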

Experiments

Setup

  • Inputs and outputs: Trading-R1 processes multi-dimensional inputs (20–30k tokens) and generates full investment theses (6–8k tokens).
  • Training: SFT uses LoRA. The pipeline was trained on 8×H100 (SFT) and 8×H200 (RL) servers.
  • Evaluation: a strictly causal, out-of-sample backtest (June–August 2024, held out from training).
  • Metrics: cumulative return (CR), Sharpe ratio (SR), hit rate (HR), and maximum drawdown (MDD).
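
For reference, the four metrics can be computed from a series of daily returns as sketched below. The definitions are the common textbook ones and are assumptions here: the 252-day annualization factor, the zero risk-free rate, and counting a "hit" as any active day with a positive strategy return may differ from the paper's exact conventions. `strategy_returns` shows one way to keep a backtest causal, by pairing each position with the following day's return.

```python
from statistics import mean, stdev

TRADING_DAYS = 252  # ASSUMPTION: annualization factor for daily returns


def strategy_returns(positions, asset_returns):
    """positions[t] is decided at the close of day t and earns asset_returns[t + 1]."""
    return [p * r for p, r in zip(positions[:-1], asset_returns[1:])]


def cumulative_return(returns):
    """Compounded return over the whole period."""
    total = 1.0
    for r in returns:
        total *= 1.0 + r
    return total - 1.0


def sharpe_ratio(returns, risk_free_daily=0.0):
    """Annualized mean excess return divided by its sample standard deviation."""
    excess = [r - risk_free_daily for r in returns]
    sd = stdev(excess)
    return 0.0 if sd == 0 else mean(excess) / sd * TRADING_DAYS ** 0.5


def hit_rate(returns):
    """Share of active (non-flat) days with a positive strategy return."""
    active = [r for r in returns if r != 0.0]
    return sum(1 for r in active if r > 0) / len(active) if active else 0.0


def max_drawdown(returns):
    """Largest peak-to-trough decline of the equity curve, as a fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst
```

Positions here are signed exposures (for example 1 for long, -1 for short, 0 for flat); how the five-tier recommendations map to position sizes is left open.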

Baselines

We compare against three families of models, plus ablations:

  • Small models: Qwen-4B, GPT-4.1-nano/mini.
  • Larger LLMs: GPT-4.1, LLaMA-3.3, LLaMA-Scout, Qwen3-32B.
  • Reasoning models: DeepSeek, O3-mini, O4-mini.
  • Ablations: SFT-only (Trading-SFT) and RL-only (Trading-RFT) variants.

Results

Across the evaluated equities and ETFs, Trading-R1 delivers the best balance of risk-adjusted return and drawdown, consistently ranking among the top models. Highlights:

  • It outperforms GPT-4.1 on AAPL (Sharpe 1.80 vs 1.24) while holding a lower drawdown.
  • It attains leading hit rates (70.0% on NVDA, 64.0% on SPY).

The overall hierarchy is SLM < RLM < LLM < Trading-SFT ≈ Trading-RFT < Trading-R1, where SLM, RLM, and LLM denote the small, reasoning, and larger general-purpose baseline families. This underscores that both scale and specialized reasoning matter for trading.

Trading-R1 (full model), reported metrics:

Asset   CR% ↑   SR ↑   HR% ↑   MDD% ↓
NVDA    8.08    2.72   70.0    3.80
AAPL    5.82    1.80   63.6    3.68
MSFT    2.38    0.87   60.4    1.90
AMZN    5.39    1.72   63.0    3.20
META    5.12    0.86   50.0    4.65
SPY     3.34    1.60   64.0    1.52

One finding stands out: off-the-shelf reasoning models often underperform general LLMs on trading. Their unguided, lengthy reasoning drifts away from market-relevant evidence. Trading-R1 deliberately prioritizes readability, interpretability, and structured argumentation over marginal metric gains, a trade-off we consider essential for real-world decision support.

Positioning

Trading-R1 is best suited for research support and structured analysis generation for financial professionals. It augments human decision-making where structured reasoning and interpretability are valued, rather than replacing it.

Future work targets real-time deployment, more sample-efficient offline RL, and additional data modalities. The Trading-R1 Terminal is released on GitHub.

Citation

@article{xiao2025trading,
  title={Trading-R1: Financial Trading with LLM Reasoning via Reinforcement Learning},
  author={Xiao, Yijia and Sun, Edward and Chen, Tong and Wu, Fang and Luo, Di and Wang, Wei},
  journal={arXiv preprint arXiv:2509.11420},
  year={2025}
}