Trading-R1: Financial Trading with LLM Reasoning via Reinforcement Learning

Sep 16, 2025 · Tauric Research

The Problem

Developing structured reasoning on par with professional financial analysts remains a central challenge in AI for finance, where markets demand interpretability and trust.

Existing approaches fall short in two different ways:

Traditional time-series models predict prices but lack explainability.
General-purpose LLMs can write natural-language analysis, but struggle to turn it into disciplined, executable trades.

Trading-R1 is a financially aware reasoning model that closes this gap. It combines strategic thinking and planning for comprehensive thesis composition, fact-grounded analysis, and volatility-adjusted decision making. It aligns reasoning with trading principles through supervised fine-tuning (SFT) and reinforcement learning (RL) under a three-stage, easy-to-hard curriculum.

Contributions

We make four main contributions:

Tauric-TR1-DB, a large-scale financial reasoning corpus. 100k filtered, high-quality samples spanning 18 months (January 2024 – May 2025) across 14 major tickers. The corpus integrates the heterogeneous data used by real traders: technical market data, fundamentals, news, insider sentiment, and macroeconomic indicators.
Reverse chain-of-thought distillation. Proprietary reasoning models return conclusions without their intermediate steps. We reconstruct plausible reasoning traces from those black-box outputs and use them as supervision, so the model learns to write concise, interpretable investment theses.
Reinforcement learning for execution-grade decisions. Trade recommendations are cast as an RL problem over the standard five-tier scale (Strong Buy, Buy, Hold, Sell, Strong Sell), with volatility-adjusted labels used as rewards.
Trading-R1, a financial reasoning LLM for trading. Trained across diverse assets and market regimes (bull and bear), it produces both high-quality analyses and actionable trade recommendations.

Method

Because language models are conditional autoregressive generators, decision quality depends on two coupled priors:

The external prior: the input context.
The internal prior: the model's own intermediate reasoning.

Trading-R1 controls both. It curates high signal-to-noise, finance-grounded inputs, and it scaffolds how the model reasons toward a decision.
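
One way to make the two priors explicit is to marginalize over the reasoning trace. This is a sketch in our own notation: the symbols x (input context), z (intermediate reasoning), and a (trading action) are assumptions introduced for illustration.

```latex
p_\theta(a \mid x) \;=\; \sum_{z} \underbrace{p_\theta(z \mid x)}_{\text{reasoning given context}} \; \underbrace{p_\theta(a \mid x, z)}_{\text{decision given context and reasoning}}
```

Under this reading, curating x improves every term on the right-hand side, while scaffolding z concentrates probability mass on reasoning traces that lead to sound decisions.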

Volatility-driven labels

Rather than predict noisy exact prices, we discretize the output space into five actions. Labels come from a multi-horizon, volatility-aware procedure:

Compute forward returns over 3-, 7-, and 15-day horizons.
Normalize each by rolling 20-period volatility into Sharpe-like signals.
Combine the horizons with weights 0.3, 0.5, and 0.2.
Map the result to actions via asymmetric percentile thresholds that preserve the market's long-term upward drift.

These proxy labels supply both SFT targets and RL rewards, without expensive manual annotation.
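
The labeling steps above can be sketched in a few lines of Python. This is a minimal illustration, not the released pipeline: the function names, the use of daily closing prices, and the percentile cut points in `PERCENTILE_CUTS` are all assumptions. Only the horizons, the 20-period volatility window, the weights, and the five-action scale come from the description above.

```python
import statistics

ACTIONS = ["Strong Sell", "Sell", "Hold", "Buy", "Strong Buy"]
HORIZONS = (3, 7, 15)       # forward-return horizons, in days
WEIGHTS = (0.3, 0.5, 0.2)   # one weight per horizon
VOL_WINDOW = 20             # rolling volatility window, in periods
# Assumed cut points (percentiles of the composite signal). They are
# asymmetric so that Buy-side labels receive more mass than Sell-side ones.
PERCENTILE_CUTS = (0.03, 0.15, 0.53, 0.85)


def composite_signals(prices):
    """Map day index -> weighted, volatility-normalized forward return."""
    rets = [prices[i] / prices[i - 1] - 1.0 for i in range(1, len(prices))]
    signals = {}
    for t in range(VOL_WINDOW, len(prices) - max(HORIZONS)):
        # Volatility uses only the 20 daily returns ending at day t.
        vol = statistics.stdev(rets[t - VOL_WINDOW:t])
        if vol == 0:
            continue  # skip flat windows rather than divide by zero
        score = 0.0
        for h, w in zip(HORIZONS, WEIGHTS):
            forward = prices[t + h] / prices[t] - 1.0
            score += w * forward / vol  # Sharpe-like signal per horizon
        signals[t] = score
    return signals


def label_signals(signals, cuts=PERCENTILE_CUTS):
    """Assign one of the five actions by percentile of the composite signal."""
    ordered = sorted(signals.values())
    n = len(ordered)
    thresholds = [ordered[min(n - 1, int(q * n))] for q in cuts]
    labels = {}
    for t, s in signals.items():
        tier = sum(s > th for th in thresholds)  # 0 (Strong Sell) .. 4 (Strong Buy)
        labels[t] = ACTIONS[tier]
    return labels
```

In practice the thresholds would be fitted on the training window only and then reused unchanged on held-out data, so that the labels themselves do not leak future information.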

Three-stage curriculum

SFT and reinforcement fine-tuning (RFT) are interleaved across three stages. Each stage is warm-started with SFT to set structural priors, then refined with RFT via task-specific rewards:

Structure: reward professional thesis structure and consistent, XML-tagged formatting.
Claims: reward claims supported by direct citations and quotations from the input, reducing hallucination.
Decision: reward market-aligned recommendations using the volatility-aware labels.

This progression stabilizes intermediate reasoning and mitigates error compounding. The model first learns the form of professional analysis, then grounds it in evidence, and finally makes market-driven decisions.
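
To make the three reward signals concrete, here is a minimal sketch of what each stage might score. Everything specific in it is an assumption for illustration: the tag names in `REQUIRED_TAGS`, the use of double-quoted spans as citations, and the distance-based decision score are hypothetical stand-ins, not the reward functions used to train Trading-R1.

```python
import re

ACTIONS = ["Strong Sell", "Sell", "Hold", "Buy", "Strong Buy"]
REQUIRED_TAGS = ("analysis", "evidence", "decision")  # hypothetical tag names


def structure_reward(thesis):
    """Stage 1: fraction of required sections present as matched tag pairs."""
    found = 0
    for tag in REQUIRED_TAGS:
        if re.search(rf"<{tag}>.+?</{tag}>", thesis, flags=re.DOTALL):
            found += 1
    return found / len(REQUIRED_TAGS)


def claims_reward(thesis, context):
    """Stage 2: fraction of quoted spans that appear verbatim in the input."""
    quotes = re.findall(r'"([^"]+)"', thesis)
    if not quotes:
        return 0.0  # no citations at all earns nothing
    grounded = sum(1 for q in quotes if q in context)
    return grounded / len(quotes)


def decision_reward(thesis, label):
    """Stage 3: 1.0 for an exact match, falling by 0.5 per tier of distance."""
    match = re.search(r"<decision>\s*(.+?)\s*</decision>", thesis, flags=re.DOTALL)
    if not match or match.group(1) not in ACTIONS:
        return -1.0  # missing or malformed decision gets the worst score
    distance = abs(ACTIONS.index(match.group(1)) - ACTIONS.index(label))
    return 1.0 - distance / 2.0
```

Grading the decision by tier distance, rather than by exact match alone, is one way to penalize a Buy less than a Strong Sell when the label is Strong Buy; whether the actual reward is shaped this way is not stated here.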

Experiments

Setup

Inputs and outputs: Trading-R1 processes multi-dimensional inputs (20–30k tokens) and generates full investment theses (6–8k tokens).
Training: SFT uses LoRA. The pipeline was trained on 8×H100 (SFT) and 8×H200 (RL) servers.
Evaluation: a strictly causal, out-of-sample backtest (June–August 2024, held out from training).
Metrics: cumulative return (CR), Sharpe ratio (SR), hit rate (HR), and maximum drawdown (MDD).
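
For readers who want to reproduce the four metrics, the sketch below shows one common way to compute them from a series of daily strategy returns. The definitions are assumptions where the text leaves them open: the Sharpe ratio is annualized with 252 trading days and a zero risk-free rate by default, and hit rate is taken as the share of active days with a positive return. `strategy_returns` illustrates the causal constraint: a position chosen on day t only earns the return of day t+1.

```python
import math
import statistics

TRADING_DAYS = 252  # assumed annualization factor


def strategy_returns(positions, asset_returns):
    """Position set at the close of day t earns the asset return of day t+1."""
    return [p * r for p, r in zip(positions[:-1], asset_returns[1:])]


def cumulative_return(daily_returns):
    """Compounded return over the whole period (CR)."""
    equity = 1.0
    for r in daily_returns:
        equity *= 1.0 + r
    return equity - 1.0


def sharpe_ratio(daily_returns, risk_free_daily=0.0):
    """Annualized mean excess return per unit of volatility (SR)."""
    excess = [r - risk_free_daily for r in daily_returns]
    sd = statistics.stdev(excess)
    if sd == 0:
        return 0.0
    return statistics.mean(excess) / sd * math.sqrt(TRADING_DAYS)


def hit_rate(daily_returns):
    """Share of active (non-zero) days with a positive return (HR)."""
    active = [r for r in daily_returns if r != 0]
    if not active:
        return 0.0
    return sum(1 for r in active if r > 0) / len(active)


def max_drawdown(daily_returns):
    """Largest peak-to-trough loss of the equity curve, as a fraction (MDD)."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in daily_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst
```

Multiplying the results of `cumulative_return`, `hit_rate`, and `max_drawdown` by 100 gives percentages comparable in form to the CR%, HR%, and MDD% columns reported below.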

Baselines

We compare against three families of models, plus ablations:

Small models: Qwen-4B, GPT-4.1-nano/mini.
Larger LLMs: GPT-4.1, LLaMA-3.3, LLaMA-Scout, Qwen3-32B.
Reasoning models: DeepSeek, o3-mini, o4-mini.
Ablations: SFT-only and RL-only variants.

Results

Across the evaluated equities and ETFs, Trading-R1 delivers the best balance of risk-adjusted return and drawdown, consistently ranking among the top models. Highlights:

It outperforms GPT-4.1 on AAPL (Sharpe 1.80 vs 1.24) while holding a lower drawdown.
It attains leading hit rates (70.0% on NVDA, 64.0% on SPY).

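The overall hierarchy is small models (SLM) < reasoning models (RLM) < larger LLMs (LLM) < Trading-SFT ≈ Trading-RFT < Trading-R1, where Trading-SFT and Trading-RFT denote the SFT-only and RL-only ablations. This underscores that both scale and specialized reasoning matter for trading.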

Trading-R1 (full model), reported metrics:

Asset	CR% ↑	SR ↑	HR% ↑	MDD% ↓
NVDA	8.08	2.72	70.0	3.80
AAPL	5.82	1.80	63.6	3.68
MSFT	2.38	0.87	60.4	1.90
AMZN	5.39	1.72	63.0	3.20
META	5.12	0.86	50.0	4.65
SPY	3.34	1.60	64.0	1.52

One finding stands out: off-the-shelf reasoning models often underperform general LLMs on trading. Their unguided, lengthy reasoning drifts away from market-relevant evidence. Trading-R1 deliberately prioritizes readability, interpretability, and structured argumentation over marginal metric gains, a trade-off we consider essential for real-world decision support.

Positioning

Trading-R1 is best suited for research support and structured analysis generation for financial professionals. It augments human decision-making where structured reasoning and interpretability are valued, rather than replacing it.

Future work targets real-time deployment, more sample-efficient offline RL, and additional data modalities. The Trading-R1 Terminal is released on GitHub.

Citation

@article{xiao2025trading,
  title={Trading-R1: Financial Trading with LLM Reasoning via Reinforcement Learning},
  author={Xiao, Yijia and Sun, Edward and Chen, Tong and Wu, Fang and Luo, Di and Wang, Wei},
  journal={arXiv preprint arXiv:2509.11420},
  year={2025}
}