From Feedback Loops to Policy Updates: Reinforcement Fine-Tuning for LLM-Based Alpha Factor Discovery

Modern quantitative trading increasingly relies on systematic models to extract predictive signals from large-scale financial data, where alpha factor discovery plays a central role in transforming market observations into tradable signals. Recent LLM-based methods have shown promise in automating factor generation, but most still rely on prompt-level generation--evaluation--feedback loops for iterative optimization. As the loop grows longer, repeatedly appended historical candidates and feedback can cause context explosion, increase inference cost, dilute useful information, and introduce feedback drift. Moreover, these methods often depend on very large LLMs whose stable generation preferences may lead to structurally similar expressions, redundant candidates, and search stagnation. To address these limitations, we propose \textsc{QuantEvolver}, a self-evolving alpha factor discovery framework based on reinforcement fine-tuning. Instead of accumulating feedback in the prompt, \textsc{QuantEvolver} converts executable quantitative evaluation into policy updates, enabling a Miner LLM to internalize historical optimization experience through parameter learning. Specifically, \textsc{QuantEvolver} constructs high-quality seed factors, builds diverse seed--time-window training tasks, generates executable Factor DSL expressions, evaluates them through Regime Backtest, and optimizes the Miner LLM with Diversity-Complementarity Reward. During training, high-quality factors are continuously accumulated in a Mined Factor Database, which serves as the final discovered factor library. Extensive experiments on three realistic market benchmarks demonstrate the effectiveness of \textsc{QuantEvolver}: it consistently improves the primary evaluation metric of each task over existing LLM-based alpha factor discovery baselines and produces higher-quality, more complementary factor pools.
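The abstract does not specify how the Diversity-Complementarity Reward is computed. As a purely illustrative sketch of the general idea of turning an executable evaluation into a scalar reward, the Python below assumes a toy reward: the absolute correlation between a candidate factor and next-period returns, minus a penalty for the candidate's highest absolute correlation with factors already in the pool. The names `pearson`, `toy_reward`, and `penalty`, and the formula itself, are assumptions made for illustration and are not taken from \textsc{QuantEvolver}.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def toy_reward(candidate, returns, pool, penalty=0.5):
    """Hypothetical reward: predictive quality minus redundancy with the pool.

    quality    = |corr(candidate factor values, next-period returns)|
    redundancy = max |corr(candidate, existing factor)| over the pool (0 if empty)
    """
    quality = abs(pearson(candidate, returns))
    redundancy = max((abs(pearson(candidate, f)) for f in pool), default=0.0)
    return quality - penalty * redundancy


# Made-up numbers, for illustration only.
returns = [0.01, -0.02, 0.03, 0.00, -0.01]
candidate = [0.2, -0.1, 0.4, 0.1, -0.3]
existing = [0.4, -0.2, 0.8, 0.2, -0.6]  # a scaled copy of the candidate

print(toy_reward(candidate, returns, pool=[]))          # no redundancy penalty
print(toy_reward(candidate, returns, pool=[existing]))  # penalized as redundant
```

Under this assumed form, a candidate that merely rescales a factor already in the pool receives a lower reward than the same candidate scored against an empty pool. A policy trained on such a signal would therefore be pushed toward expressions that add new information rather than repeat existing ones.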
