Large Language Models (LLMs) have shown strong capabilities across many domains, yet their evaluation on financial quantitative tasks remains fragmented and largely limited to knowledge-centric question answering. We introduce QuantEval, a benchmark that evaluates LLMs across three essential dimensions of quantitative finance: knowledge-based QA, quantitative mathematical reasoning, and quantitative strategy coding. Unlike prior financial benchmarks, QuantEval integrates a CTA-style backtesting framework that executes model-generated strategies and evaluates them with financial performance metrics, enabling a more realistic assessment of quantitative coding ability. We evaluate a range of state-of-the-art open-source and proprietary LLMs and observe substantial gaps relative to human experts, particularly in reasoning and strategy coding. Finally, we conduct large-scale supervised fine-tuning and reinforcement learning experiments on domain-aligned data, demonstrating consistent improvements. We hope QuantEval will facilitate research on LLMs' quantitative finance capabilities and accelerate their practical adoption in real-world trading workflows. We additionally release the full deterministic backtesting configuration (asset universe, cost model, and metric definitions) to ensure strict reproducibility.
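To make the notion of metric-based strategy evaluation concrete, the following is a minimal sketch of how a model-generated position series might be scored under a deterministic, single-asset, daily-frequency backtest with a proportional cost model. This is an illustrative assumption, not QuantEval's actual harness: the function and parameter names (`evaluate_strategy`, `cost_bps`) are hypothetical, and the real framework's asset universe, cost model, and metric definitions are those in the released configuration.

```python
# Minimal illustrative sketch (NOT QuantEval's API): score a model-generated
# daily position series with a proportional transaction-cost model.
import numpy as np

def evaluate_strategy(positions: np.ndarray,
                      asset_returns: np.ndarray,
                      cost_bps: float = 5.0,
                      periods_per_year: int = 252) -> dict:
    """Compute simple CTA-style performance metrics for one asset.

    positions     : target exposure in [-1, 1] chosen at each close,
                    applied to the NEXT period's return (no look-ahead).
    asset_returns : simple per-period returns of the traded asset.
    cost_bps      : proportional cost (basis points) per unit of turnover.
    """
    pos = np.clip(positions, -1.0, 1.0)
    # Shift positions one period so today's signal earns tomorrow's return.
    pnl = pos[:-1] * asset_returns[1:]
    # Charge costs on turnover, i.e. the change in position each period.
    turnover = np.abs(np.diff(pos, prepend=0.0))[:-1]
    pnl -= turnover * cost_bps * 1e-4

    equity = np.cumprod(1.0 + pnl)
    ann_return = equity[-1] ** (periods_per_year / len(pnl)) - 1.0
    sharpe = np.mean(pnl) / (np.std(pnl) + 1e-12) * np.sqrt(periods_per_year)
    max_dd = np.max(1.0 - equity / np.maximum.accumulate(equity))
    return {"ann_return": ann_return, "sharpe": sharpe, "max_drawdown": max_dd}
```

Because every input (price series, cost parameters, metric formulas) is fixed ahead of time, a harness of this shape is deterministic: running the same model-generated strategy twice yields identical scores, which is the property the released backtesting configuration is meant to guarantee.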