FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks

Yupeng Cao, Haohang Li, Weijin Liu, Wenbo Cao, Anke Xu, Lingfei Qian, Xueqing Peng, Minxue Tang, Zhiyuan Yao, Jimin Huang, K. P. Subbalakshmi, Zining Zhu, Jordan W. Suchow, Yangyang Yu

Recent studies demonstrate that tool-calling capability enables large language models (LLMs) to interact with external environments for long-horizon financial tasks. While existing benchmarks have begun evaluating financial tool calling, they focus on limited scenarios and rely on call-level metrics that fail to capture trajectory-level reasoning quality. To address this gap, we introduce FinTrace, a benchmark comprising 800 expert-annotated trajectories spanning 34 real-world financial task categories across multiple difficulty levels. FinTrace employs a rubric-based evaluation protocol with nine metrics organized along four axes (action correctness, execution efficiency, process quality, and output quality), enabling fine-grained assessment of LLM tool-calling behavior. Our evaluation of 13 LLMs reveals that while frontier models achieve strong tool selection, all models struggle with information utilization and final answer quality, exposing a critical gap between invoking the right tools and reasoning effectively over their outputs. To move beyond diagnosis, we construct FinTrace-Training, the first trajectory-level preference dataset for financial tool calling, containing 8,196 curated trajectories with tool-augmented contexts and preference pairs. We fine-tune Qwen-3.5-9B using supervised fine-tuning (SFT) followed by direct preference optimization (DPO) and show that training on FinTrace-Training consistently improves intermediate reasoning metrics, with DPO suppressing failure modes more effectively than SFT alone. However, end-to-end answer quality remains a bottleneck, indicating that trajectory-level improvements do not yet fully propagate to final output quality.
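
For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO loss for a single preference pair, such as a preferred and a rejected trajectory. The abstract does not specify the exact objective or hyperparameters used for FinTrace-Training, so the function name `dpo_loss`, its arguments, and the default `beta` are illustrative assumptions rather than the authors' implementation.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    `beta` is an assumed value, not one taken from the paper.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)

    # Loss is -log(sigmoid(logits)), written as a numerically stable softplus.
    if logits > 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

The loss equals log 2 when the policy matches the reference, and it shrinks as the policy assigns relatively more probability to the preferred trajectory than to the rejected one.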
