Recent studies demonstrate that tool-calling capability enables large language models (LLMs) to interact with external environments for long-horizon financial tasks. While existing benchmarks have begun evaluating financial tool calling, they focus on limited scenarios and rely on call-level metrics that fail to capture trajectory-level reasoning quality. To address this gap, we introduce FinTrace, a benchmark comprising 800 expert-annotated trajectories spanning 34 real-world financial task categories across multiple difficulty levels. FinTrace employs a rubric-based evaluation protocol with nine metrics organized along four axes (action correctness, execution efficiency, process quality, and output quality), enabling fine-grained assessment of LLM tool-calling behavior. Our evaluation of 13 LLMs reveals that while frontier models achieve strong tool selection, all models struggle with information utilization and final answer quality, exposing a critical gap between invoking the right tools and reasoning effectively over their outputs. To move beyond diagnosis, we construct FinTrace-Training, the first trajectory-level preference dataset for financial tool calling, containing 8,196 curated trajectories with tool-augmented contexts and preference pairs. We fine-tune Qwen-3.5-9B using supervised fine-tuning (SFT) followed by direct preference optimization (DPO) and show that training on FinTrace-Training consistently improves intermediate reasoning metrics, with DPO more effectively suppressing failure modes. However, end-to-end answer quality remains a bottleneck, indicating that trajectory-level improvements do not yet fully propagate to final output quality.
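For readers unfamiliar with the DPO stage named in the abstract, the standard DPO objective for a single preference pair can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' training code: the helper name `dpo_loss`, the default `beta`, and the idea of passing summed trajectory log-probabilities as plain floats are all introduced here for illustration, and the abstract does not report any hyperparameters.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (hypothetical helper).

    Each argument is the summed log-probability of a full trajectory
    under either the policy being trained or the frozen reference model.
    beta=0.1 is an assumed value; the source does not state one.
    """
    # Implicit rewards: how much more the policy favors each trajectory
    # than the reference model does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # Scaled gap between the preferred and the dispreferred trajectory.
    logits = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(logits)), written as softplus(-logits) so that large
    # magnitudes do not overflow.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

When the policy assigns the same log-probabilities as the reference model, the loss equals log 2 (about 0.693); it falls below that value as the policy shifts probability toward the preferred trajectory and away from the dispreferred one.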