Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents

27 pages, 5 figures, 9 tables | Code and data: https://github.com/ibm-client-engineering/output-drift-financial-llms | To appear in the 2nd ICLR Workshop on Advances in Financial AI: Towards Agentic and Responsible Systems (ICLR 2026)

LLM agents struggle with regulatory audit replay: when asked to reproduce a flagged transaction decision with identical inputs, many deployments fail to return consistent results. We introduce the Determinism-Faithfulness Assurance Harness (DFAH), a framework for measuring trajectory determinism, decision determinism, and evidence-conditioned faithfulness in tool-using agents deployed in financial services. Across 4,700+ agentic runs (7 models, 4 providers, 3 financial benchmarks with 50 cases each at T=0.0), we find that decision determinism and task accuracy are not detectably correlated (r = -0.11, 95% CI [-0.49, 0.31], p = 0.63, n = 21 configurations): models can be deterministic without being accurate, and accurate without being deterministic. Because neither metric predicts the other in our sample, both must be measured independently, which is precisely what DFAH provides. Small models (7-20B) achieve near-perfect determinism through rigid pattern matching at the cost of accuracy (20-42%), while frontier models show moderate determinism (50-96%) with variable accuracy. No model achieves both perfect determinism and high accuracy, supporting DFAH's multi-dimensional measurement approach. We provide three financial benchmarks (compliance triage, portfolio constraints, and DataOps exceptions; 50 cases each) together with an open-source stress-test harness. Across these benchmarks and DFAH evaluation settings, Tier 1 models with schema-first architectures achieved determinism levels consistent with audit replay requirements.
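To make the independence of the two metrics concrete, the sketch below scores repeated replays of a single case on both axes. It is an illustration only, not the DFAH implementation: the abstract does not give the exact metric definitions, so the modal-agreement definition of decision determinism, the function names, and the replay data are all assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def decision_determinism(decisions: Sequence[Hashable]) -> float:
    """Fraction of replays that agree with the most common decision.

    Assumed definition for illustration; the paper may define it differently.
    """
    if not decisions:
        raise ValueError("need at least one replay")
    _, modal_count = Counter(decisions).most_common(1)[0]
    return modal_count / len(decisions)


def accuracy(decisions: Sequence[Hashable], gold: Hashable) -> float:
    """Fraction of replays whose decision matches the gold label."""
    if not decisions:
        raise ValueError("need at least one replay")
    return sum(d == gold for d in decisions) / len(decisions)


# Hypothetical replays of one compliance-triage case with identical inputs.
gold_label = "clear"
rigid_model = ["escalate"] * 8                      # always the same, always wrong
variable_model = ["clear"] * 6 + ["escalate"] * 2   # mostly right, not replayable

for name, runs in [("rigid", rigid_model), ("variable", variable_model)]:
    print(
        f"{name}: determinism={decision_determinism(runs):.2f}, "
        f"accuracy={accuracy(runs, gold_label):.2f}"
    )
```

In this toy data the rigid model is fully deterministic but never correct, while the variable model is more accurate but would not pass an exact replay, which is why a harness has to report both numbers rather than infer one from the other.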
