Existing benchmarks for tool-augmented language models (TaLMs) lack fine-grained control over task difficulty and remain vulnerable to data contamination. We present FuncBenchGen, a unified, contamination-free framework that stress-tests TaLMs with synthetically generated multi-step tool-use tasks. The key idea is to cast tool use as traversal over a hidden function-dependency DAG: the model must infer the correct sequence of calls to compute a target value. FuncBenchGen allows precise control over task difficulty (e.g., graph size, dependency depth, and number of distractor functions) while avoiding pretraining/test-time leakage. Our evaluation demonstrates that reasoning-optimized models consistently outperform general-purpose models, with GPT-5 significantly outperforming all other models evaluated. Performance declines sharply as dependency depth increases. Furthermore, connected distractors -- irrelevant functions sharing type-compatible variables with relevant functions -- prove especially difficult to handle. Strong models also often make syntactically valid function calls but propagate incorrect or stale argument values across steps, revealing brittle state tracking by LLMs in multi-turn tool use. Motivated by this observation, we introduce a simple mitigation strategy that explicitly restates prior variable values to the agent at each step. Surprisingly, this lightweight change yields substantial gains across models, e.g., raising GPT-5's success rate from 62.5% to 81.3%.
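The core formulation above -- a hidden function-dependency DAG whose target value is computable only by chaining calls in dependency order -- can be illustrated with a minimal sketch. This is a hypothetical toy construction, not the authors' implementation: the function names (`f0`, `f1`, ...), the chain-shaped dependency structure, and the arithmetic payload are all illustrative assumptions.

```python
# Hypothetical sketch of a FuncBenchGen-style task: a hidden dependency DAG
# of synthetic functions, plus a reference solver that traverses it.
# Not the authors' implementation; names and structure are illustrative.
import random

def make_dag_task(n_funcs=5, seed=0):
    """Build a chain-shaped dependency DAG where f_i consumes f_{i-1}'s output.

    Returns (registry, deps, target): callable 'tools', their dependency
    lists, and the name of the function whose value the agent must compute.
    """
    rng = random.Random(seed)
    registry, deps = {}, {}
    for i in range(n_funcs):
        coeff = rng.randint(1, 9)  # private per-function payload
        deps[f"f{i}"] = [f"f{i-1}"] if i > 0 else []
        # Each function combines its parents' values with its coefficient.
        registry[f"f{i}"] = (lambda c: lambda *args: c + sum(args))(coeff)
    return registry, deps, f"f{n_funcs - 1}"

def solve_by_traversal(registry, deps, target):
    """Reference solver: resolve dependencies depth-first, caching results.

    An agent succeeds on the task iff its sequence of tool calls reproduces
    this value; distractor functions would be extra registry entries that
    never appear on the path to the target.
    """
    cache = {}
    def resolve(name):
        if name not in cache:
            args = [resolve(p) for p in deps[name]]
            cache[name] = registry[name](*args)
        return cache[name]
    return resolve(target)
```

Under this framing, the difficulty knobs mentioned above map directly onto generator parameters: dependency depth is the chain length, and distractors are additional functions wired to type-compatible but irrelevant variables.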