Multi-turn tool-calling LLMs (models capable of invoking external APIs or tools across several user turns) have emerged as a key feature of modern AI assistants, enabling extended dialogues that range from benign tasks to critical business, medical, and financial operations. Yet deploying multi-turn pipelines remains difficult in many safety-critical industries due to ongoing concerns about model robustness. While standardized benchmarks such as the Berkeley Function-Calling Leaderboard (BFCL) have built confidence in advanced function-calling models (e.g., Salesforce's xLAM V2), visibility into conversation-level robustness across multiple turns remains limited, especially given these models' exposure to real-world systems. In this paper, we introduce Assertion-Conditioned Compliance (A-CC), a novel evaluation paradigm for multi-turn function-calling dialogues. A-CC provides holistic metrics that evaluate a model's behavior when it is confronted with misleading assertions originating from two distinct vectors: (1) user-sourced assertions (USAs), which measure sycophancy toward plausible but misinformed user beliefs, and (2) function-sourced assertions (FSAs), which measure compliance with plausible but contradictory system policies (e.g., stale hints from unmaintained tools). Our results show that models are highly vulnerable to both USA sycophancy and FSA policy conflicts, confirming that assertion-conditioned compliance failures constitute a critical, latent vulnerability in deployed agents.