TAI3: Testing Agent Integrity in Interpreting User Intent

LLM agents are increasingly deployed to automate real-world tasks by invoking APIs in response to natural language instructions. While powerful, they often misinterpret user intent, producing actions that diverge from the user's intended goal, especially as external toolkits evolve. Traditional software testing assumes structured inputs and thus falls short in handling the ambiguity of natural language. We introduce TAI3, an API-centric stress testing framework that systematically uncovers intent integrity violations in LLM agents. Unlike prior work focused on fixed benchmarks or adversarial inputs, TAI3 generates realistic tasks from toolkit documentation and applies targeted mutations that preserve user intent while exposing subtle agent errors. To guide testing, we propose semantic partitioning, which organizes natural language tasks into meaningful categories based on toolkit API parameters and their equivalence classes. Within each partition, seed tasks are mutated, and the resulting mutants are ranked by a lightweight predictor that estimates the likelihood of triggering agent errors. To improve efficiency, TAI3 maintains a datatype-aware strategy memory that retrieves and adapts effective mutation patterns from past cases. Experiments on 80 toolkit APIs demonstrate that TAI3 effectively uncovers intent integrity violations, significantly outperforming baselines in both error-exposing rate and query efficiency. Moreover, TAI3 generalizes well to stronger target models even when smaller LLMs are used for test generation, and it adapts to evolving APIs across domains.
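
As a rough illustration of semantic partitioning and predictor-guided ranking, the following minimal Python sketch enumerates partitions for one API and orders the mutants of a seed task by a score. Everything in it is an assumption made for illustration and is not taken from the TAI3 implementation: the `transfer_funds` API, its parameters and equivalence classes, the two mutators, and the word-count `toy_predictor` that stands in for the lightweight learned predictor.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Partition:
    """One test category: an API parameter paired with one of its equivalence classes."""
    api: str
    param: str
    eq_class: str


def build_partitions(api: str, param_classes: Dict[str, List[str]]) -> List[Partition]:
    # One partition per (parameter, equivalence class) pair.
    return [
        Partition(api, param, eq_class)
        for param, classes in param_classes.items()
        for eq_class in classes
    ]


def rank_mutants(
    seed: str,
    mutators: List[Callable[[str], str]],
    predictor: Callable[[str], float],
) -> List[str]:
    # Apply every mutator to the seed, then sort by predicted error likelihood (highest first).
    mutants = [mutate(seed) for mutate in mutators]
    return sorted(mutants, key=predictor, reverse=True)


# Hypothetical API description (assumed for this sketch).
param_classes = {
    "amount": ["valid", "zero", "negative"],
    "currency": ["supported", "unsupported"],
}
partitions = build_partitions("transfer_funds", param_classes)

# Hypothetical intent-preserving mutations: rephrase a number, add a polite preamble.
mutators = [
    lambda task: task.replace("50", "fifty"),
    lambda task: "If it's not too much trouble, " + task[0].lower() + task[1:],
]


def toy_predictor(task: str) -> float:
    # Toy stand-in for a learned predictor: longer tasks are scored as riskier.
    return float(len(task.split()))


ranked = rank_mutants("Send 50 USD to Alice", mutators, toy_predictor)
```

Here `partitions` holds five categories (three for `amount`, two for `currency`), and `ranked[0]` is the mutant the toy predictor scores highest. In a real setup the top-ranked mutants would be sent to the agent under test first, so the query budget goes to the cases most likely to expose an error.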
