Large Language Models (LLMs) and Multi-Agent LLMs (MALLMs) introduce a degree of non-determinism not found in traditional or machine learning software, so verifying their correctness requires new approaches that go beyond simple output comparison or statistical accuracy over test datasets. This paper presents a taxonomy for LLM test case design, informed by the research literature and our own experience. The taxonomy defines four facets of LLM test case design, addressing ambiguity in both inputs and outputs while establishing best practices. It distinguishes variability in goals, in the system under test, and in inputs, and it introduces two key oracle types: atomic and aggregated. We exemplify each facet, conduct an LLM-assisted analysis of six open-source testing frameworks, perform a sensitivity study of an agent-based system across different model configurations, and provide working examples contrasting atomic and aggregated test cases. We identify key variation points that affect test correctness and highlight open challenges that the research, industry, and open-source communities must address as LLMs become integral to software systems. Our findings reveal that current tools treat test executions as isolated events, lack explicit aggregation mechanisms, and inadequately capture variability across model versions, configurations, and repeated runs. These gaps highlight the need to view correctness as a distribution of outcomes rather than a binary property, and they call for closer collaboration between academia and practitioners to establish mature, variability-aware testing methodologies.
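The contrast between atomic and aggregated oracles can be illustrated with a minimal sketch. It is not taken from the paper's own working examples: `stub_llm`, `atomic_oracle`, `aggregated_oracle`, `run_repeated`, the 80% success rate of the stub, and the pass-rate thresholds are all hypothetical names and values chosen for illustration. The sketch assumes that an atomic oracle judges a single execution, while an aggregated oracle judges a set of executions by their observed pass rate.

```python
import random
from typing import List


def stub_llm(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a non-deterministic LLM call."""
    # Assumption for illustration: the correct answer appears in roughly
    # 80% of calls, the wrong one otherwise.
    return "Paris" if rng.random() < 0.8 else "Lyon"


def atomic_oracle(output: str) -> bool:
    """Binary verdict on one execution."""
    return "Paris" in output


def aggregated_oracle(verdicts: List[bool], min_pass_rate: float) -> bool:
    """Verdict on many executions, based on the observed pass rate."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    pass_rate = sum(verdicts) / len(verdicts)
    return pass_rate >= min_pass_rate


def run_repeated(prompt: str, runs: int, seed: int) -> List[bool]:
    """Apply the atomic oracle to `runs` executions of the same prompt."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [atomic_oracle(stub_llm(prompt, rng)) for _ in range(runs)]


if __name__ == "__main__":
    verdicts = run_repeated("What is the capital of France?", runs=20, seed=0)
    print("single-run verdict:", verdicts[0])
    print("pass rate:", sum(verdicts) / len(verdicts))
    print("aggregated verdict:", aggregated_oracle(verdicts, min_pass_rate=0.7))
```

Under these assumptions, a single atomic verdict can pass or fail depending on which run happens to be observed, whereas the aggregated verdict depends on the distribution of outcomes across runs and on the chosen threshold.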