Challenges in Testing Large Language Model Based Software: A Faceted Taxonomy

Large Language Models (LLMs) and Multi-Agent LLMs (MALLMs) introduce a kind of non-determinism not found in traditional or machine learning software, so verifying correctness requires new approaches that go beyond simple output comparisons or statistical accuracy over test datasets. This paper presents a taxonomy for LLM test case design, informed by research literature and our experience. We illustrate each facet with examples, conduct an LLM-assisted analysis of six open-source testing frameworks, perform a sensitivity study of an agent-based system across different model configurations, and provide working examples contrasting atomic and aggregated test cases. We identify key variation points that impact test correctness and highlight open challenges that the research, industry, and open-source communities must address as LLMs become integral to software systems. Our taxonomy defines four facets of LLM test case design, addressing ambiguity in both inputs and outputs while establishing best practices. It distinguishes variability in goals, the system under test, and inputs, and introduces two key oracle types: atomic and aggregated. Our findings reveal that current tools treat test executions as isolated events, lack explicit aggregation mechanisms, and inadequately capture variability across model versions, configurations, and repeated runs. This highlights the need to view correctness as a distribution of outcomes rather than a binary property, which in turn requires closer collaboration between academia and practitioners to establish mature, variability-aware testing methodologies.
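The contrast between atomic and aggregated oracles can be made concrete with a small sketch. The following is an illustration under stated assumptions, not the paper's implementation: `fake_llm`, `atomic_oracle`, `aggregated_oracle`, the 80% success rate, and the pass-rate threshold are all hypothetical names and values chosen for this example. An atomic oracle judges the output of one execution; an aggregated oracle repeats the execution and judges the resulting distribution of outcomes.

```python
import random
from typing import Callable, Tuple


def fake_llm(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a non-deterministic LLM call.

    Assumption for illustration only: it answers correctly about 80% of the time.
    """
    return "4" if rng.random() < 0.8 else "five"


def atomic_oracle(output: str) -> bool:
    """Atomic oracle: a verdict on a single output from a single execution."""
    return output.strip() == "4"


def aggregated_oracle(
    run: Callable[[], str], n_runs: int, min_pass_rate: float
) -> Tuple[bool, float]:
    """Aggregated oracle: a verdict on the pass rate over repeated executions."""
    if n_runs <= 0:
        raise ValueError("n_runs must be positive")
    passes = sum(atomic_oracle(run()) for _ in range(n_runs))
    rate = passes / n_runs
    return rate >= min_pass_rate, rate


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the demonstration is repeatable

    # A single run may pass or fail depending on which sample is drawn.
    single = fake_llm("What is 2 + 2?", rng)
    print("atomic verdict:", atomic_oracle(single))

    # Repeated runs turn the verdict into a statement about a distribution.
    verdict, rate = aggregated_oracle(
        lambda: fake_llm("What is 2 + 2?", rng), n_runs=200, min_pass_rate=0.7
    )
    print(f"aggregated verdict: {verdict} (observed pass rate {rate:.2f})")
```

In this sketch the aggregated oracle reuses the atomic one as its per-run check, which is only one possible design. The threshold and the number of runs are themselves variation points that a test author would need to justify.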

