Large Language Model (LLM)-based applications are increasingly deployed in domains such as customer service, education, and mobility. However, these systems are prone to inaccurate, fictitious, or harmful responses, and their vast, high-dimensional input space makes systematic testing particularly challenging. To address this, we present STELLAR, an automated search-based testing framework for LLM-based applications that systematically uncovers text inputs leading to inappropriate system responses. Our framework models test generation as an optimization problem and discretizes the input space into stylistic, content-related, and perturbation features. Unlike prior work that focuses on prompt optimization or coverage heuristics, STELLAR employs evolutionary optimization to dynamically explore feature combinations that are more likely to expose failures. We evaluate STELLAR on three LLM-based conversational question-answering systems. The first focuses on safety, benchmarking both public and proprietary LLMs against malicious or unsafe prompts. The second and third target navigation, using an open-source and an industrial retrieval-augmented system for in-vehicle venue recommendations. Overall, STELLAR exposes up to 4.3 times (2.5 times on average) more failures than existing baseline approaches.
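To make the idea of evolutionary search over a discretized feature space concrete, the following is a minimal sketch and not STELLAR's actual implementation. The feature values, the function names, and in particular `toy_fitness` are hypothetical placeholders: in a real setup the fitness would render a feature combination into an utterance, send it to the system under test, and score how inappropriate the response is. The selection, crossover, and mutation operators shown here are generic choices assumed for illustration.

```python
import random

# Hypothetical feature values; the abstract names only the three feature groups.
FEATURES = {
    "style": ["formal", "slang", "terse"],
    "content": ["restaurant", "parking", "charging"],
    "perturbation": ["none", "typos", "filler_words"],
}
KEYS = list(FEATURES)


def random_individual(rng):
    """Sample one value per feature group."""
    return tuple(rng.choice(FEATURES[k]) for k in KEYS)


def crossover(a, b, rng):
    """Uniform crossover: each gene is taken from one of the two parents."""
    return tuple(rng.choice(pair) for pair in zip(a, b))


def mutate(ind, rng, rate=0.2):
    """Resample each gene from its feature group with probability `rate`."""
    return tuple(
        rng.choice(FEATURES[k]) if rng.random() < rate else v
        for k, v in zip(KEYS, ind)
    )


def toy_fitness(ind):
    """Placeholder score (assumption): stands in for querying the system
    under test and judging its response. Higher means 'more failure-prone'."""
    style, content, perturbation = ind
    return (style == "slang") + (content == "charging") + (perturbation == "typos")


def evolve(fitness, generations=20, pop_size=12, seed=0):
    """Elitist evolutionary loop: keep the fitter half, refill with offspring."""
    rng = random.Random(seed)
    pop = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        pop = parents + children
    return max(pop, key=fitness)


if __name__ == "__main__":
    best = evolve(toy_fitness)
    print(dict(zip(KEYS, best)), toy_fitness(best))
```

Because the fitter half of the population is carried over unchanged, the best score in this sketch never decreases between generations. A search that aims to expose many distinct failures rather than one optimum would likely also need some diversity mechanism, which is omitted here.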