Large Language Models (LLMs) are rapidly becoming ubiquitous, both as stand-alone tools and as components of current and future software systems. To enable the use of LLMs in the high-stakes or safety-critical systems of 2030, they need to undergo rigorous testing. Software Engineering (SE) research on testing Machine Learning (ML) components and ML-based systems has systematically explored many topics, such as test input generation and robustness. We believe that knowledge about tools, benchmarks, research, and practitioner views related to LLM testing needs to be similarly organized. To this end, we present a taxonomy of LLM testing topics and conduct preliminary studies of state-of-the-art and state-of-practice approaches to research, open-source tools, and benchmarks for LLM testing, mapping the results onto this taxonomy. Our goal is to identify gaps requiring more research and engineering effort, and to foster clearer communication between LLM practitioners and the SE research community.