Large Language Models (LLMs) are rapidly becoming ubiquitous, both as stand-alone tools and as components of current and future software systems. To enable the use of LLMs in the high-stakes or safety-critical systems of 2030, they need to undergo rigorous testing. Software Engineering (SE) research on testing Machine Learning (ML) components and ML-based systems has systematically explored many topics, such as test input generation and robustness. We believe that knowledge about tools, benchmarks, research, and practitioner views related to LLM testing needs to be similarly organized. To this end, we present a taxonomy of LLM testing topics and conduct preliminary studies of state-of-the-art and state-of-practice approaches to research, open-source tools, and benchmarks for LLM testing, mapping the results onto this taxonomy. Our goal is to identify gaps requiring more research and engineering effort, and to foster clearer communication between LLM practitioners and the SE research community.