Recent work in behavioral testing for natural language processing (NLP) models, such as CheckList, is inspired by related paradigms in software engineering testing. These tests allow evaluation of general linguistic capabilities and domain understanding, and hence can help assess conceptual soundness and identify model weaknesses. However, a major challenge is the creation of test cases. Current packages rely on a semi-automated approach involving manual development, which requires domain expertise and can be time consuming. This paper introduces an automated approach to developing test cases by exploiting the power of large language models and statistical techniques. It clusters text representations to construct meaningful groups and then applies prompting techniques to automatically generate Minimal Functionality Tests (MFTs). The well-known Amazon Reviews corpus is used to demonstrate our approach. We analyze the behavioral test profiles across four different classification algorithms and discuss the strengths and limitations of those models.
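To make the cluster-then-prompt pipeline concrete, the following is a minimal sketch, not the paper's exact implementation: it assumes sentence-transformers embeddings, K-means clustering, and an illustrative prompt template; the model name, cluster count, and prompt wording are assumptions for exposition only.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Toy stand-ins for Amazon review texts (illustrative only).
reviews = [
    "The battery died after two days, very disappointed.",
    "Great sound quality for the price, highly recommend.",
    "Shipping was slow and the box arrived damaged.",
    "Works exactly as described, easy to set up.",
]

# Step 1: embed the texts (assumed embedding model).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(reviews)

# Step 2: cluster the representations into candidate themes.
n_clusters = 2  # small for the toy data; a real corpus would use more
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)

# Step 3: build one prompt per cluster asking an LLM to generate MFT cases.
for c in range(n_clusters):
    seed_examples = [r for r, l in zip(reviews, labels) if l == c][:3]
    prompt = (
        "The following product reviews share a common theme:\n"
        + "\n".join(f"- {r}" for r in seed_examples)
        + "\nWrite 10 short, realistic review sentences on the same theme, "
        "each with a clearly positive sentiment, one per line."
    )
    # `prompt` would then be sent to an LLM of choice; its outputs, with the
    # expected label "positive", form a Minimal Functionality Test for that theme.
    print(prompt)
```

A usage note: each cluster yields one MFT whose expected label is fixed by the prompt, so the generated sentences can be fed directly to a classifier and scored by failure rate per capability.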