The development of modern NLP applications often relies on benchmark datasets containing many manually labeled tests to evaluate performance. Constructing such datasets is costly, and performance on held-out data may not properly reflect an application's capability in real-world scenarios, which can cause serious misunderstanding and monetary loss. To alleviate this problem, we propose an automated test generation method for detecting erroneous behaviors of various NLP applications. Our method is based on the sentence parsing process of classic linguistics, so it can assemble basic grammatical elements and adjuncts into a grammatically correct test with proper oracle information. We implement this method in NLPLego, which is designed to fully exploit the potential of seed sentences to automate test generation. NLPLego disassembles a seed sentence into a template and adjuncts, and then generates new sentences by assembling context-appropriate adjuncts with the template in a specific order. Unlike task-specific methods, the tests generated by NLPLego have derivation relations and different degrees of variation, which makes constructing appropriate metamorphic relations easier. NLPLego is therefore general: it can meet the testing requirements of various NLP applications. To validate NLPLego, we experiment with three common NLP tasks, identifying failures in four state-of-the-art models. Given seed tests from SQuAD 2.0, SST, and QQP, NLPLego detects 1,732, 5,301, and 261,879 incorrect behaviors, respectively, with around 95.7% precision across the three tasks.
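The disassemble-then-assemble idea described above can be illustrated with a toy sketch. This is not NLPLego's implementation: the function names `disassemble` and `assemble` are hypothetical, the adjunct spans are supplied by hand rather than derived from a parser, and every subset of adjuncts is tried instead of selecting context-appropriate ones. It only shows how derived sentences with increasing degrees of variation can be built from one seed.

```python
from itertools import combinations


def disassemble(tokens, adjunct_spans):
    """Split tokens into a template and a list of adjuncts.

    adjunct_spans is a hand-written list of (start, end) token ranges,
    end exclusive. This is an assumption of the sketch; a real system
    would obtain the spans from sentence parsing.
    """
    covered = {i for start, end in adjunct_spans for i in range(start, end)}
    # Keep original positions so word order can be restored later.
    template = [(i, tok) for i, tok in enumerate(tokens) if i not in covered]
    adjuncts = [
        [(i, tokens[i]) for i in range(start, end)]
        for start, end in adjunct_spans
    ]
    return template, adjuncts


def assemble(template, adjuncts):
    """Yield the template plus every subset of adjuncts, smallest subsets first."""
    for k in range(len(adjuncts) + 1):
        for chosen in combinations(adjuncts, k):
            pieces = list(template)
            for adjunct in chosen:
                pieces.extend(adjunct)
            pieces.sort(key=lambda piece: piece[0])  # restore seed word order
            yield " ".join(tok for _, tok in pieces)


seed = "The old dog barked loudly".split()
template, adjuncts = disassemble(seed, [(1, 2), (4, 5)])
derived = list(assemble(template, adjuncts))
for sentence in derived:
    print(sentence)
```

Because each derived sentence extends the bare template by a known set of adjuncts, a tester could compare a model's outputs across the derivation chain, which is the kind of metamorphic relation the abstract refers to; the concrete relation depends on the task and is not specified here.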