Behavioural testing validates system functionalities that are underrepresented in the standard evaluation setting (a held-out test set) through controlled input-output pairs. Optimising performance on the behavioural tests during training (behavioural learning) would improve coverage of phenomena not sufficiently represented in the i.i.d. data and could lead to seemingly more robust models. However, the model may narrowly capture spurious correlations from the behavioural test suite, leading to overestimation and misrepresentation of model performance, which is one of the original pitfalls of traditional evaluation. In this work, we introduce BeLUGA, an analysis method for evaluating behavioural learning that considers generalisation across dimensions of different granularity levels. We optimise behaviour-specific loss functions and evaluate models on several partitions of the behavioural test suite, each controlled to leave out specific phenomena. An aggregate score measures generalisation to unseen functionalities (or, conversely, overfitting to seen ones). We use BeLUGA to examine three representative NLP tasks (sentiment analysis, paraphrase identification and reading comprehension) and compare the impact of a diverse set of regularisation and domain generalisation methods on generalisation performance.
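The held-out evaluation described in the abstract can be illustrated with a minimal sketch. It is not the BeLUGA implementation: the test-case fields (`functionality`, `input`, `expected`), the `train_fn` callback, and the use of a plain arithmetic mean as the aggregate score are all assumptions made for illustration, since the abstract does not specify how partitions are built or how the aggregate is computed.

```python
from statistics import mean


def leave_one_functionality_out(cases):
    """Yield (held_out, train, test) splits, one per functionality.

    Each case is a dict with hypothetical keys:
    'functionality', 'input', 'expected'.
    """
    functionalities = sorted({c["functionality"] for c in cases})
    for held_out in functionalities:
        train = [c for c in cases if c["functionality"] != held_out]
        test = [c for c in cases if c["functionality"] == held_out]
        yield held_out, train, test


def pass_rate(predict, cases):
    """Fraction of behavioural test cases the predictor passes."""
    return mean([1.0 if predict(c["input"]) == c["expected"] else 0.0
                 for c in cases])


def aggregate_generalisation(cases, train_fn):
    """Train on the seen functionalities, test on the unseen one.

    Returns (aggregate, per_functionality_scores). The aggregate here is
    an arithmetic mean, chosen only as a placeholder assumption.
    """
    scores = {}
    for held_out, train, test in leave_one_functionality_out(cases):
        predict = train_fn(train)  # train_fn returns a callable predictor
        scores[held_out] = pass_rate(predict, test)
    return mean(list(scores.values())), scores
```

A model that only memorises the functionalities it was trained on would score well on seen cases but poorly under this protocol, which is the kind of overfitting the aggregate score is meant to expose.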