NLP models often degrade in performance when real-world data distributions differ markedly from training data. However, existing dataset drift metrics in NLP have generally not considered specific dimensions of linguistic drift that affect model performance, and they have not been validated in their ability to predict model performance at the individual example level, where such metrics are often used in practice. In this paper, we propose three dimensions of linguistic dataset drift: vocabulary, structural, and semantic drift. These dimensions correspond to content word frequency divergences, syntactic divergences, and meaning changes not captured by word frequencies (e.g., lexical semantic change). We propose interpretable metrics for all three drift dimensions, and we modify past performance prediction methods to predict model performance at both the example and dataset level for English sentiment classification and natural language inference. We find that our drift metrics are more effective than previous metrics at predicting out-of-domain model accuracies (a mean 16.8% decrease in root mean square error), particularly when compared to popular fine-tuned embedding distances (a mean 47.7% decrease in error). Fine-tuned embedding distances are much more effective at ranking individual examples by expected performance, but decomposing drift into vocabulary, structural, and semantic drift produces the best example rankings of all considered model-agnostic drift metrics (a mean 6.7% increase in ROC AUC).
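To make the notion of vocabulary drift as a word frequency divergence more concrete, the following is a minimal sketch, not the metric proposed in this paper. It assumes whitespace tokenization over all words (rather than content words only) and uses the Jensen-Shannon divergence as a stand-in divergence; the function names `unigram_distribution` and `js_divergence` are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def unigram_distribution(texts):
    """Relative unigram frequencies over lowercased, whitespace-split tokens.

    Assumption: a real drift metric would likely filter to content words
    and use a proper tokenizer; this sketch keeps every token.
    """
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two frequency dicts.

    The value lies in [0, 1]: 0 for identical distributions,
    1 for distributions with no shared vocabulary.
    """
    jsd = 0.0
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # mixture distribution
        if pw > 0:
            jsd += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            jsd += 0.5 * qw * math.log2(qw / mw)
    return jsd


# Illustrative usage with made-up sentences (not data from the paper).
train = ["the movie was great", "the plot was dull"]
evaluation = ["the battery life was great", "the screen was dull"]
drift = js_divergence(unigram_distribution(train),
                      unigram_distribution(evaluation))
```

A symmetric, bounded divergence is used here only because it is easy to interpret and stays finite when a word appears in one dataset but not the other; other divergences over word frequencies would need smoothing to handle unseen words.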