Pretrained natural language processing (NLP) models achieve high overall performance, but they still make systematic errors. As an alternative to manual error analysis, research on slice detection models (SDMs), which automatically identify underperforming groups of datapoints, has attracted growing attention in computer vision, both for understanding model behavior and for providing insights for future model training and design. However, little research on SDMs, and little quantitative evaluation of their effectiveness, has been conducted on NLP tasks. Our paper fills this gap by proposing a benchmark named "Discover, Explain, Improve (DEIM)" for NLP classification tasks, along with a new SDM, Edisa. Edisa discovers coherent and underperforming groups of datapoints; DEIM then unites them under human-understandable concepts and provides comprehensive evaluation tasks and corresponding quantitative metrics. The evaluation in DEIM shows that Edisa can accurately select error-prone datapoints with informative semantic features that summarize error patterns. Detecting difficult datapoints directly boosts model performance without tuning any original model parameters, showing that discovered slices are actionable for users.
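To make the notion of an "underperforming group of datapoints" concrete, the following is a minimal sketch and not Edisa itself. It assumes each datapoint has already been assigned to a candidate group (for example, by clustering model embeddings) and that a per-datapoint correctness flag is available. The function name `underperforming_slices` and the parameters `min_size` and `margin` are hypothetical choices for illustration only.

```python
from collections import defaultdict


def underperforming_slices(group_ids, correct, min_size=2, margin=0.1):
    """Rank candidate groups whose error rate exceeds the overall error rate.

    Illustrative only: group_ids are assumed to come from some upstream
    grouping step; min_size and margin are hypothetical thresholds.
    Returns a list of (group_id, error_rate, size), worst slice first.
    """
    if len(group_ids) != len(correct):
        raise ValueError("group_ids and correct must have the same length")
    if not group_ids:
        return []

    # Overall error rate of the model on all datapoints.
    overall = sum(1 for c in correct if not c) / len(correct)

    # Per-group counts stored as [errors, total].
    counts = defaultdict(lambda: [0, 0])
    for group, is_correct in zip(group_ids, correct):
        counts[group][1] += 1
        if not is_correct:
            counts[group][0] += 1

    slices = []
    for group, (errors, total) in counts.items():
        rate = errors / total
        # Keep groups that are large enough and clearly worse than average.
        if total >= min_size and rate >= overall + margin:
            slices.append((group, rate, total))

    # Highest error rate first; larger groups break ties.
    slices.sort(key=lambda s: (-s[1], -s[2]))
    return slices


if __name__ == "__main__":
    groups = ["a", "a", "a", "b", "b", "b", "c"]
    flags = [True, True, True, False, False, True, False]
    for group, rate, size in underperforming_slices(groups, flags):
        print(f"slice={group} error_rate={rate:.2f} size={size}")
```

In this toy input, group `c` is always wrong but contains a single datapoint, so the size threshold filters it out. A real SDM would also need to ensure the groups are semantically coherent, which this sketch does not attempt.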