Out-of-distribution (OOD) detection is essential for the reliable and safe deployment of machine learning systems in the real world, and the field has made substantial progress in recent years. This paper presents the first review of recent advances in OOD detection with a particular focus on natural language processing approaches. First, we provide a formal definition of OOD detection and discuss several related fields. Second, we categorize recent algorithms into three classes according to the data they use: (1) OOD data available, (2) OOD data unavailable but in-distribution (ID) labels available, and (3) neither OOD data nor ID labels available. Third, we introduce datasets, applications, and metrics. Finally, we summarize existing work and present potential future research topics.