A fundamental assumption of machine learning (ML) models is that the training and test data are sampled from the same distribution. In practice, however, this assumption is often violated, i.e.~the distribution of the test data shifts over time, which hinders the application of conventional ML models. One domain where distribution shift naturally occurs is text classification, since people continually find new topics to discuss. To this end, we survey research articles studying open-set text classification and related tasks. We divide the methods in this area based on the constraints that define the kind of distribution shift and the corresponding problem formulation, i.e.~learning with the Universum, zero-shot learning, and open-set learning. We then discuss the predominant mitigation approaches for each problem setup. Finally, we identify several directions for future work, aiming to push the boundaries beyond the current state of the art. Interestingly, we find that continual learning can address many of the issues caused by the shifting class distribution. We maintain a list of relevant papers at https://github.com/Eduard6421/Open-Set-Survey.