Open-set learning and discovery (OSLD) is a challenging machine learning task in which samples from new (unknown) classes can appear at test time. It can be seen as a generalization of zero-shot learning in which the new classes are not known a priori and must therefore be actively discovered. While zero-shot learning has been extensively studied in text classification, especially since the emergence of pre-trained language models, open-set learning and discovery is a comparatively new setup for the text domain. To this end, we introduce the first multilingual open-set learning and discovery (MOSLD) benchmark for text categorization by topic, comprising 960K data samples across 12 languages. To construct the benchmark, we (i) rearrange existing datasets and (ii) collect new data samples from the news domain. Moreover, we propose a novel framework for the OSLD task, which integrates multiple stages to continuously discover and learn new classes. We evaluate several language models, including our own, to obtain results that can serve as a reference for future work. We release our benchmark at https://github.com/Adriana19Valentina/MOSLD-Bench.
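To make the task setup concrete, the following is a minimal sketch of the two ingredients the task name implies: rejecting test samples that do not fit any known class, and grouping the rejected samples into candidate new classes. It is not the framework proposed in this work. Maximum-softmax thresholding and greedy cosine clustering are stand-in techniques chosen for brevity, and all names and values (`UNKNOWN`, `predict_open_set`, `discover_classes`, `threshold`, `min_similarity`) are hypothetical.

```python
import math

UNKNOWN = -1  # hypothetical label meaning "none of the known classes"


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_open_set(logits, threshold=0.7):
    """Return the arg-max known class, or UNKNOWN if confidence is below threshold.

    Assumption: a low maximum softmax probability signals an unknown class.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else UNKNOWN


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def discover_classes(embeddings, min_similarity=0.8):
    """Greedily cluster embeddings of rejected samples.

    Each sample joins the first cluster whose seed (its first member) is
    similar enough; otherwise it starts a new cluster. Returns one cluster
    index per input embedding.
    """
    seeds, assignments = [], []
    for emb in embeddings:
        for idx, seed in enumerate(seeds):
            if cosine(emb, seed) >= min_similarity:
                assignments.append(idx)
                break
        else:
            seeds.append(emb)
            assignments.append(len(seeds) - 1)
    return assignments
```

In a continual setting, each discovered cluster would then be treated as a new class and used to update the classifier. How that update is performed is left open here.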