In NLP, zero-shot classification (ZSC) is the task of assigning labels to textual data without any labeled examples for the target classes. A common method for ZSC is to fine-tune a language model on a Natural Language Inference (NLI) dataset and then use it to infer the entailment between the input document and the target labels. However, this approach faces certain challenges, particularly for languages with limited resources. In this paper, we propose an alternative solution that leverages dictionaries as a source of data for ZSC. We focus on Luxembourgish, a low-resource language spoken in Luxembourg, and construct two new topic relevance classification datasets based on a dictionary that provides various synonyms, word translations and example sentences. We evaluate the usability of our dataset and compare it with the NLI-based approach on two topic classification tasks in a zero-shot manner. Our results show that by using the dictionary-based dataset, the trained models outperform the ones following the NLI-based approach for ZSC. While we focus on a single low-resource language in this study, we believe that the efficacy of our approach can also transfer to other languages where such a dictionary is available.
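The dictionary-based dataset construction described above can be sketched in a few lines. This is a minimal illustrative sketch, assuming a simplified dictionary schema (`headword`, `examples`) and a naive negative-sampling scheme; the field names, the `build_pairs` helper, and the Luxembourgish example sentences are illustrative assumptions, not the paper's actual data format.

```python
# Hypothetical sketch: turn dictionary entries into (sentence, topic, label)
# topic-relevance pairs. Schema and helper names are illustrative assumptions.

def build_pairs(entries, all_topics):
    """For each example sentence, emit one relevant pair (its own
    headword topic, label 1) and one irrelevant pair per other
    topic (label 0)."""
    pairs = []
    for entry in entries:
        topic = entry["headword"]
        negatives = [t for t in all_topics if t != topic]
        for sent in entry["examples"]:
            pairs.append((sent, topic, 1))       # relevant pair
            for neg in negatives:
                pairs.append((sent, neg, 0))     # irrelevant pair
    return pairs

entries = [
    {"headword": "Sport", "examples": ["Hie spillt all Weekend Fussball."]},
    {"headword": "Wieder", "examples": ["Muer reent et zu Lëtzebuerg."]},
]
topics = [e["headword"] for e in entries]
dataset = build_pairs(entries, topics)
```

With two entries and one example sentence each, every sentence yields one relevant and one irrelevant pair, giving four training instances that a classifier can then consume in the same document–label format used at zero-shot inference time.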