Extreme classification (XC) involves predicting over large numbers of classes (thousands to millions), with real-world applications like news article classification and e-commerce product tagging. The zero-shot version of this task requires generalization to novel classes without additional supervision. In this paper, we develop SemSup-XC, a model that achieves state-of-the-art zero-shot and few-shot performance on three XC datasets derived from legal, e-commerce, and Wikipedia data. To develop SemSup-XC, we use automatically collected semantic class descriptions to represent classes and facilitate generalization through a novel hybrid matching module that matches input instances to class descriptions using a combination of semantic and lexical similarity. Trained with contrastive learning, SemSup-XC significantly outperforms baselines and establishes state-of-the-art performance on all three datasets considered, gaining up to 12 precision points on zero-shot and more than 10 precision points on one-shot tests, with similar gains for recall@10. Our ablation studies highlight the relative importance of our hybrid matching module and automatically collected class descriptions.
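To make the idea of hybrid matching concrete, the following is a minimal sketch of scoring an input instance against class descriptions with a weighted mix of a semantic signal and a lexical signal. It is not the SemSup-XC matching module, whose details the abstract does not give. Cosine similarity over precomputed vectors stands in for semantic similarity, and Jaccard token overlap stands in for lexical similarity. All names (`cosine`, `lexical_overlap`, `hybrid_score`, `rank_classes`), the weight `alpha`, and the toy data are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (semantic stand-in)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def lexical_overlap(tokens_a, tokens_b):
    """Jaccard overlap between two token lists (lexical stand-in)."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hybrid_score(inst_vec, inst_tokens, desc_vec, desc_tokens, alpha=0.5):
    """Weighted mix of the two signals; alpha is a hypothetical weight."""
    semantic = cosine(inst_vec, desc_vec)
    lexical = lexical_overlap(inst_tokens, desc_tokens)
    return alpha * semantic + (1.0 - alpha) * lexical


def rank_classes(inst_vec, inst_tokens, classes, alpha=0.5):
    """Rank classes by hybrid score, best first.

    `classes` maps a class name to (description_vector, description_tokens).
    """
    scored = [
        (name, hybrid_score(inst_vec, inst_tokens, vec, toks, alpha))
        for name, (vec, toks) in classes.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: vectors and tokens are made up, not produced by any real encoder.
toy_classes = {
    "shoes": ([1.0, 0.0], ["running", "shoes"]),
    "laptops": ([0.0, 1.0], ["laptop", "computer"]),
}
ranking = rank_classes([0.9, 0.1], ["red", "running", "shoes"], toy_classes)
print(ranking[0][0])
```

Because each class is represented only by its description, adding an unseen class in this sketch means adding one more entry to `toy_classes`, with no retraining. This is the property the zero-shot setting relies on.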