In extreme multi-label classification (XMC), existing classification-based models perform poorly on tail labels and often ignore semantic relations among labels, for example by treating "Wikipedia" and "Wiki" as independent labels. In this paper, we cast XMC as a generation task (XLGen), which lets us benefit from pre-trained text-to-text models. However, generating labels from an extremely large label space is challenging without constraints or guidance. We therefore propose to guide label generation with label cluster information, hierarchically generating lower-level labels. We also find that frequency-based label ordering and decoding ensemble methods are critical to the improvements in XLGen. XLGen with cluster guidance significantly outperforms the classification and generation baselines on tail labels and also generally improves overall performance on four popular XMC benchmarks. In human evaluation, we also find that XLGen generates unseen but plausible labels. Our code is available at https://github.com/alexa/xlgen-eacl-2023.
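To make cluster guidance and frequency-based label ordering concrete, the following is a minimal sketch of how a text-to-text target string might be assembled for one document. It is an illustration only, not the XLGen implementation: the function `build_target`, the `<c0>`-style cluster tokens, the separators, and the choice of descending frequency order are all assumptions, since the abstract does not specify the target format or the ordering direction.

```python
from collections import Counter


def build_target(labels, label_freq, label_to_cluster, sep=" | "):
    """Build a hypothetical generation target: cluster tokens, then labels.

    Labels are sorted by descending training frequency (an assumption),
    with ties broken alphabetically so the output is deterministic.
    """
    ordered = sorted(labels, key=lambda lab: (-label_freq[lab], lab))

    # Collect each label's cluster id once, in order of first appearance.
    clusters = []
    for lab in ordered:
        cluster_id = label_to_cluster[lab]
        if cluster_id not in clusters:
            clusters.append(cluster_id)

    # Higher-level cluster tokens come first, so the decoder would emit
    # them before the lower-level labels they are meant to guide.
    cluster_part = " ".join(f"<c{c}>" for c in clusters)
    return cluster_part + " : " + sep.join(ordered)


# Toy training label sets, used only to derive label frequencies.
train_labels = [
    ["wiki", "reference"],
    ["wiki", "encyclopedia"],
    ["wiki", "reference", "wikipedia"],
]
label_freq = Counter(lab for doc in train_labels for lab in doc)

# Hypothetical cluster assignment; a real one would come from clustering.
label_to_cluster = {"wiki": 0, "wikipedia": 0, "encyclopedia": 0, "reference": 1}

target = build_target(["wikipedia", "reference", "wiki"], label_freq, label_to_cluster)
print(target)
```

In this toy setup the tail label "wikipedia" shares a cluster with the frequent label "wiki", so the cluster token gives the decoder a coarse signal before it generates the rarer label.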