Multi-label text classification (MLC) is a challenging task in settings with large label sets, where label support follows a Zipfian distribution. In this paper, we address this problem through retrieval augmentation, aiming to improve the sample efficiency of classification models. Our approach closely follows the standard MLC architecture of a Transformer-based encoder paired with a set of classification heads. In our case, however, the input document representation is augmented through cross-attention to similar documents retrieved from the training set and represented in a task-specific manner. We evaluate this approach on four datasets from the legal and biomedical domains, all of which feature highly skewed label distributions. Our experiments show that retrieval augmentation substantially improves model performance on the long tail of infrequent labels, especially in lower-resource training scenarios and on more challenging long-document data.
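To make the data flow described above concrete, the following is a minimal sketch of retrieval-augmented multi-label classification. It is an illustration under simplifying assumptions, not the model evaluated in the paper: documents are plain fixed-size vectors rather than Transformer encoder outputs, retrieval uses cosine similarity, cross-attention is single-head with no learned projections, and the task-specific representation of retrieved documents is not modeled. All function names (`retrieve`, `cross_attend`, `classify`) and the toy vectors are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Vector, b: Vector) -> float:
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0


def retrieve(query: Vector, bank: List[Vector], k: int) -> List[int]:
    """Indices of the k training documents most similar to the query."""
    ranked = sorted(range(len(bank)),
                    key=lambda i: cosine(query, bank[i]),
                    reverse=True)
    return ranked[:k]


def cross_attend(query: Vector, neighbors: List[Vector]) -> Vector:
    """Scaled dot-product attention from the query over the neighbors,
    added back to the query as a residual (assumption: no learned
    query/key/value projections)."""
    scale = math.sqrt(len(query))
    scores = [dot(query, n) / scale for n in neighbors]
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * n[j] for w, n in zip(weights, neighbors))
               for j in range(len(query))]
    return [q + c for q, c in zip(query, context)]


def classify(rep: Vector, heads: List[Vector], biases: Vector) -> List[float]:
    """One independent sigmoid head per label (multi-label output)."""
    return [1.0 / (1.0 + math.exp(-(dot(w, rep) + b)))
            for w, b in zip(heads, biases)]


# Toy usage: three "training documents", one query, two labels.
bank = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query = [1.0, 0.2]
idx = retrieve(query, bank, k=2)
augmented = cross_attend(query, [bank[i] for i in idx])
probs = classify(augmented,
                 heads=[[1.0, 0.0], [0.0, 1.0]],
                 biases=[0.0, 0.0])
```

In this sketch each label is scored independently, so any subset of labels can be predicted by thresholding `probs`. The residual connection keeps the original document signal while letting retrieved neighbors shift the representation, which is one plausible way to read "augmented through cross-attention".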