Extreme Multi-label Completion for Semantic Document Labelling with Taxonomy-Aware Parallel Learning

In Extreme Multi-Label Completion (XMLCo), the objective is to predict the missing labels of a collection of documents. Together with XML Classification, XMLCo is arguably one of the most challenging document classification tasks, as the number of labels (at least tens of thousands) is generally very large compared to the number of labelled documents available in the training dataset. Such a task is often accompanied by a taxonomy that encodes the labels' organic relationships, and many methods have been proposed to leverage this hierarchy to improve the results of XMLCo algorithms. In this paper, we propose a new approach to this problem, TAMLEC (Taxonomy-Aware Multi-task Learning for Extreme multi-label Completion). TAMLEC divides the problem into several Taxonomy-Aware Tasks, i.e. subsets of labels adapted to the hierarchical paths of the taxonomy, and trains on these tasks using a dynamic Parallel Feature Sharing approach, where some parts of the model are shared between tasks while others are task-specific. Then, at inference time, TAMLEC uses the labels available in a document to infer the appropriate tasks and to predict missing labels. To achieve this, TAMLEC uses a modified transformer architecture that predicts ordered sequences of labels on a Weak-Semilattice structure that is naturally induced by the tasks. This approach yields multiple advantages. First, our experiments on real-world datasets show that TAMLEC outperforms state-of-the-art methods on various XMLCo problems. Second, TAMLEC is by construction particularly suited to few-shot XML tasks, where new tasks or labels are introduced with only a few examples, and extensive evaluations highlight its strong performance compared to existing methods.
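The task decomposition and the task inference step described above can be illustrated with a minimal sketch. It is not the TAMLEC implementation: the toy taxonomy, the function names, and above all the rule "one task per child of the root" are assumptions made for illustration, whereas the paper defines its Taxonomy-Aware Tasks from the hierarchical paths of the taxonomy. The sketch only shows how labels already attached to a document can select tasks, and how those tasks narrow the set of candidate missing labels that a per-task model would then score.

```python
from collections import defaultdict

# Hypothetical toy taxonomy, given as child -> parent (None marks the root).
TOY_TAXONOMY = {
    "root": None,
    "science": "root",
    "arts": "root",
    "physics": "science",
    "biology": "science",
    "optics": "physics",
    "music": "arts",
    "jazz": "music",
}


def path_to_root(label, parent):
    """Return the path from `label` up to the root, label first."""
    path = []
    while label is not None:
        path.append(label)
        label = parent[label]
    return path


def build_tasks(parent):
    """Group labels into tasks, one per child of the root (an assumption)."""
    tasks = defaultdict(set)
    for label in parent:
        path = path_to_root(label, parent)
        if len(path) >= 2:  # skip the root itself
            tasks[path[-2]].add(label)  # path[-2] is the top-level ancestor
    return dict(tasks)


def infer_tasks(known_labels, tasks):
    """Return the sorted names of tasks containing at least one known label."""
    known = set(known_labels)
    return sorted(name for name, labels in tasks.items() if labels & known)


def candidate_labels(known_labels, tasks):
    """Labels of the inferred tasks that are not already on the document."""
    known = set(known_labels)
    candidates = set()
    for name in infer_tasks(known, tasks):
        candidates |= tasks[name] - known
    return sorted(candidates)


if __name__ == "__main__":
    tasks = build_tasks(TOY_TAXONOMY)
    print(infer_tasks(["optics"], tasks))
    print(candidate_labels(["optics"], tasks))
```

In this toy setting, a document labelled only with "optics" is routed to the "science" task, so labels under "arts" are never considered. A real system would replace the set difference in `candidate_labels` with a learned model that ranks the candidates.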
