The eXtreme Multi-label Classification~(XMC) problem seeks to find relevant labels from an exceptionally large label space. Most existing XMC learners focus on extracting semantic features from the input query text. However, conventional XMC studies usually neglect the side information of instances and labels, which can be useful in many real-world applications such as recommendation systems and e-commerce product search. We propose Predicted Instance Neighborhood Aggregation (PINA), a data enhancement method for the general XMC problem that leverages beneficial side information. Unlike most existing XMC frameworks, which treat labels and input instances as featureless indicators and independent entries, PINA extracts information from the label metadata and the correlations among training instances. Extensive experiments demonstrate consistent gains for PINA over state-of-the-art methods on various XMC tasks: PINA improves accuracy over standard XR-Transformers on five public benchmark datasets. Moreover, PINA achieves a $\sim 5\%$ gain in accuracy on the largest dataset, LF-AmazonTitles-1.3M. Our implementation is publicly available.