Multi-label document classification is a traditional task in NLP. Unlike single-label classification, it allows each document to be assigned multiple classes. The problem matters in many domains, such as tagging scientific articles. Documents are often structured into several sections, such as the title and the abstract. Current approaches treat all sections equally for multi-label classification. We argue that this assumption is unrealistic and leads to sub-optimal results. Instead, we propose a new method called Learning Section Weights (LSW), which leverages the contribution of each distinct section to multi-label classification. Via multiple feed-forward layers, LSW learns to assign a weight to each section of a document and incorporates these weights into the prediction. We demonstrate our approach on scientific articles. Experimental results on public (arXiv) and private (Elsevier) datasets show that LSW outperforms state-of-the-art multi-label document classification methods. In particular, on the publicly available arXiv dataset, LSW improves macro-averaged F1-score by 1.3% and macro-averaged recall by 1.3%.
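The section-weighting idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only states that feed-forward layers produce per-section weights that are incorporated into the prediction. The softmax normalization, the weighted-sum pooling, the layer sizes, the random parameters, and all names (`feed_forward`, `predict`, `scorer`, `classifier`) are assumptions made for illustration.

```python
import math
import random

def feed_forward(x, w1, b1, w2, b2):
    """Two-layer scorer: ReLU hidden layer followed by a scalar output."""
    hidden = [max(0.0, sum(xi * wi for xi, wi in zip(x, row)) + b)
              for row, b in zip(w1, b1)]
    return sum(h * w for h, w in zip(hidden, w2)) + b2

def softmax(scores):
    """Normalize raw scores into weights that sum to 1 (an assumed choice)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(section_vecs, scorer, classifier):
    """Weight each section, pool the sections, and score every label independently."""
    weights = softmax([feed_forward(v, *scorer) for v in section_vecs])
    dim = len(section_vecs[0])
    # Document vector: weighted sum of section vectors (assumed pooling).
    doc = [sum(w * v[i] for w, v in zip(weights, section_vecs)) for i in range(dim)]
    w_out, b_out = classifier
    # One sigmoid per label, so several labels can be active at once.
    probs = [sigmoid(sum(d * w for d, w in zip(doc, row)) + b)
             for row, b in zip(w_out, b_out)]
    return weights, probs

# Hypothetical toy setup: 4-dim section embeddings, 3 hidden units, 3 labels.
rng = random.Random(0)
DIM, HIDDEN, LABELS = 4, 3, 3
rand_vec = lambda n: [rng.gauss(0.0, 1.0) for _ in range(n)]
scorer = ([rand_vec(DIM) for _ in range(HIDDEN)], rand_vec(HIDDEN), rand_vec(HIDDEN), 0.0)
classifier = ([rand_vec(DIM) for _ in range(LABELS)], rand_vec(LABELS))

title_vec, abstract_vec = rand_vec(DIM), rand_vec(DIM)  # stand-ins for encoder outputs
weights, probs = predict([title_vec, abstract_vec], scorer, classifier)
```

Here `weights` holds one value per section (title, abstract) and `probs` one probability per label. In a trained model the scorer and classifier parameters would be learned jointly, so the section weights reflect how useful each section is for the labels.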