Instead of relying on human-annotated training samples to build a classifier, weakly supervised scientific paper classification aims to classify papers using only category descriptions (e.g., category names, category-indicative keywords). Existing studies on weakly supervised paper classification have paid little attention to two challenges: (1) papers should be classified into not only coarse-grained research topics but also fine-grained themes, and potentially into multiple themes, given a large and fine-grained label space; and (2) full text should be utilized to complement the paper title and abstract for classification. Moreover, instead of viewing the entire paper as a long linear sequence, one should exploit structural information, such as citation links across papers and the hierarchy of sections and paragraphs within each paper. To tackle these challenges, we propose FUTEX, a framework that uses the cross-paper network structure and the in-paper hierarchy structure to classify full-text scientific papers under weak supervision. A network-aware contrastive fine-tuning module and a hierarchy-aware aggregation module are designed to leverage the two types of structural signals, respectively. Experiments on two benchmark datasets demonstrate that FUTEX significantly outperforms competitive baselines and is on par with fully supervised classifiers that use 1,000 to 60,000 ground-truth training samples.