Hierarchical Text Classification (HTC) is a natural language processing task whose objective is to classify text documents into a set of classes from a structured class hierarchy. Many HTC approaches have been proposed that attempt to leverage the class hierarchy information in various ways to improve classification performance. Machine learning-based classification approaches require large amounts of training data and are most commonly compared on three established benchmark datasets: Web of Science (WOS), Reuters Corpus Volume 1 Version 2 (RCV1-V2), and New York Times (NYT). However, apart from the well-documented RCV1-V2 dataset, these datasets are not accompanied by detailed descriptions of their construction methodology. In this paper, we introduce three new HTC benchmark datasets in the domain of research publications, comprising the titles and abstracts of papers from the Web of Science publication database. We first create two baseline datasets that use existing journal- and citation-based classification schemas. Because each of these two schemas has its own shortcomings, we propose an approach that combines their classifications to improve the reliability and robustness of the dataset. We evaluate the three datasets with a clustering-based analysis and show that our proposed approach yields a higher-quality dataset, in which documents belonging to the same class are semantically more similar than in the other two datasets. Finally, we report the classification performance of four state-of-the-art HTC approaches on the three new datasets to provide baselines for future studies on machine learning-based techniques for scientific publication classification.