Introducing Three New Benchmark Datasets for Hierarchical Text Classification

Hierarchical Text Classification (HTC) is a natural language processing task whose objective is to classify text documents into a set of classes from a structured class hierarchy. Many HTC approaches have been proposed that attempt to leverage the class hierarchy information in various ways to improve classification performance. Machine learning-based classification approaches require large amounts of training data and are most commonly compared on three established benchmark datasets: Web Of Science (WOS), Reuters Corpus Volume 1 Version 2 (RCV1-V2), and New York Times (NYT). However, apart from the well-documented RCV1-V2 dataset, these datasets are not accompanied by detailed descriptions of their methodologies. In this paper, we introduce three new HTC benchmark datasets in the domain of research publications, comprising the titles and abstracts of papers from the Web of Science publication database. We first create two baseline datasets that use existing journal- and citation-based classification schemas. Because each of these two schemas has its own shortcomings, we propose an approach that combines their classifications to improve the reliability and robustness of the dataset. We evaluate the three created datasets with a clustering-based analysis and show that our proposed approach results in a higher-quality dataset, in which documents that belong to the same class are semantically more similar than in the other two datasets. Finally, we report the classification performance of four state-of-the-art HTC approaches on these three new datasets to provide baselines for future studies on machine learning-based techniques for scientific publication classification.
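To make the idea of "documents in the same class being semantically more similar" concrete, the following is a minimal sketch of one way such a check could be computed. It is not the clustering-based analysis used in the paper, whose details the abstract does not give. The function names (`bow`, `cosine`, `class_cohesion`), the bag-of-words vectors, and the toy documents and two-level labels are all assumptions made for illustration; a real evaluation would more plausibly use learned document embeddings.

```python
import math
from collections import Counter
from itertools import combinations


def bow(text):
    """Bag-of-words term counts; a simple stand-in for a document embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def class_cohesion(docs, labels):
    """Return (mean same-class similarity, mean cross-class similarity)."""
    vecs = [bow(d) for d in docs]
    same, cross = [], []
    for i, j in combinations(range(len(docs)), 2):
        sim = cosine(vecs[i], vecs[j])
        (same if labels[i] == labels[j] else cross).append(sim)

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return mean(same), mean(cross)


# Hypothetical toy data: each label is a (parent, child) path in a
# made-up two-level hierarchy, not a class from the paper's datasets.
docs = [
    "graph neural network for text classification",
    "transformer network for text classification",
    "protein folding in cell biology",
    "cell membrane protein transport",
]
labels = [
    ("Computer Science", "NLP"),
    ("Computer Science", "NLP"),
    ("Life Sciences", "Biology"),
    ("Life Sciences", "Biology"),
]

same, cross = class_cohesion(docs, labels)
print(f"same-class: {same:.3f}  cross-class: {cross:.3f}")
```

Under this sketch, a labelling scheme whose same-class similarity exceeds its cross-class similarity by a wider margin would count as the more coherent one, which is the kind of comparison the abstract describes across the three datasets.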

