Hierarchical text classification (HTC) is essential for many real-world applications. However, HTC models are challenging to develop because they must often process a large volume of documents and labels organized in a hierarchical taxonomy. Recent deep learning-based HTC models have attempted to incorporate hierarchy information into the model structure. Because the model structure then depends on the hierarchy size, the number of model parameters grows with the hierarchy, which makes these models difficult to implement for large-scale hierarchies. To solve this problem, we formulate HTC as sub-hierarchy sequence generation, incorporating hierarchy information into the target label sequence instead of the model structure. We then propose the Hierarchy DECoder (HiDEC), which decodes a text sequence into a sub-hierarchy sequence using recursive hierarchy decoding, classifying all parents at the same level into their children at once. In addition, HiDEC is trained to use hierarchical path information from the root to each leaf of a sub-hierarchy composed of the labels of a target document, via an attention mechanism and hierarchy-aware masking. HiDEC achieved state-of-the-art performance with significantly fewer model parameters than existing models on benchmark datasets such as RCV1-v2, NYT, and EURLEX57K.
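To make the notion of a sub-hierarchy concrete, the following minimal sketch builds one from the labels of a single document and then lists, level by level, which children each parent is expanded into. It illustrates only the target structure, not the HiDEC model itself: the toy taxonomy, the `parent_of` mapping, and the helper names `build_sub_hierarchy` and `level_wise_expansion` are assumptions made for this example and do not come from the paper.

```python
from collections import defaultdict

ROOT = "root"  # hypothetical name for the taxonomy root


def build_sub_hierarchy(parent_of, target_labels):
    """Collect every label on a root-to-target path, grouped under its parent."""
    children = defaultdict(set)
    for label in target_labels:
        node = label
        while node != ROOT:
            parent = parent_of[node]
            children[parent].add(node)
            node = parent
    # Sort children so the result is deterministic.
    return {parent: sorted(kids) for parent, kids in children.items()}


def level_wise_expansion(children):
    """List, for each level, the parents at that level and their children."""
    steps = []
    frontier = [ROOT]
    while frontier:
        step = {parent: children.get(parent, []) for parent in frontier}
        next_frontier = [kid for parent in frontier for kid in step[parent]]
        if not next_frontier:
            break  # every node at this level is a leaf of the sub-hierarchy
        steps.append(step)
        frontier = next_frontier
    return steps


# Toy taxonomy (assumed for illustration): child label -> parent label.
parent_of = {
    "sports": ROOT,
    "soccer": "sports",
    "tennis": "sports",
    "economy": ROOT,
    "markets": "economy",
    "trade": "economy",
}

# Labels assigned to one hypothetical document.
sub_hierarchy = build_sub_hierarchy(parent_of, ["soccer", "markets"])
steps = level_wise_expansion(sub_hierarchy)
```

Each entry of `steps` corresponds to one level of the sub-hierarchy, in which all parents at that level are mapped to their children together. Labels that lie outside the document's root-to-leaf paths, such as `tennis` and `trade` here, never appear in the target.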