Text classification with hierarchical labels is a prevalent and challenging task in natural language processing. Examples include assigning ICD codes to patient records, classifying patents into IPC classes, and assigning EUROVOC descriptors to European legal texts. Despite these widespread applications, a comprehensive understanding of state-of-the-art methods across domains has been lacking. In this paper, we provide the first comprehensive cross-domain overview of state-of-the-art methods, supported by empirical analysis. We propose a unified framework that positions each method within a common structure to facilitate research. Our empirical analysis yields key insights and guidelines, confirming that designing effective methods requires learning across different research areas. Notably, under our unified evaluation pipeline, we achieve new state-of-the-art results by applying techniques beyond their original domains.