Multi-label classification of scientific text suffers from extreme class imbalance: specialized terminology follows a severe power-law distribution that challenges standard classification approaches. Existing scientific corpora lack comprehensive controlled vocabularies and focus instead on broad categories, which limits systematic study of extreme imbalance. We introduce AstroConcepts, a corpus of English abstracts from 21,702 published astrophysics papers, labeled with 2,367 concepts from the Unified Astronomy Thesaurus. The corpus exhibits severe label imbalance, with 76% of concepts having fewer than 50 training examples. By releasing this resource, we enable systematic study of extreme class imbalance in scientific domains and establish strong baselines across traditional, neural, and vocabulary-constrained LLM methods. Our evaluation yields three main findings. First, vocabulary-constrained LLMs achieve competitive performance relative to domain-adapted models in astrophysics classification, suggesting potential for parameter-efficient approaches. Second, domain adaptation yields relatively larger improvements for rare, specialized terminology, although absolute performance remains limited across all methods. Third, we propose frequency-stratified evaluation, which reveals performance patterns hidden by aggregate scores and makes robustness assessment central to scientific multi-label evaluation. These results offer actionable insights for scientific NLP and establish benchmarks for research on extreme imbalance.
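To make the idea of frequency-stratified evaluation concrete, the following is a minimal sketch, not the authors' implementation. It groups labels by their training frequency and reports macro-F1 per group instead of one aggregate score. The lower bin edge of 50 mirrors the "fewer than 50 training examples" threshold in the abstract; the upper edge of 500, the bin names, and all function names are assumptions chosen for illustration.

```python
from collections import Counter


def label_counts(train_labels):
    """Count how many training documents carry each label."""
    return Counter(label for labels in train_labels for label in set(labels))


def rare_label_share(train_labels, threshold=50):
    """Share of observed labels with fewer than `threshold` training examples.

    Labels that never occur in training are not counted here (assumption).
    """
    counts = label_counts(train_labels)
    return sum(1 for n in counts.values() if n < threshold) / len(counts)


def frequency_bins(train_labels, edges=(50, 500)):
    """Map each label to a hypothetical bin name by training frequency."""
    bins = {}
    for label, n in label_counts(train_labels).items():
        if n < edges[0]:
            bins[label] = "rare"
        elif n < edges[1]:
            bins[label] = "medium"
        else:
            bins[label] = "frequent"
    return bins


def per_label_f1(gold, pred):
    """F1 per label from parallel lists of gold and predicted label sets."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        g, p = set(g), set(p)
        for label in g & p:
            tp[label] += 1
        for label in p - g:
            fp[label] += 1
        for label in g - p:
            fn[label] += 1
    labels = set(tp) | set(fp) | set(fn)
    # Every label in `labels` has at least one tp, fp, or fn, so the
    # denominator is never zero.
    return {
        label: 2 * tp[label] / (2 * tp[label] + fp[label] + fn[label])
        for label in labels
    }


def stratified_macro_f1(train_labels, gold, pred, edges=(50, 500)):
    """Macro-F1 computed separately within each frequency bin."""
    bins = frequency_bins(train_labels, edges)
    grouped = {}
    for label, score in per_label_f1(gold, pred).items():
        # Labels unseen in training are treated as rare (assumption).
        grouped.setdefault(bins.get(label, "rare"), []).append(score)
    return {name: sum(scores) / len(scores) for name, scores in grouped.items()}
```

Calling `stratified_macro_f1(train_labels, gold, pred)` returns a dictionary keyed by bin name, so a model that scores well overall but poorly on the "rare" bin is immediately visible. The bin edges would need to be tuned to the corpus being evaluated.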