The classification of short texts is a common subtask in Information Retrieval (IR). Recent advances in graph machine learning have sparked interest in graph-based approaches for low-resource scenarios, where such methods show promise. However, existing methods face limitations: they do not account for different meanings of the same word, or they are constrained by their transductive design. We propose an approach that constructs text graphs entirely from tokens obtained through pre-trained language models (PLMs). By applying a PLM to tokenize and embed the texts when creating the graph nodes, our method captures contextual and semantic information, overcomes vocabulary constraints, and allows for context-dependent word meanings. Our approach also makes classification more efficient by reducing the number of trainable parameters compared to classical PLM fine-tuning, resulting in more robust training with few samples. Experimental results demonstrate that our method consistently achieves higher scores than, or performance on par with, existing methods, presenting an advancement in graph-based text classification techniques. To support reproducibility we make all implementations publicly available to the community\footnote{\url{https://github.com/doGregor/TokenGraph}}.
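To illustrate the core idea of token-based graph construction, the following is a minimal sketch. It assumes a pipeline the abstract only outlines: one node per token occurrence (so repeated words can carry distinct, context-dependent representations) and edges between tokens that co-occur within a sliding window. The `toy_tokenize` and `toy_embed` helpers are hypothetical stand-ins for a PLM tokenizer and its contextual embeddings, which are not specified here.

```python
import random

def toy_tokenize(text):
    # Stand-in for a PLM subword tokenizer (assumption, not the paper's pipeline).
    return text.lower().split()

def toy_embed(tokens, dim=8, seed=0):
    # Stand-in for contextual PLM embeddings: one vector per token occurrence,
    # so the same surface word can receive different representations.
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in tokens]

def build_token_graph(text, window=2):
    """Build a token-level text graph: nodes are token occurrences with
    embedding features; undirected edges connect tokens within `window`."""
    tokens = toy_tokenize(text)
    nodes = list(enumerate(tokens))   # (node_id, token)
    feats = toy_embed(tokens)         # node feature matrix, one row per node
    edges = set()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            edges.add((i, j))
            edges.add((j, i))         # add both directions: undirected graph
    return nodes, feats, sorted(edges)

nodes, feats, edges = build_token_graph("short texts need short graphs")
print(len(nodes), len(edges))  # 5 nodes, 14 directed edge entries
```

Because each occurrence of "short" becomes its own node, a downstream graph neural network can assign the two occurrences different representations, which is the property the abstract highlights over vocabulary-level word graphs.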