Text classification is a fundamental problem in information retrieval with many real-world applications, such as predicting the topics of online articles and the categories of e-commerce product descriptions. However, low-resource text classification, where few or no labeled samples are available, poses a serious challenge for supervised learning. Meanwhile, much text data is inherently grounded in a network structure, such as a hyperlink or citation network for online articles, or a user-item purchase network for e-commerce products. These graph structures capture rich semantic relationships that can potentially augment low-resource text classification. In this paper, we propose a novel model called Graph-Grounded Pre-training and Prompting (G2P2), which addresses low-resource text classification with a two-pronged approach. During pre-training, we propose three graph interaction-based contrastive strategies to jointly pre-train a graph-text model; during downstream classification, we explore prompting for the jointly pre-trained model to achieve low-resource classification. Extensive experiments on four real-world datasets demonstrate the strength of G2P2 in zero- and few-shot text classification tasks.
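To make the two prongs concrete, the following is a minimal sketch rather than the G2P2 implementation. It assumes a generic symmetric (CLIP-style) contrastive loss between text embeddings and the embeddings of their corresponding graph nodes, and it assumes zero-shot prediction by cosine similarity against embedded class prompts. The function names, the temperature value, and the use of precomputed embeddings are all illustrative assumptions; the three graph interaction-based strategies themselves are not reproduced here.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits):
    # Row-wise log-softmax, shifted by the row maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def symmetric_contrastive_loss(text_emb, node_emb, temperature=0.07):
    # Assumption: row i of text_emb and row i of node_emb form a positive pair;
    # every other row in the batch acts as a negative.
    t = l2_normalize(text_emb)
    g = l2_normalize(node_emb)
    logits = t @ g.T / temperature
    idx = np.arange(len(t))
    loss_text_to_node = -log_softmax(logits)[idx, idx].mean()
    loss_node_to_text = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_text_to_node + loss_node_to_text)

def zero_shot_predict(doc_emb, class_prompt_emb):
    # Assumption: each class is represented by one embedded prompt;
    # a document is assigned to the class whose prompt is most similar.
    sims = l2_normalize(doc_emb) @ l2_normalize(class_prompt_emb).T
    return sims.argmax(axis=1)
```

Under these assumptions, a batch whose text and node embeddings are correctly paired yields a lower loss than the same batch with the pairing shuffled, which is the signal a jointly pre-trained graph-text model would be optimized on.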