We study zero-shot text classification in NLP, with a focus on self-training. We propose a self-training strategy that trains on labels rather than text, which significantly reduces training time. Specifically, we use Wikipedia categories as the training set and use the pre-trained SBERT model to establish positive correlations between pairs of categories that belong to the same text, enabling associative training. For new test datasets, we improve on the original self-training approach by removing the need for training and test data from each target dataset. Instead, Wikipedia serves as a unified training dataset, which better approximates the zero-shot scenario. This allows rapid fine-tuning and inference across different datasets and greatly reduces the time required for self-training. Our experiments show that the method adapts the model to a target dataset within minutes. Compared with other BERT-based transformer models, our approach significantly reduces the amount of training data by training only on labels, not the actual text, and greatly improves training efficiency by using a unified training set. It also achieves state-of-the-art results on both the Yahoo Topic and AG News datasets.
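The two ideas in the abstract, building training pairs from labels alone and classifying unseen text against label embeddings, can be sketched as follows. This is a minimal illustration and not the authors' implementation. The function names are hypothetical, `toy_embed` is a bag-of-words stand-in for the SBERT encoder, and scoring by cosine similarity is an assumption that the abstract does not spell out. The fine-tuning step that would consume the positive pairs is not shown.

```python
import math
from collections import Counter
from itertools import combinations


def category_pairs(article_categories):
    """Build positive pairs from categories attached to the same text.

    Only labels are used; the article text itself is never read.
    """
    pairs = []
    for cats in article_categories:
        unique = sorted(set(cats))  # drop repeated categories, fix the order
        pairs.extend(combinations(unique, 2))
    return pairs


def toy_embed(text):
    """Hypothetical stand-in encoder: lowercase bag-of-words counts.

    A real setup would return an SBERT sentence embedding here.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def zero_shot_classify(text, label_descriptions, embed=toy_embed):
    """Return the label whose description is most similar to the text."""
    text_vec = embed(text)
    scores = {
        label: cosine(text_vec, embed(desc))
        for label, desc in label_descriptions.items()
    }
    return max(scores, key=scores.get), scores


# Illustrative data only; these categories and descriptions are made up.
articles = [
    ["Association football", "Sports leagues"],
    ["Stock markets", "Finance", "Stock markets"],
]
positive_pairs = category_pairs(articles)

labels = {
    "Sports": "sports football game team match",
    "Business": "business finance stock market company",
}
predicted, _ = zero_shot_classify("The team won the football match", labels)
print(positive_pairs)
print(predicted)
```

Passing the encoder as the `embed` argument keeps the classification logic independent of the model, so a fine-tuned SBERT encoder could replace `toy_embed` once its embeddings are wrapped in a compatible similarity function.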