Large language models (LLMs) increasingly serve as the backbone for classifying text that spans distinct domains and carries several labels (classes) simultaneously. Under domain shift, e.g., moving a movie-review classifier from IMDb to Rotten Tomatoes, adapting such an LLM-based multi-label classifier is challenging because the target domain has incomplete label sets and the training overhead is daunting. Existing domain adaptation methods address either multi-label image classifiers or binary text classifiers. In this paper, we design DALLMi, Domain Adaptation Large Language Model interpolator, a first-of-its-kind semi-supervised domain adaptation method for text models based on LLMs, specifically BERT. The core of DALLMi is a novel variation loss and MixUp regularization, which jointly leverage the limited positively labeled text and the large quantity of unlabeled text and, importantly, their interpolation from the BERT word embeddings. DALLMi also introduces a label-balanced sampling strategy to overcome the imbalance between labeled and unlabeled data. We evaluate DALLMi against partially-supervised and unsupervised approaches on three datasets under different scenarios of label availability for the target domain. Our results show that DALLMi achieves 19.9% and 52.2% higher mAP than unsupervised and partially-supervised approaches, respectively.
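
The abstract does not spell out how labeled and unlabeled embeddings are interpolated, so the following is only a minimal sketch of the standard MixUp convex combination, not DALLMi's actual implementation. The function names `mixup` and `sample_lambda` are hypothetical, short toy vectors stand in for BERT word embeddings, and drawing the mixing coefficient from a Beta(alpha, alpha) distribution follows the original MixUp formulation rather than anything stated here.

```python
import random
from typing import List, Optional, Sequence


def mixup(pos_emb: Sequence[float], unl_emb: Sequence[float], lam: float) -> List[float]:
    """Return the convex combination lam * pos_emb + (1 - lam) * unl_emb.

    pos_emb: embedding of a positively labeled sample (toy stand-in for BERT).
    unl_emb: embedding of an unlabeled sample of the same dimension.
    """
    if len(pos_emb) != len(unl_emb):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * p + (1.0 - lam) * u for p, u in zip(pos_emb, unl_emb)]


def sample_lambda(alpha: float = 1.0, rng: Optional[random.Random] = None) -> float:
    """Draw a mixing coefficient from Beta(alpha, alpha) (assumed, as in MixUp)."""
    rng = rng or random.Random()
    return rng.betavariate(alpha, alpha)


if __name__ == "__main__":
    positive = [1.0, 0.0]   # hypothetical labeled embedding
    unlabeled = [0.0, 1.0]  # hypothetical unlabeled embedding
    lam = sample_lambda(alpha=1.0, rng=random.Random(0))
    print(mixup(positive, unlabeled, lam))
```

In a real pipeline the same combination would be applied to embedding tensors in batches, and the interpolated points would feed the regularization term of the training loss.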