Language features on real-world social media evolve over time, which degrades the performance of text classifiers in dynamic environments. To address this challenge, we study temporal adaptation, where models trained on past data are tested on future data. Most prior work has focused on continued pretraining or knowledge updating, which may compromise performance on noisy social media data. To tackle this issue, we capture feature change by modeling latent topic evolution and propose a novel model, VIBE: Variational Information Bottleneck for Evolutions. Concretely, we first employ two Information Bottleneck (IB) regularizers to distinguish past topics from future topics. The distinguished topics then serve as adaptive features through multi-task training with timestamp and class label prediction. For adaptive learning, VIBE uses unlabeled data retrieved from online streams and created after the training data. Extensive experiments on Twitter data across three classification tasks show that our model, with only 3% of data, significantly outperforms previous state-of-the-art continued-pretraining methods.
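The temporal adaptation setup described above (train on past data, test on future data, adapt with unlabeled posts created after the training period) can be illustrated with a minimal sketch. This is not the VIBE implementation: the `Post` record, the `temporal_split` and `unlabeled_pool` helpers, the cutoff date, and the toy posts are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class Post:
    """Hypothetical record for one social media post."""
    text: str
    created: date
    label: Optional[int] = None  # None marks an unlabeled post


def temporal_split(posts: List[Post], cutoff: date) -> Tuple[List[Post], List[Post]]:
    """Split posts into (past, future) around an assumed cutoff date."""
    past = [p for p in posts if p.created < cutoff]
    future = [p for p in posts if p.created >= cutoff]
    return past, future


def unlabeled_pool(future: List[Post]) -> List[str]:
    """Keep only the texts of future posts, so labels cannot leak into adaptation."""
    return [p.text for p in future]


# Toy data, invented for this example only.
posts = [
    Post("old topic a", date(2019, 3, 1), label=0),
    Post("old topic b", date(2019, 9, 1), label=1),
    Post("new topic c", date(2020, 2, 1), label=1),
    Post("new topic d", date(2020, 6, 1), label=0),
]

train, future = temporal_split(posts, cutoff=date(2020, 1, 1))
pool = unlabeled_pool(future)
```

Here `train` would supply the labeled past data, `pool` stands in for the unlabeled posts available for adaptation, and the labels in `future` would be consulted only for evaluation.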