Language features evolve continually on real-world social media, causing text classification performance to deteriorate over time. To address this challenge, we study temporal adaptation, where models trained on past data are tested on future data. Most prior work has focused on continued pretraining or knowledge updating, approaches whose performance may be compromised on noisy social media data. To tackle this issue, we capture feature change by modeling latent topic evolution and propose a novel model, VIBE: Variational Information Bottleneck for Evolutions. Concretely, we first employ two Information Bottleneck (IB) regularizers to distinguish past topics from future topics. The distinguished topics then serve as adaptive features through multi-task training with timestamp and class label prediction. For adaptive learning, VIBE uses unlabeled data retrieved from online streams and created after the training data period. Extensive Twitter experiments on three classification tasks show that our model, with only 3% of data, significantly outperforms previous state-of-the-art continued-pretraining methods.
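The temporal adaptation setting described in the abstract (train on past data, test on future data) can be illustrated with a minimal sketch. This is not the VIBE model itself. The function name `temporal_split`, the `(timestamp, text, label)` tuple layout, the toy posts, and the cutoff date are all hypothetical and chosen only for illustration.

```python
from datetime import date


def temporal_split(examples, cutoff):
    """Split (timestamp, text, label) tuples into past and future sets.

    Hypothetical helper: examples dated before `cutoff` form the training
    set, and examples dated on or after `cutoff` form the test set.
    """
    past = [ex for ex in examples if ex[0] < cutoff]
    future = [ex for ex in examples if ex[0] >= cutoff]
    return past, future


# Toy data, invented purely for illustration.
examples = [
    (date(2019, 3, 1), "old slang post", 0),
    (date(2020, 6, 15), "mid-period post", 1),
    (date(2021, 1, 10), "new hashtag post", 1),
    (date(2021, 8, 2), "latest trend post", 0),
]

train, test = temporal_split(examples, cutoff=date(2021, 1, 1))
```

Under this split, every training example predates every test example, so a classifier is evaluated only on language it could not have seen during training.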