Transformer-based machine learning models have become an essential tool for many natural language processing (NLP) tasks since the architecture was introduced. A common objective in these projects is text classification. Classification models are often applied to a different topic and/or time period than the one they were trained on. In these situations, it is difficult to decide how long a classifier remains suitable and when it is worth re-training the model. This paper compares different approaches to fine-tuning a BERT model for a long-running classification task. We use data from different periods to fine-tune our original BERT model, and we also measure how much a second round of annotation improves classification quality. Our corpus contains over 8 million comments on COVID-19 vaccination in Hungary posted between September 2020 and December 2021. Our results show that the best solution is to use all available unlabeled comments to fine-tune a model. It is not advisable to focus only on comments containing words that our model has not encountered before; a more efficient solution is to randomly sample comments from the new period. Fine-tuning does not prevent the model from losing performance; it merely slows the decline. In a rapidly changing linguistic environment, it is not possible to maintain model performance without regularly annotating new text.
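The two comment-selection strategies contrasted in the abstract (keeping only comments with previously unseen words versus drawing a random sample from the new period) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the whitespace tokenizer, and the toy comments are all assumptions made for illustration, and a real pipeline would use the BERT model's own tokenizer and vocabulary.

```python
import random


def tokenize(text):
    # Hypothetical whitespace tokenizer (assumption); stands in for the model's tokenizer.
    return text.lower().split()


def select_unseen_word_comments(new_comments, known_vocab):
    """Keep only comments that contain at least one token absent from known_vocab."""
    return [
        comment
        for comment in new_comments
        if any(token not in known_vocab for token in tokenize(comment))
    ]


def select_random_comments(new_comments, k, seed=0):
    """Draw up to k comments uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(new_comments, min(k, len(new_comments)))


# Toy data (assumption): comments from the original period and from a later one.
old_comments = ["the vaccine is safe", "i will get the vaccine"]
new_comments = [
    "the booster is safe",
    "i will get the vaccine",
    "omicron changes everything",
    "the vaccine is safe",
]

# Vocabulary the original model is assumed to have seen.
known_vocab = {token for comment in old_comments for token in tokenize(comment)}

unseen = select_unseen_word_comments(new_comments, known_vocab)
sampled = select_random_comments(new_comments, k=2)
```

In this toy example, `unseen` holds only the two comments that introduce new words, while `sampled` also has a chance of including comments that reuse familiar vocabulary, so it reflects the overall distribution of the new period rather than only its novel terms.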