Predicting the publication period of text documents, such as news articles, is an important but understudied problem in natural language processing. Estimating when a news article was published is useful in contexts such as historical research, sentiment analysis, and media monitoring. In this work, we investigate how to predict the publication period of a news article from its textual content alone. To this end, we built a labeled dataset of over 350,000 news articles published by The New York Times over six decades. Our approach fine-tunes a pretrained BERT model for text classification, with each publication decade treated as a class. The fine-tuned model classifies news articles into their publication decades accurately and outperforms the baseline model on this relatively unexplored task of time prediction from text.