We explore the problem of predicting the publication period of a text document, such as a news article, from the text of that document alone. To do so, we built our own labeled dataset of over 350,000 news articles published by The New York Times over six decades. We first implement a simple Naive Bayes baseline model, which achieves surprisingly decent accuracy. For our main approach, we then fine-tune a pretrained BERT model for text classification. This model exceeds our expectations and classifies news articles into their respective publication decades with very strong accuracy. It outperforms the few models previously applied to the relatively unexplored task of predicting time from text.
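To make the baseline idea concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier that assigns a decade label to a piece of text. It is an illustration under stated assumptions, not the implementation used in this work: the class name `NaiveBayesDecadeClassifier`, the `tokenize` helper, the add-one smoothing, and the six toy sentences are all invented for this example and are not drawn from the New York Times dataset.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesDecadeClassifier:
    """Hypothetical multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, decades):
        self.doc_counts = Counter(decades)        # documents per decade
        self.word_counts = defaultdict(Counter)   # token counts per decade
        self.vocab = set()
        for text, decade in zip(texts, decades):
            tokens = tokenize(text)
            self.word_counts[decade].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(decades)
        return self

    def log_scores(self, text):
        """Return log P(decade) + sum of log P(token | decade) per decade."""
        # Tokens never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        scores = {}
        for decade, n_docs in self.doc_counts.items():
            counts = self.word_counts[decade]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / self.total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            scores[decade] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training data, invented purely for illustration.
train_texts = [
    "astronauts prepare for moon landing as space race continues",
    "television networks broadcast the moon mission live",
    "new personal computer with floppy disk drive goes on sale",
    "video cassette recorder sales rise as computer prices fall",
    "internet startups and websites attract online investors",
    "bloggers and websites report on the election online",
]
train_decades = ["1960s", "1960s", "1980s", "1980s", "2000s", "2000s"]

model = NaiveBayesDecadeClassifier().fit(train_texts, train_decades)
print(model.predict("a new website lets readers follow the election online"))
```

The sketch works in log space so that long documents do not underflow, and the smoothing keeps a single unseen word from zeroing out a decade. A real baseline would train on far more text per decade and would likely use a library implementation with a proper train/test split.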