The increasing availability of large-scale textual corpora has opened new possibilities for data-driven, quantitative approaches to historical analysis using Natural Language Processing (NLP). However, historically relevant diachronic corpora from the pre-digital era remain scarce and often incomplete. We present a quantitative approach to historical analysis based on the reconstruction and exploration of a diachronic corpus of around 600,000 articles from the Italian newspaper "La Repubblica", covering all articles published from 1 January 1985 to 31 December 2000, a period of major political, social, and geopolitical change in Italy and worldwide. Using NLP techniques, we analyze the text at both the lexical and the semantic level; we then apply tools from complex systems and statistical physics to trace shifts in media discourse over time. This allows us to detect key transition periods, such as the passage from the First Republic to the Second Republic in Italy, and major international conflicts, such as the Gulf War and the Kosovo War, without relying on prior labeling. The results show how combining computational linguistics with ideas from complex systems can offer new quantitative insight into historical change, opening new paths for studying the dynamics of media and society through large-scale textual data.