Extracting coherent and human-understandable themes from large collections of unstructured historical newspaper archives presents significant challenges due to topic evolution, Optical Character Recognition (OCR) noise, and the sheer volume of text. Traditional topic-modeling methods, such as Latent Dirichlet Allocation (LDA), often fall short in capturing the complexity and dynamic nature of discourse in historical texts. To address these limitations, we employ BERTopic, a neural topic-modeling approach that leverages transformer-based embeddings to extract and classify topics. Despite its growing popularity, BERTopic remains underused in historical research. Our study focuses on articles published between 1955 and 2018, specifically examining discourse on nuclear power and nuclear safety. We analyze topic distributions across the corpus and trace their temporal evolution to uncover long-term trends and shifts in public discourse. This allows us to explore patterns in that discourse more accurately, including the co-occurrence of themes related to nuclear power and nuclear weapons and the shifts in their importance over time. Our study demonstrates the scalability and contextual sensitivity of BERTopic as an alternative to traditional approaches, offering richer insights into historical discourses extracted from newspaper archives. These findings contribute to historical, nuclear, and social-science research; we also reflect on current limitations and propose potential directions for future work.