Approaches to Analysing Historical Newspapers Using LLMs

Filip Dobranić, Tina Munda, Oliver Pejić, Vojko Gorjanc, Uroš Šmajdek, David Bordon, Jakob Lenardič, Tjaša Konovšek, Kristina Pahor de Maiti Tekavčič, Ciril Bohak, Darja Fišer

This study presents a computational analysis of the Slovene historical newspapers \textit{Slovenec} and \textit{Slovenski narod} from the sPeriodika corpus, combining topic modelling, large language model (LLM)-based aspect-level sentiment analysis, entity-graph visualisation, and qualitative discourse analysis to examine how collective identities, political orientations, and national belonging were represented in public discourse at the turn of the twentieth century. Using BERTopic, we identify major thematic patterns and show both shared concerns and clear ideological differences between the two newspapers, reflecting their conservative-Catholic and liberal-progressive orientations. We further evaluate four instruction-following LLMs for targeted sentiment classification in OCR-degraded historical Slovene and select the Slovene-adapted GaMS3-12B-Instruct model as the most suitable for large-scale application, while also documenting important limitations, particularly its stronger performance on neutral sentiment than on positive or negative sentiment. Applied at dataset scale, the model reveals meaningful variation in the portrayal of collective identities, with some groups appearing predominantly in neutral descriptive contexts and others more often in evaluative or conflict-related discourse. We then build named-entity graphs to explore the relationships between collective identities and places, and analyse them with a mixed-methods approach that combines quantitative network analysis with critical discourse analysis. The investigation focuses on the emergence and development of intertwined historical political and socio-economic identities. Overall, the study demonstrates the value of combining scalable computational methods with critical interpretation to support digital humanities research on noisy historical newspaper data.
