This study presents a hybrid topic modelling framework for computational literary analysis that integrates Latent Dirichlet Allocation (LDA) with sparse Partial Least Squares Discriminant Analysis (sPLS-DA) to model thematic structure and longitudinal dynamics in narrative poetry. As a case study, we analyse Evgenij Onegin (Aleksandr S. Pushkin's novel in verse) in an Italian translation, testing whether unsupervised and supervised lexical structures converge in a small-corpus setting. The poetic text is segmented into thirty-five documents of lemmatised content words, from which five stable and interpretable topics emerge. To address small-corpus instability, a multi-seed consensus protocol is adopted. Using sPLS-DA as a supervised probe enhances interpretability by identifying lexical markers that refine each theme. Narrative hubs (groups of contiguous stanzas marking key episodes) extend the bag-of-words approach to the narrative level, revealing how thematic mixtures align with the poem's emotional and structural arc. Rather than replacing traditional literary interpretation, the proposed framework offers a computational form of close reading, illustrating how lightweight probabilistic models can yield reproducible thematic maps of complex poetic narratives, even when stylistic features such as metre, phonology, or native morphology are abstracted away. Despite relying on a single lemmatised translation, the approach provides a transparent methodological template applicable to other high-density literary texts in comparative studies.
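The abstract does not spell out how the multi-seed consensus protocol works, so the following is only a minimal sketch of one plausible variant, not the authors' procedure. It assumes each LDA run (one per random seed) has already been reduced to a list of top-word lists, aligns the topics of every run to the first run by maximising Jaccard overlap, and keeps the words that recur in a chosen share of runs. The function names, the `min_share` threshold, and the toy word lists are all hypothetical and are not results from the study.

```python
from itertools import permutations


def jaccard(a, b):
    """Jaccard similarity between two collections of words."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


def align_to_reference(reference, run):
    """Reorder the topics of `run` so that topic i best matches reference topic i.

    Exhaustive search over permutations; fine for a handful of topics
    (5 topics means 120 permutations), too slow for large topic counts.
    """
    k = len(reference)
    best = max(
        permutations(range(k)),
        key=lambda perm: sum(jaccard(reference[i], run[perm[i]]) for i in range(k)),
    )
    return [run[j] for j in best]


def consensus_topics(runs, min_share=0.6):
    """Keep, for each topic, the words present in at least `min_share` of the runs."""
    reference = runs[0]
    aligned = [reference] + [align_to_reference(reference, r) for r in runs[1:]]
    consensus = []
    for t in range(len(reference)):
        counts = {}
        for run in aligned:
            for word in set(run[t]):
                counts[word] = counts.get(word, 0) + 1
        kept = sorted(w for w, c in counts.items() if c / len(aligned) >= min_share)
        consensus.append(kept)
    return consensus


# Invented toy input: three seeds, two topics each, three top lemmas per topic.
toy_runs = [
    [["amore", "lettera", "cuore"], ["duello", "pistola", "neve"]],
    [["duello", "pistola", "sangue"], ["amore", "cuore", "sogno"]],
    [["amore", "lettera", "sogno"], ["duello", "neve", "pistola"]],
]

if __name__ == "__main__":
    for topic in consensus_topics(toy_runs):
        print(topic)
```

Note that the second toy run lists its topics in the opposite order; the alignment step is what makes word counts comparable across seeds, since LDA topic indices are arbitrary from one run to the next.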