From graphemic dependence to lexical structure: a Markovian perspective on Dante's Commedia

This study investigates the structural organisation of Dante's Divina Commedia through a symbolic representation based on vowel-consonant (V/C) encoding. Modelling the resulting sequence as a four-state Markov chain yields a parsimonious index of graphemic memory that captures the balance between persistence and alternation. Across the poem, this index increases slightly but consistently from the Inferno to the Paradiso, indicating a directional shift in local dependency structure. Trigram-level analysis shows that this trend is driven by a restricted set of recurrent configurations, interpreted here as graphemic probes that link the Markov representation to identifiable lexical environments in the text. The probes behave differently: configurations involving two transitions emerge more frequently across word boundaries, reflecting interactions between adjacent tokens, whereas configurations with fewer transitions are largely confined to word-internal structures. Part of the signal is also shaped by orthographic phenomena, particularly apostrophised forms, which highlights the role of writing conventions alongside phonological and lexical organisation. A complementary classification analysis identifies cantica-specific terms, providing lexical anchors through which the graphemic probes can be related to the structure of the poem. This organisation is reflected not only in the separation of the three cantiche but also in a continuous trajectory across the text. Overall, the results show that simple probabilistic models applied to symbolic text representations can uncover structured interactions between local dependencies, lexical distribution, orthographic encoding, and large-scale organisation, yielding an interpretable framework that links local symbolic dynamics to higher-level textual organisation.
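
As an illustration of the kind of pipeline the abstract describes, the following minimal Python sketch encodes text as a V/C sequence, estimates a four-state transition matrix, and counts V/C trigrams. Several choices here are assumptions, not details taken from the study: the four states are taken to be the overlapping symbol pairs VV, VC, CV and CC; accents are stripped before classification; only a, e, i, o, u count as vowels; and spaces, punctuation and apostrophes are dropped, so that trigrams may span word boundaries. The function names are hypothetical, and the study's memory index is not reproduced because the abstract does not define it.

```python
import unicodedata
from collections import Counter

VOWELS = set("aeiou")                 # assumption: accented vowels are folded onto these
STATES = ("VV", "VC", "CV", "CC")     # assumption: states are overlapping V/C pairs


def encode_vc(text):
    """Map each letter to 'V' or 'C'; drop every non-letter character."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    symbols = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue                  # skip the accent marks split off by NFD
        if ch.isalpha():
            symbols.append("V" if ch in VOWELS else "C")
    return "".join(symbols)


def transition_matrix(vc):
    """Estimate P(next pair | current pair) from overlapping pairs of symbols."""
    pairs = [vc[i:i + 2] for i in range(len(vc) - 1)]
    counts = Counter(zip(pairs, pairs[1:]))
    matrix = {}
    for s in STATES:
        total = sum(counts[(s, t)] for t in STATES)
        # Rows for unseen states are left as all zeros.
        matrix[s] = {t: (counts[(s, t)] / total if total else 0.0) for t in STATES}
    return matrix


def trigram_counts(vc):
    """Count every overlapping V/C trigram in the encoded sequence."""
    return Counter(vc[i:i + 3] for i in range(len(vc) - 2))


if __name__ == "__main__":
    sample = "Nel mezzo del cammin di nostra vita"
    sequence = encode_vc(sample)
    print(sequence)
    print(transition_matrix(sequence))
    print(trigram_counts(sequence).most_common(3))
```

Because consecutive pairs overlap by one symbol, each state can only be followed by two of the four states (for example, VC can lead to CV or CC), so half of the matrix entries are structurally zero under this reading. Running the same functions separately on each cantica would give per-cantica matrices that could then be compared.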
