The computational analysis of poetry is limited by the scarcity of tools to automatically analyze and scan poems. In multilingual settings, the problem is exacerbated because scansion and rhyme systems exist only for individual languages, making comparative studies very challenging and time-consuming. In this work, we present \textsc{Alberti}, the first multilingual pre-trained large language model for poetry. Through domain-specific pre-training (DSP), we further trained multilingual BERT on a corpus of over 12 million verses from 12 languages. We evaluated its performance on two structural poetry tasks: Spanish stanza type classification and metrical pattern prediction for Spanish, English, and German. In both cases, \textsc{Alberti} outperforms multilingual BERT and other transformer-based models of similar size, and even achieves state-of-the-art results for German when compared to rule-based systems, demonstrating the feasibility and effectiveness of DSP in the poetry domain.