Natural Language Generation (NLG), and generative AI more generally, are among the most impactful research fields today. Creative NLG, such as automatic poetry generation, is a fascinating niche in this area. Most previous research on evaluating automatic poetry generation has focused on forms of the Turing test, asking whether humans can distinguish automatically generated poetry from human-written poetry. We instead evaluate the diversity of automatically generated poetry (with a focus on quatrains) by comparing distributions of generated poetry to distributions of human poetry along structural, lexical, semantic and stylistic dimensions. We assess different model types (word-level vs. character-level, general-purpose LLMs vs. poetry-specific models), including the very recent LLaMA3-8B, and different types of fine-tuning (conditioned vs. unconditioned). We find that current automatic poetry systems are considerably underdiverse along multiple dimensions: they often do not rhyme sufficiently, are semantically too uniform, and do not even match the length distribution of human poetry. Our experiments reveal, however, that style-conditioning and character-level modeling clearly increase diversity across virtually all dimensions we explore. The limitations we identify may serve as the basis for more genuinely diverse future poetry generation models.
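To make the idea of comparing a generated sample against a human sample concrete, the following is a minimal sketch in Python. It is not the evaluation used in this work: the abstract does not specify the metrics, so the total variation distance over poem lengths and the distinct-n ratio are assumptions chosen for illustration, and the function names and toy poems are hypothetical.

```python
from collections import Counter


def length_histogram(poems):
    """Relative frequency of poem lengths, measured in whitespace tokens."""
    counts = Counter(len(poem.split()) for poem in poems)
    total = sum(counts.values())
    return {length: count / total for length, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    Returns 0.0 for identical distributions and 1.0 for disjoint ones.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)


def distinct_n(poems, n=2):
    """Share of unique n-grams among all n-grams in the sample.

    Used here as an illustrative proxy for lexical diversity (assumption).
    """
    ngrams = []
    for poem in poems:
        tokens = poem.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


# Toy samples, invented purely for illustration.
human = ["the rose is red", "a bird sings at dawn today", "night falls"]
generated = ["the rose is red", "the rose is red", "the rose is blue"]

length_gap = total_variation(length_histogram(human), length_histogram(generated))
human_diversity = distinct_n(human)
generated_diversity = distinct_n(generated)
```

In this toy setup the generated sample repeats both its length and most of its bigrams, so it shows a large length gap and a lower distinct-2 ratio than the human sample. A real study would use far larger samples and would add measures for rhyme, semantics and style.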