In recent years, generated content in music has gained significant popularity, with large language models effectively used to produce human-like lyrics in various styles, themes, and linguistic structures. This technological advance supports artists in their creative processes, but it also raises issues of authorship infringement, consumer satisfaction, and content spamming. Addressing these challenges requires methods for detecting generated lyrics. However, existing work on machine-generated content detection, in both methods and datasets, has not yet focused on this specific modality or on creative text in general. In response, we have curated the first dataset of high-quality synthetic lyrics and conducted a comprehensive quantitative evaluation of various few-shot content detection approaches, testing their generalization capabilities and complementing the results with a human evaluation. Our best few-shot detector, based on LLM2Vec, surpasses stylistic and statistical methods that have been shown to be competitive in other domains at distinguishing human-written from machine-generated content. It also generalizes well to new artists and models, and it effectively detects post-generation paraphrasing. This study emphasizes the need for further research on creative content detection, particularly in terms of generalization and scalability with larger song catalogs. All datasets, pre-processing scripts, and code are publicly available on GitHub and Hugging Face under the Apache 2.0 license.