Large Language Models (LLMs) have shown encouraging progress in multimodal understanding and generation tasks. However, how to design a human-aligned and interpretable melody composition system remains under-explored. To address this problem, we propose ByteComposer, an agent framework that emulates a human's creative pipeline in four separate steps: "Conception Analysis - Draft Composition - Self-Evaluation and Modification - Aesthetic Selection". The framework combines the interactive and knowledge-understanding capabilities of LLMs with existing symbolic music generation models, yielding a melody composition agent comparable to human creators. We conduct extensive experiments on GPT4 and several open-source large language models, which substantiate our framework's effectiveness. Furthermore, professional music composers were engaged in multi-dimensional evaluations; the final results demonstrate that, across various facets of music composition, the ByteComposer agent attains the level of a novice melody composer.