Large language models (LLMs) have shown encouraging progress in multimodal understanding and generation tasks. However, how to design a human-aligned and interpretable melody composition system remains under-explored. To address this problem, we propose ByteComposer, an agent framework that emulates a human's creative pipeline in four separate steps: "Conception Analysis - Draft Composition - Self-Evaluation and Modification - Aesthetic Selection". The framework combines the interactive and knowledge-understanding capabilities of LLMs with existing symbolic music generation models, yielding a melody composition agent comparable to human creators. We conduct extensive experiments on GPT-4 and several open-source large language models, and the results substantiate the framework's effectiveness. Furthermore, professional music composers carried out multi-dimensional evaluations; the final results show that, across various facets of music composition, the ByteComposer agent attains the level of a novice melody composer.