Amphion is an open-source toolkit for Audio, Music, and Speech Generation that aims to lower the barrier to entry for junior researchers and engineers in these fields. It presents a unified framework that covers diverse generation tasks and models and is easy to extend with new ones. The toolkit offers beginner-friendly workflows and pre-trained models, allowing both beginners and seasoned researchers to kick-start their projects with relative ease. Additionally, it provides interactive visualizations and demonstrations of classic models for educational purposes. The initial release, Amphion v0.1, supports a range of tasks including Text to Speech (TTS), Text to Audio (TTA), and Singing Voice Conversion (SVC), supplemented by essential components such as data preprocessing, state-of-the-art vocoders, and evaluation metrics. This paper presents a high-level overview of Amphion.