As the scale of generative models continues to grow, efficient reuse and adaptation of pre-trained models have become crucial considerations. In this work, we propose Voicebox Adapter, a novel approach that integrates fine-grained conditions into a pre-trained Voicebox speech generation model using a cross-attention module. To ensure that the newly added modules integrate smoothly with the pre-trained ones, we explore various efficient fine-tuning approaches. Our experiments show that the LoRA-with-bias-tuning configuration yields the best performance, enhancing controllability without compromising speech quality. Across three fine-grained conditional generation tasks, we demonstrate the effectiveness and resource efficiency of Voicebox Adapter. Follow-up experiments further highlight the robustness of Voicebox Adapter across diverse data setups.
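To make the adaptation recipe concrete, below is a minimal PyTorch sketch of the two ingredients named above: a cross-attention module that injects fine-grained conditions, and LoRA combined with bias-tuning on a frozen pre-trained projection. All module names, dimensions, and hyperparameters here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update (LoRA).

    With bias-tuning, the pre-trained weight W stays frozen while the
    bias b and the low-rank factors A, B are fine-tuned:
        y = x @ W.T + b + scale * (x @ A.T) @ B.T
    """

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # freeze pre-trained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(True)  # bias-tuning: bias stays trainable
        self.lora_A = nn.Linear(base.in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)       # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_B(self.lora_A(x))


class ConditionCrossAttention(nn.Module):
    """Cross-attention letting hidden states attend to fine-grained
    condition embeddings (e.g., frame-level prosody features)."""

    def __init__(self, dim: int, cond_dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            dim, num_heads, kdim=cond_dim, vdim=cond_dim, batch_first=True
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, hidden: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(self.norm(hidden), cond, cond)
        return hidden + out  # residual connection around the adapter


# Usage: wrap a frozen projection and add a condition branch.
hidden = torch.randn(2, 100, 256)  # (batch, frames, hidden dim)
cond = torch.randn(2, 100, 64)     # fine-grained condition embeddings
layer = LoRALinear(nn.Linear(256, 256), rank=8)
xattn = ConditionCrossAttention(dim=256, cond_dim=64)
out = xattn(layer(hidden), cond)   # (2, 100, 256)
```

In this setup, only the LoRA factors, biases, and the cross-attention adapter receive gradients, which is what makes the fine-tuning parameter-efficient relative to updating the full pre-trained model.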