Vibrato Expression Control for Singing Voice Conversion with Improving Independent Control

Singing style is a crucial aspect of a natural and expressive singing voice: singers use it to convey the feeling or emotion of a song. Several methods have been proposed to control singing style and produce more expressive singing voices. Recently, VibE-SVC successfully controlled vibrato by predicting the high-frequency F0 contour. In this paper, we introduce a singing voice conversion framework, VibE-SVC2, that improves singing style conversion performance and controllability. The model offers control over two types of singing style: pitch style and timbre style. For pitch style, we introduce a novel Energy Style Converter that handles the style information remaining in the energy contour, resolving the pitch-energy entanglement left unresolved in our previous work. In addition, we propose a Zero-shot Pitch Style Converter, which mimics the pitch style of a reference audio. To expand controllability, we propose vibrato rate scaling, which controls vibrato rate independently of vibrato extent and is unavailable in VibE-SVC. For timbre style, we extend the model to handle a variety of phonation styles. However, some styles, such as vocal fry, pose a challenge: conventional F0 extraction often fails on them because of their inherent subharmonic characteristics, which degrades conversion quality. To address this, we propose a novel Subharmonic Correction algorithm that refines the F0 contour for more natural timbre conversion. Through comprehensive objective and subjective evaluations, we demonstrate that VibE-SVC2 provides fine-grained, independent control over both types of singing style and outperforms existing methods.
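To make the idea of controlling vibrato through the high-frequency part of the F0 contour concrete, the following is a minimal illustrative sketch, not the method used in VibE-SVC or VibE-SVC2. It assumes that a centered moving average over log-F0 is an adequate stand-in for the low-frequency trend, and that scaling the residual changes the vibrato extent. The function names, the window length, and the synthetic test signal are all hypothetical choices made for illustration.

```python
import math


def moving_average(x, win):
    """Centered moving average with edge padding (assumes an odd window)."""
    half = win // 2
    padded = [x[0]] * half + list(x) + [x[-1]] * half
    return [sum(padded[i:i + win]) / win for i in range(len(x))]


def scale_vibrato_extent(f0_hz, win, scale):
    """Scale the fast-varying part of log-F0 by `scale`, keeping the slow trend.

    Assumption: the slow trend is approximated by a moving average; the
    residual is treated as the vibrato component.
    """
    log_f0 = [math.log2(f) for f in f0_hz]
    trend = moving_average(log_f0, win)
    return [2.0 ** (t + scale * (v - t)) for v, t in zip(log_f0, trend)]


# Synthetic example: a 220 Hz note with 5.5 Hz vibrato of +/- half a semitone,
# sampled at an assumed 100 frames per second.
frame_rate = 100
f0 = [
    220.0 * 2 ** (0.5 / 12 * math.sin(2 * math.pi * 5.5 * n / frame_rate))
    for n in range(200)
]

flat = scale_vibrato_extent(f0, win=19, scale=0.0)  # vibrato largely removed
same = scale_vibrato_extent(f0, win=19, scale=1.0)  # contour left unchanged
```

The window of 19 frames is chosen to be close to one vibrato period in this synthetic signal, so the moving average cancels most of the oscillation. Scaling the vibrato rate, as opposed to the extent, would require resampling the residual in time and is not covered by this sketch.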
