We present voice2mode, a method for classifying four singing phonation modes (breathy, neutral/modal, flow, and pressed) using embeddings extracted from large self-supervised speech models. Prior work on singing phonation has relied on handcrafted signal features or task-specific neural networks; this work evaluates how well speech foundation models transfer to singing phonation classification. voice2mode extracts layer-wise representations from HuBERT and two wav2vec2 variants, applies global temporal pooling, and classifies the pooled embeddings with lightweight classifiers (SVM, XGBoost). Experiments on a publicly available soprano dataset (763 sustained-vowel recordings, four labels) show that foundation-model features substantially outperform conventional spectral baselines (spectrogram, mel-spectrogram, MFCC). HuBERT embeddings from early layers yield the best result (~95.7% accuracy with an SVM), an absolute improvement of roughly 12-15 percentage points over the best traditional baseline. We also observe a consistent layer-wise trend: lower layers, which retain acoustic/phonetic detail, are more effective than top layers specialized for automatic speech recognition (ASR).
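The downstream stage of the pipeline described above (global temporal pooling of frame-level embeddings, then a lightweight SVM) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the real inputs would be frame-level activations from a chosen HuBERT or wav2vec2 layer (roughly T frames x 768 dimensions per clip); here synthetic class-dependent features stand in for them, and all shapes and hyperparameters are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical sizes: 200 clips, 50 frames per clip, 768-dim embeddings,
# 4 phonation-mode labels (breathy, neutral, flow, pressed).
n_clips, n_frames, dim, n_classes = 200, 50, 768, 4

# Synthetic stand-in for foundation-model features: each class gets a mean
# vector, and each clip is that mean plus per-frame noise, mimicking
# embeddings whose temporal average separates phonation modes.
labels = rng.integers(0, n_classes, n_clips)
class_means = rng.normal(0.0, 1.0, (n_classes, dim))
frames = class_means[labels][:, None, :] + rng.normal(0.0, 2.0, (n_clips, n_frames, dim))

# Global temporal pooling: average each clip's frames into one fixed-size vector.
pooled = frames.mean(axis=1)  # shape (n_clips, dim)

# Lightweight classifier on the pooled embeddings.
X_tr, X_te, y_tr, y_te = train_test_split(
    pooled, labels, test_size=0.25, random_state=0, stratify=labels
)
scaler = StandardScaler().fit(X_tr)
clf = SVC(kernel="rbf").fit(scaler.transform(X_tr), y_tr)
acc = clf.score(scaler.transform(X_te), y_te)
print(f"held-out accuracy: {acc:.3f}")
```

In practice the `frames` array would come from a forward pass of a pretrained model with hidden states exposed (e.g. selecting one early layer per the layer-wise finding), but the pooling-and-classify stage is unchanged.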