Instantaneous pitch estimation plays an important role in analyzing steep pitch variations, such as those arising in speech prosody and singing techniques. Conventional approaches estimate the instantaneous frequency after isolating the fundamental waveform from signals that contain harmonics and noise, which makes their accuracy sensitive to imperfect filtering of the fundamental. In this study, we formulate fundamental waveform filtering as a speech enhancement problem. Specifically, we train a Wave-U-Net model to extract the fundamental waveform from an input speech signal. The instantaneous pitch is then obtained by computing the instantaneous frequency from the analytic signal of the estimated fundamental waveform. Experimental results show that the proposed method outperforms conventional deterministic approaches and provides accurate and robust instantaneous pitch estimation across diverse domains, including speech, singing voice, musical instruments, and degraded speech signals.
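The final step described in the abstract, computing the instantaneous frequency from the analytic signal of a fundamental waveform, can be illustrated with a minimal sketch. It does not reproduce the Wave-U-Net stage: a synthetic sinusoid with a gliding frequency stands in for the estimated fundamental waveform, and the function names `analytic_signal` and `instantaneous_frequency`, the sample rate, and the test signal are all assumptions made for illustration, not details from the paper.

```python
import numpy as np


def analytic_signal(x):
    """Analytic signal via the FFT: keep DC, double positive bins, zero negative bins."""
    n = len(x)
    spectrum = np.fft.fft(x)
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0          # Nyquist bin is kept once
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights)


def instantaneous_frequency(fundamental, sample_rate):
    """Instantaneous frequency in Hz from the unwrapped phase of the analytic signal."""
    phase = np.unwrap(np.angle(analytic_signal(fundamental)))
    # Phase difference between neighbouring samples, converted from rad/sample to Hz.
    return np.diff(phase) * sample_rate / (2.0 * np.pi)


# Hypothetical stand-in for an estimated fundamental: a glide from 200 Hz to 300 Hz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate            # one second of samples
true_f0 = 200.0 + 100.0 * t
fundamental = np.sin(2.0 * np.pi * np.cumsum(true_f0) / sample_rate)

estimated_f0 = instantaneous_frequency(fundamental, sample_rate)
```

The estimate has one value per pair of neighbouring samples, which is what makes it "instantaneous" compared with frame-based pitch trackers. The sketch also hints at why the filtering stage matters: the phase derivative tracks the pitch only when the input is close to a single sinusoidal component, so residual harmonics or noise in the fundamental waveform would distort the result.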