We present a physics-informed voiced backend renderer for singing-voice synthesis. Given synthetic single-channel audio and a fundamental-frequency trajectory, we train a time-domain Webster model as a physics-informed neural network to estimate an interpretable vocal-tract area function and an open-end radiation coefficient. Training enforces partial-differential-equation and boundary consistency; a lightweight DDSP path is used only to stabilize learning, while inference is purely physics-based. On sustained vowels (/a/, /i/, /u/), parameters rendered by an independent finite-difference time-domain Webster solver reproduce spectral envelopes competitively with a compact DDSP baseline and remain stable under changes in discretization, moderate source variations, and pitch shifts of about ten percent. The in-graph waveform remains breathier than the reference, motivating periodicity-aware objectives and explicit glottal priors in future work.
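To make the rendering pipeline concrete, the following is a minimal sketch of a finite-difference time-domain solver for Webster's horn equation, the kind of independent solver the abstract says the estimated parameters are rendered with. This is an illustrative assumption, not the paper's implementation: the function name `webster_fdtd`, the Neumann-style driven glottal end, and the simple scaled reflection modeling the open-end radiation coefficient are all placeholders for whatever scheme the authors actually use.

```python
import numpy as np

def webster_fdtd(area, source, c=343.0, dx=0.01, fs=44100, r_open=0.9):
    """Illustrative FDTD sketch of Webster's horn equation
        d^2 p / dt^2 = (c^2 / A(x)) d/dx ( A(x) dp/dx ),
    with a driven glottal end and a crude open-end reflection scaled
    by r_open (a stand-in for the learned radiation coefficient)."""
    dt = 1.0 / fs
    courant = (c * dt / dx) ** 2
    assert c * dt / dx <= 1.0, "CFL stability condition violated"
    n = len(area)
    a_half = 0.5 * (area[:-1] + area[1:])   # areas at half-grid points
    p_prev = np.zeros(n)
    p = np.zeros(n)
    out = np.empty(len(source))
    for t, u in enumerate(source):
        p_next = np.empty(n)
        # interior update: leapfrog in time, centred area-weighted flux in space
        flux = a_half * np.diff(p)
        p_next[1:-1] = (2.0 * p[1:-1] - p_prev[1:-1]
                        + courant / area[1:-1] * np.diff(flux))
        p_next[0] = p_next[1] + u           # driven glottal end
        p_next[-1] = r_open * p_next[-2]    # lossy open-end reflection
        p_prev, p = p, p_next
        out[t] = p[-1]                      # radiated pressure at the lips
    return out
```

A uniform 17-segment tube of 17 cm with constant area approximates a neutral vocal tract; feeding it an impulse yields the tube's ringing response at the open end. The Courant check matters because an FDTD Webster scheme is only conditionally stable, which is one reason discretization robustness is worth reporting.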