We present an unsupervised approach to optimizing and evaluating the synthesis of non-speech audio effects from a speech production model. We use the Pink Trombone synthesizer as a case study of a simplified production model of the vocal tract, targeting a non-speech human audio signal: yawns. We selected and optimized the control parameters of the synthesizer to minimize the difference between real and generated audio. We validated the most common optimization techniques reported in the literature, along with a specifically designed neural network. We evaluated several popular quality metrics as error functions, including both objective quality metrics and subjective-equivalent metrics. We compared the results in terms of total error and computational demand. Results show that genetic and swarm optimizers outperform least-squares algorithms at the cost of slower execution, and that specific combinations of optimizers and audio representations offer significantly different results. The proposed methodology could be used to benchmark other physical models and audio types.
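The parameter-fitting loop described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `toy_synth` is a hypothetical two-parameter decaying sinusoid standing in for the vocal-tract synthesizer (it is not Pink Trombone), time-domain mean squared error stands in for the quality metrics, and a basic differential evolution routine stands in for the family of genetic optimizers. It is not the configuration used in this work.

```python
import math
import random

SR = 8000   # sample rate in Hz (assumed for this sketch)
N = 256     # samples per frame (assumed for this sketch)


def toy_synth(params, n=N, sr=SR):
    """Hypothetical stand-in synthesizer: a decaying sinusoid.

    params = (frequency in Hz, decay rate per second).
    """
    freq, decay = params
    return [
        math.exp(-decay * t / sr) * math.sin(2 * math.pi * freq * t / sr)
        for t in range(n)
    ]


def mse(a, b):
    """Time-domain mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def differential_evolution(loss, bounds, pop_size=20, generations=60,
                           f=0.7, cr=0.9, seed=0):
    """Basic rand/1/bin differential evolution with greedy selection.

    Returns (best_params, best_loss, history), where history holds the
    best loss after initialization and after each generation.
    """
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [loss(p) for p in pop]
    history = [min(scores)]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals other than the current one.
            others = [p for j, p in enumerate(pop) if j != i]
            a, b, c = rng.sample(others, 3)
            forced = rng.randrange(dim)  # at least one gene comes from the mutant
            trial = []
            for d, (lo, hi) in enumerate(bounds):
                if d == forced or rng.random() < cr:
                    value = a[d] + f * (b[d] - c[d])
                else:
                    value = pop[i][d]
                trial.append(min(max(value, lo), hi))  # clip to the bounds
            trial_score = loss(trial)
            if trial_score <= scores[i]:  # keep the trial only if it is no worse
                pop[i], scores[i] = trial, trial_score
        history.append(min(scores))
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best], history


if __name__ == "__main__":
    # The "real" audio is replaced here by a synthetic target with known parameters.
    target = toy_synth((440.0, 30.0))
    bounds = [(100.0, 1000.0), (0.0, 100.0)]
    params, err, _ = differential_evolution(
        lambda p: mse(toy_synth(p), target), bounds
    )
    print("estimated parameters:", params, "error:", err)
```

Swapping `mse` for another error function, or `differential_evolution` for a local least-squares routine, gives the kind of optimizer-by-representation comparison outlined above, with total error and run time as the two axes of comparison.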