Generating a stereophonic presentation from a monophonic audio signal is a challenging open task, especially if the goal is to obtain realistic spatial imaging with a specific panning of sound elements. In this work, we propose to convert mono to stereo by predicting parametric stereo (PS) parameters using both nearest neighbor and deep network approaches. In combination with PS, we also propose to model the task with generative approaches, which makes it possible to synthesize multiple, equally plausible stereo renditions from the same mono signal. To achieve this, we consider both autoregressive and masked token modelling approaches. We provide evidence that the proposed PS-based models outperform a competitive classical decorrelation baseline and that, within a PS prediction framework, modern generative models outshine equivalent non-generative counterparts. Overall, our work positions both PS and generative modelling as strong and appealing methodologies for mono-to-stereo upmixing. A discussion of the limitations of these approaches is also provided.