Audio source separation is often achieved by estimating the magnitude spectrogram of each source and then applying a phase recovery (or spectrogram inversion) algorithm to retrieve the time-domain signals. Typically, spectrogram inversion is treated as an optimization problem involving one or several terms that promote estimates complying with a consistency property, a mixing constraint, and/or a target magnitude objective. Nonetheless, it is still unclear which set of constraints and which problem formulation are the most appropriate in practice. In this paper, we design a general framework for deriving spectrogram inversion algorithms, which is based on formulating optimization problems that combine these objectives either as soft penalties or as hard constraints. We solve these problems by means of algorithms that perform alternating projections onto the subsets corresponding to each objective/constraint. Our framework encompasses existing techniques from the literature as well as novel algorithms. We investigate the potential of these approaches for a speech enhancement task. In particular, one of our novel algorithms outperforms other approaches in a realistic setting where the magnitudes are estimated beforehand using a neural network.
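As an illustration of the alternating-projection idea, the following minimal sketch alternates between two of the three sets mentioned above: the mixing constraint (the source coefficients sum to the mixture) and the target magnitudes. It is not one of the algorithms derived in the paper. The function names (`project_magnitude`, `project_mixing`, `mixing_error`, `alternate_projections`) are hypothetical, the spectrograms are assumed to be flattened into plain lists of complex coefficients, and the consistency projection (an inverse STFT followed by an STFT) is deliberately omitted to keep the example dependency-free.

```python
import cmath


def project_magnitude(coeffs, target_mags):
    """Nearest point with the target magnitudes: keep each phase, replace the modulus."""
    return [cmath.rect(m, cmath.phase(c)) for c, m in zip(coeffs, target_mags)]


def project_mixing(sources, mixture):
    """Orthogonal projection onto {sources that sum to the mixture}.

    The mixing error in each bin is spread equally over all sources.
    """
    n_src = len(sources)
    out = [list(s) for s in sources]
    for k, x in enumerate(mixture):
        err = (x - sum(s[k] for s in sources)) / n_src
        for j in range(n_src):
            out[j][k] += err
    return out


def mixing_error(sources, mixture):
    """Sum over bins of |mixture - sum of sources|."""
    return sum(abs(x - sum(s[k] for s in sources)) for k, x in enumerate(mixture))


def alternate_projections(mixture, target_mags, n_iter=50):
    """Alternate the mixing and magnitude projections (assumed, simplified scheme)."""
    # Initialization (an assumption): give every source the mixture's phase.
    sources = [project_magnitude(mixture, mags) for mags in target_mags]
    for _ in range(n_iter):
        sources = project_mixing(sources, mixture)
        # Ending on the magnitude projection treats the magnitudes as a hard constraint.
        sources = [project_magnitude(s, mags) for s, mags in zip(sources, target_mags)]
    return sources


if __name__ == "__main__":
    mixture = [1 + 1j, 2 - 1j]              # toy mixture coefficients
    target_mags = [[1.0, 1.5], [0.8, 1.2]]  # toy magnitudes for two sources
    estimates = alternate_projections(mixture, target_mags, n_iter=20)
    print(mixing_error(estimates, mixture))
```

In this simplified form each time-frequency bin is processed independently. It is the consistency projection, omitted here, that couples the bins; the order of the projections, and whether a given set is enforced as a hard constraint or only as a soft penalty, are exactly the design choices the framework is meant to explore.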