Audio source separation is often achieved by estimating the magnitude spectrogram of each source and then applying a phase recovery (or spectrogram inversion) algorithm to retrieve time-domain signals. Typically, spectrogram inversion is treated as an optimization problem with one or several terms that promote estimates complying with a consistency property, a mixing constraint, and/or a target magnitude objective. Nonetheless, it is still unclear which set of constraints and which problem formulation are the most appropriate in practice. In this paper, we design a general framework for deriving spectrogram inversion algorithms, which is based on formulating optimization problems that combine these objectives either as soft penalties or as hard constraints. We solve these problems by means of algorithms that perform alternating projections onto the subsets corresponding to each objective/constraint. Our framework encompasses existing techniques from the literature as well as novel algorithms. We investigate the potential of these approaches for a speech enhancement task. In particular, one of our novel algorithms outperforms the other approaches in a realistic setting where the magnitudes are estimated beforehand using a neural network.
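To make the three objectives concrete, the block below sketches how they are commonly written as sets with closed-form projections. The notation is an assumption introduced here for illustration and is not taken from the abstract: X is the STFT of the mixture, S_j the complex STFT estimate of source j (j = 1, ..., J), V_j its target magnitude, \mathcal{A} the STFT operator, and \mathcal{A}^{\dagger} an inverse STFT chosen so that \mathcal{A}\mathcal{A}^{\dagger} is the orthogonal projector onto the range of \mathcal{A}. The last line shows one possible way of chaining the projections, similar in spirit to the classical multiple input spectrogram inversion (MISI) iteration; it is not necessarily one of the algorithms proposed in the paper.

```latex
% Illustrative notation (assumed, not from the abstract):
%   X: mixture STFT,  S_j: estimate of source j,  V_j: target magnitude,
%   \mathcal{A}: STFT,  \mathcal{A}^{\dagger}: inverse STFT,  \odot: entrywise product.
\begin{align*}
  % Consistency: S_j is the STFT of some time-domain signal.
  \mathcal{C}_{\mathrm{cons}} &= \{ S_j : S_j = \mathcal{A}\mathcal{A}^{\dagger} S_j \},
  & P_{\mathrm{cons}}(S_j) &= \mathcal{A}\mathcal{A}^{\dagger} S_j, \\
  % Mixing: the source estimates add up to the mixture.
  \mathcal{C}_{\mathrm{mix}} &= \Big\{ (S_1,\dots,S_J) : \sum_{k=1}^{J} S_k = X \Big\},
  & P_{\mathrm{mix}}(S)_j &= S_j + \frac{1}{J}\Big( X - \sum_{k=1}^{J} S_k \Big), \\
  % Magnitude: each estimate has the target magnitude (phase is kept).
  \mathcal{C}_{\mathrm{mag}} &= \{ S_j : |S_j| = V_j \},
  & P_{\mathrm{mag}}(S_j) &= V_j \odot \frac{S_j}{|S_j|}.
\end{align*}
% One possible alternating-projection iteration (MISI-like), at step t:
\begin{equation*}
  S_j^{(t+1)} = P_{\mathrm{mag}}\Big( P_{\mathrm{cons}}\big( P_{\mathrm{mix}}(S^{(t)})_j \big) \Big),
  \qquad j = 1,\dots,J.
\end{equation*}
```

Treating an objective as a hard constraint corresponds to projecting onto its set at every iteration, as above. Treating it as a soft penalty instead means adding a weighted distance to that set to the cost function, so that the iterate is only pulled toward the set rather than forced onto it.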