We propose an optimization-based method for reconstructing a time-domain signal from a low-dimensional spectral representation such as a mel-spectrogram. Phase reconstruction has been studied as a means of recovering a time-domain signal from the full-band short-time Fourier transform (STFT) magnitude. The Griffin-Lim algorithm (GLA) is widely used for this purpose because it relies only on the redundancy of the STFT and is applicable to a wide variety of audio signals. In this paper, we jointly reconstruct the full-band magnitude and phase by considering the bi-level relationships among the time-domain signal, its STFT coefficients, and its mel-spectrogram. The proposed method is formulated as a rigorous optimization problem and estimates the full-band magnitude based on the criterion used in GLA. Our experiments demonstrate the effectiveness of the proposed method on speech, music, and environmental signals.
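For orientation, the baseline GLA that the abstract refers to can be sketched as alternating projections between the set of consistent spectrograms and the set of spectrograms with the given magnitude. The sketch below is an illustration only, not the proposed method: it assumes the full-band STFT magnitude is already available and does not perform the joint estimation from a mel-spectrogram. The window length, overlap, iteration count, and the helper names `griffin_lim` and `magnitude_error` are assumptions introduced here, and SciPy's `stft`/`istft` are used as the transform pair.

```python
import numpy as np
from scipy.signal import stft, istft

# Assumed analysis parameters (not taken from the paper).
NPERSEG = 256
NOVERLAP = 192  # hop of 64 samples, i.e. 75% overlap


def griffin_lim(magnitude, n_iter=50, seed=0):
    """Estimate a time-domain signal whose STFT magnitude approximates `magnitude`."""
    rng = np.random.default_rng(seed)
    # Start from the target magnitude with uniformly random phase.
    spec = magnitude * np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        # Projection onto consistent spectrograms: inverse STFT, then STFT again.
        _, x = istft(spec, nperseg=NPERSEG, noverlap=NOVERLAP)
        _, _, consistent = stft(x, nperseg=NPERSEG, noverlap=NOVERLAP)
        # Projection onto the magnitude constraint: keep the phase,
        # restore the target magnitude.
        spec = magnitude * np.exp(1j * np.angle(consistent))
    _, x = istft(spec, nperseg=NPERSEG, noverlap=NOVERLAP)
    return x


def magnitude_error(x, magnitude):
    """Relative Frobenius error between |STFT(x)| and the target magnitude."""
    _, _, spec = stft(x, nperseg=NPERSEG, noverlap=NOVERLAP)
    return np.linalg.norm(np.abs(spec) - magnitude) / np.linalg.norm(magnitude)
```

The sketch relies only on the redundancy of the STFT: a spectrogram with arbitrary phase is generally not the STFT of any signal, and each iteration moves the estimate toward one that is. It assumes the target magnitude was computed with the same parameters from a signal whose length is a multiple of the hop size, so that the frame count stays fixed across iterations.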