Recent advances in deep spatial filtering for Ambisonics demonstrate strong performance in stationary multi-speaker scenarios by rotating the sound field toward a target speaker prior to multi-channel enhancement. To make this approach applicable in dynamic acoustic conditions with moving speakers, we propose to automate this rotary steering using an interleaved tracking algorithm conditioned on the target's initial direction. However, for nearby or crossing speakers, robust tracking becomes difficult and spatial cues become less effective for enhancement. By incorporating the processed recording as an additional guide into both algorithms, our novel joint autoregressive framework leverages temporal-spectral correlations of speech to resolve spatially challenging speaker constellations. Consequently, our proposed method significantly improves tracking and enhancement of closely spaced speakers, consistently outperforming comparable non-autoregressive methods on a synthetic dataset. Real-world recordings complement these findings in complex scenarios with multiple speaker crossings and varying speaker-to-array distances.
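To illustrate the rotary steering mentioned in the abstract, the following is a minimal sketch of rotating a first-order Ambisonics signal about the vertical axis so that a tracked target direction ends up at the front. It rests on several assumptions that the abstract does not confirm: first-order signals only, ACN channel order (W, Y, Z, X) with SN3D normalization, azimuth-only rotation, and a per-sample azimuth trajectory. The function names `encode_plane_wave` and `rotate_foa_to_front` are hypothetical and do not come from the paper, whose actual rotation and tracking components may differ.

```python
import numpy as np


def encode_plane_wave(signal, azimuth):
    """Encode a horizontal plane wave into first-order Ambisonics.

    Assumes ACN order (W, Y, Z, X) and SN3D normalization.
    `azimuth` is in radians and may be a scalar or a per-sample array.
    """
    signal = np.asarray(signal, dtype=float)
    azimuth = np.broadcast_to(np.asarray(azimuth, dtype=float), signal.shape)
    w = signal
    y = signal * np.sin(azimuth)
    z = np.zeros_like(signal)  # elevation is zero in this sketch
    x = signal * np.cos(azimuth)
    return np.stack([w, y, z, x])


def rotate_foa_to_front(foa, target_azimuth):
    """Rotate the sound field about the z-axis by -target_azimuth.

    A source located at `target_azimuth` is moved to azimuth 0 (front).
    `foa` has shape (4, n_samples); `target_azimuth` is a scalar or an
    array of shape (n_samples,), e.g. the output of a tracker.
    """
    w, y, z, x = foa
    c, s = np.cos(target_azimuth), np.sin(target_azimuth)
    x_rot = c * x + s * y   # cos(theta - phi) component
    y_rot = c * y - s * x   # sin(theta - phi) component
    return np.stack([w, y_rot, z, x_rot])


# Usage: a moving source is steered to the front using its (here known) trajectory.
signal = np.sin(np.linspace(0.0, 1.0, 8))
trajectory = np.linspace(0.0, np.pi, 8)  # hypothetical azimuth track in radians
foa = encode_plane_wave(signal, trajectory)
steered = rotate_foa_to_front(foa, trajectory)
```

When the trajectory matches the true source direction, the steered X channel carries the source signal and the Y channel vanishes, which is the property a subsequent enhancement stage could exploit. In the setting described in the abstract, the trajectory would come from the tracking algorithm rather than being known in advance.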