Recent advances in deep spatial filtering for Ambisonics demonstrate strong performance in stationary multi-speaker scenarios by rotating the sound field toward a target speaker prior to multi-channel enhancement. To make this approach applicable in dynamic acoustic conditions with moving speakers, we propose to automate this rotary steering using an interleaved tracking algorithm conditioned on the target's initial direction. However, for nearby or crossing speakers, robust tracking becomes difficult and spatial cues become less effective for enhancement. By incorporating the processed recording as an additional guide into both algorithms, our novel joint autoregressive framework leverages temporal-spectral correlations of speech to resolve spatially challenging speaker constellations. Consequently, our proposed method significantly improves tracking and enhancement of closely spaced speakers, consistently outperforming comparable non-autoregressive methods on a synthetic dataset. Real-world recordings complement these findings in complex scenarios with multiple speaker crossings and varying speaker-to-array distances.