The goal of this paper is to provide a new perspective on audio-visual target speaker extraction (AV-TSE) by decoupling separation from target selection. Conventional AV-TSE systems typically fuse audio and visual features deeply and re-learn the entire separation process, so the noise inherent in in-the-wild audio-visual datasets can impose a ceiling on output fidelity. To address this, we propose Plug-and-Steer, which assigns high-fidelity separation to a frozen audio-only backbone and restricts the role of the visual modality strictly to target selection. We introduce the Latent Steering Matrix (LSM), a minimalist linear transformation that re-routes latent features within the backbone to anchor the target speaker to a designated channel. Experiments across four representative architectures show that our method effectively preserves the acoustic priors of diverse backbones, achieving perceptual quality comparable to that of the original backbones. Audio samples are available at: https://plugandsteer.github.io
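To make the idea of re-routing latent features concrete, the following is a minimal sketch of a linear map that moves a chosen stream to a designated channel. It is not the paper's implementation: the `(streams, dim, time)` latent layout, the hard permutation form of the matrix, and the names `steering_matrix` and `steer` are all assumptions made for illustration. In the proposed method the matrix would be driven by the visual cue, which is replaced here by a given `target_index`.

```python
import numpy as np


def steering_matrix(target_index: int, num_streams: int) -> np.ndarray:
    """Hypothetical hard steering matrix (an assumption, not the paper's LSM).

    Returns a permutation matrix that moves stream `target_index` to row 0
    and keeps the remaining streams in their original order.
    """
    order = [target_index] + [i for i in range(num_streams) if i != target_index]
    return np.eye(num_streams)[order]


def steer(latent: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Apply a (streams x streams) linear map across the stream axis.

    `latent` is assumed to have shape (streams, dim, time).
    """
    return np.einsum("ij,jdt->idt", matrix, latent)


# Toy latent features for two speaker streams (random stand-in data).
rng = np.random.default_rng(0)
latent = rng.standard_normal((2, 4, 5))

# Pretend the visual cue identified stream 1 as the target speaker.
steered = steer(latent, steering_matrix(target_index=1, num_streams=2))
```

Because the map is linear and acts only on the stream axis, the backbone's weights stay untouched in this sketch; only the assignment of streams to output channels changes. A learned matrix could take soft, non-permutation values, which this example does not cover.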