Audio-Visual Speaker Detection (AVSD) hinges on modeling both individual temporal continuity and inter-personal social context. Existing coupled architectures struggle to reconcile these two tasks in a shared representation space because their inductive biases conflict: temporal modeling favors low-frequency smoothness, while inter-personal interaction requires high-frequency discriminability. We propose D$^2$Stream, a decoupled dual-stream framework that explicitly isolates these functions into parallel, task-specific branches. Specifically, the Intra-speaker Temporal Continuity (ITC) stream captures longitudinal stability, whereas the Inter-personal Social Relation (ISR) stream models transversal social cues. Quantitative gradient analysis shows that the update directions diverge as training progresses, with the angle between them stabilizing at 86.1°, which supports both the inherent task conflict and the effectiveness of our structural decoupling. D$^2$Stream breaks the long-standing performance plateau, achieving a state-of-the-art 95.6% mAP on AVA-ActiveSpeaker and superior generalization on Columbia ASD, all within a lightweight and efficient design.
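The abstract reports an angle of 86.1° between update directions but does not say how it is measured. As a minimal sketch, assuming the angle is taken between two flattened gradient vectors (for example, the gradients of two objectives with respect to the same parameters), it could be computed as below. The function name `gradient_angle_deg` and the toy vectors are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def gradient_angle_deg(g_a: Sequence[float], g_b: Sequence[float]) -> float:
    """Return the angle in degrees between two flattened gradient vectors.

    Assumption: both gradients are taken with respect to the same
    parameters and flattened in the same order.
    """
    if len(g_a) != len(g_b):
        raise ValueError("gradient vectors must have the same length")

    dot = sum(a * b for a, b in zip(g_a, g_b))
    norm_a = math.sqrt(sum(a * a for a in g_a))
    norm_b = math.sqrt(sum(b * b for b in g_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("angle is undefined for a zero gradient")

    # Clamp to [-1, 1] so floating-point rounding cannot break acos.
    cos_sim = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos_sim))


# Toy vectors (hypothetical values, for illustration only).
orthogonal = gradient_angle_deg([1.0, 0.0], [0.0, 1.0])  # no shared direction
aligned = gradient_angle_deg([1.0, 0.0], [2.0, 0.0])     # same direction
opposed = gradient_angle_deg([1.0, 0.0], [-1.0, 0.0])    # opposite direction
```

Under this reading, an angle of 86.1° corresponds to a cosine similarity of about 0.068, so the two update directions are close to orthogonal: an update that helps one objective does almost nothing for the other, though it does not strongly oppose it either.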