This paper addresses the challenging task of Active Speaker Detection (ASD), in which a system must determine in real time whether a person is speaking in a sequence of video frames. While previous work has made significant strides in improving network architectures and learning effective representations for ASD, the deployment of real-time systems remains largely unexplored. Existing models often suffer from high latency and high memory usage, rendering them impractical for immediate applications. To bridge this gap, we present two scenarios that address the key challenges posed by real-time constraints. First, we introduce a method that limits the number of future context frames the ASD model may use. This removes the need to process the entire sequence of future frames before a decision is made, significantly reducing latency. Second, we propose a more stringent constraint that limits the total number of past frames the model can access during inference, addressing the persistent memory issues of running streaming ASD systems. Beyond these theoretical frameworks, we conduct extensive experiments to validate our approach. Our results demonstrate that constrained transformer models can match or exceed the performance of state-of-the-art recurrent models, such as uni-directional GRUs, with a significantly reduced number of context frames. Moreover, we shed light on the temporal memory requirements of ASD systems, showing that past context has a more pronounced impact on accuracy than future context. When profiling on a CPU, we find that our efficient architecture is memory bound by the amount of past context it can use, and that compute cost is negligible compared to memory cost.
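The two constraints above can be pictured as a banded attention mask: each frame may attend to at most a fixed number of past frames and a fixed number of future frames. The following is a minimal sketch of such a mask; the function name and the `past`/`future` parameters are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def banded_attention_mask(seq_len: int, past: int, future: int) -> np.ndarray:
    """Boolean mask where position i may attend to position j
    iff i - past <= j <= i + future.

    `past` and `future` are hypothetical knobs standing in for the
    paper's past/future context limits; setting future=0 yields a
    fully streaming (causal) model, and a finite `past` bounds the
    memory needed to cache prior frames.
    """
    idx = np.arange(seq_len)
    rel = idx[None, :] - idx[:, None]  # rel[i, j] = j - i (signed offset)
    return (rel >= -past) & (rel <= future)

# Example: 6 frames, each sees up to 2 past frames and 1 future frame.
mask = banded_attention_mask(6, past=2, future=1)
```

In practice this mask would be applied to the attention logits (masked positions set to negative infinity) before the softmax; the trade-off the paper studies is how small `past` and `future` can be made before accuracy degrades.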