Voice Activity Detection (VAD) is the process of automatically determining whether a person is speaking and identifying the timing of their speech in audiovisual data. Traditionally, this task has been tackled by processing either audio signals or visual data, or by combining both modalities through fusion or joint learning. In our study, drawing inspiration from recent advancements in vision-language models, we introduce a novel approach that leverages Contrastive Language-Image Pretraining (CLIP) models. The CLIP visual encoder analyzes video segments showing a person's upper body, while the text encoder handles textual descriptions automatically generated through prompt engineering. Embeddings from these encoders are then fused through a deep neural network to perform VAD. Our experimental analysis across three VAD benchmarks demonstrates the superior performance of our method compared to existing visual VAD approaches. Notably, our approach outperforms several audio-visual methods despite its simplicity, and without requiring pre-training on extensive audio-visual datasets.
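The fusion step described above can be sketched as follows. This is a minimal illustration, not the paper's exact architecture: the embedding dimension, the MLP layout, and the two-class (speaking / not speaking) head are all assumptions, and the stand-in random tensors take the place of embeddings that would come from frozen CLIP visual and text encoders.

```python
import torch
import torch.nn as nn

class CLIPFusionVAD(nn.Module):
    """Hypothetical fusion head: concatenates CLIP visual and text
    embeddings and classifies speaking vs. not-speaking."""
    def __init__(self, embed_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),  # fused input: visual + text
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),  # two classes: speaking / silent
        )

    def forward(self, visual_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Late fusion by concatenation along the feature dimension
        fused = torch.cat([visual_emb, text_emb], dim=-1)
        return self.mlp(fused)

# Stand-in embeddings; in practice these would be produced by the
# (frozen) CLIP image and text encoders for a video segment and its
# prompt-engineered description.
visual = torch.randn(4, 512)
text = torch.randn(4, 512)
logits = CLIPFusionVAD()(visual, text)
print(logits.shape)  # torch.Size([4, 2])
```

A per-segment prediction is then obtained by taking the argmax (or a thresholded softmax) over the two logits for each item in the batch.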