Temporal video grounding (TVG) is a critical task in video content understanding, requiring precise alignment between video content and natural language instructions. Despite significant advancements, existing methods face challenges in managing confidence bias towards salient objects and in capturing long-term dependencies in video sequences. To address these issues, we introduce SpikeMba: a multi-modal spiking saliency Mamba for temporal video grounding. Our approach integrates Spiking Neural Networks (SNNs) with state space models (SSMs) to leverage their unique advantages in handling different aspects of the task. Specifically, we use SNNs to develop a spiking saliency detector that generates the proposal set. The detector emits spike signals when the input signal exceeds a predefined threshold, resulting in a dynamic and binary saliency proposal set. To enhance the model's capability to retain and infer contextual information, we introduce relevant slots, learnable tensors that encode prior knowledge. These slots work with the contextual moment reasoner to maintain a balance between preserving contextual information and dynamically exploring semantic relevance. The SSMs facilitate selective information propagation, addressing the challenge of long-term dependencies in video content. By combining SNNs for proposal generation and SSMs for effective contextual reasoning, SpikeMba mitigates confidence bias, handles long-term dependencies, and thereby significantly enhances fine-grained multimodal relationship capture. Our experiments demonstrate the effectiveness of SpikeMba, which consistently outperforms state-of-the-art methods across mainstream benchmarks.
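The thresholded spiking behavior described above can be illustrated with a minimal sketch of a leaky integrate-and-fire (LIF) neuron, the standard SNN building block: the membrane potential accumulates the input signal, and a binary spike is emitted only when it crosses a threshold. The `threshold` and `decay` parameters here are illustrative placeholders, not the paper's actual detector configuration.

```python
def lif_spikes(inputs, threshold=1.0, decay=0.5):
    """Toy leaky integrate-and-fire neuron (illustrative, not SpikeMba's detector).

    The membrane potential leaks by `decay` each step and integrates the
    input; a binary spike (1) fires when the potential crosses `threshold`,
    after which the potential is hard-reset to zero.
    """
    v, spikes = 0.0, []
    for x in inputs:
        v = decay * v + x          # leaky integration of the input signal
        if v >= threshold:         # fire when potential exceeds threshold
            spikes.append(1)
            v = 0.0                # hard reset after a spike
        else:
            spikes.append(0)
    return spikes

# Only the fourth frame pushes the potential over the 1.0 threshold.
print(lif_spikes([0.6, 0.6, 0.2, 0.9, 0.1]))  # → [0, 0, 0, 1, 0]
```

Applied per video frame, such a binary spike train plays the role of the dynamic saliency proposal set: a 0/1 mask over time marking candidate salient moments.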