Humans naturally understand moments in a video by integrating visual and auditory cues. For example, localizing a scene such as "A scientist passionately speaks on wildlife conservation as dramatic orchestral music plays, with the audience nodding and applauding" requires simultaneously processing visual, audio, and speech signals. However, existing models often struggle to effectively fuse and interpret audio information, limiting their capacity for comprehensive video temporal understanding. To address this, we present TriSense, a triple-modality large language model designed for holistic video temporal understanding through the integration of visual, audio, and speech modalities. Central to TriSense is a Query-Based Connector that adaptively reweights modality contributions based on the input query, enabling robust performance under modality dropout and allowing flexible combinations of available inputs. To support TriSense's multimodal capabilities, we introduce TriSense-2M, a high-quality dataset of over 2 million curated samples generated via an automated pipeline powered by fine-tuned LLMs. TriSense-2M includes long-form videos and diverse modality combinations, facilitating broad generalization. Extensive experiments across multiple benchmarks demonstrate the effectiveness of TriSense and its potential to advance multimodal video analysis. Code and dataset will be publicly released.
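As a rough illustration of the query-conditioned reweighting idea described above, the PyTorch sketch below gates pooled visual, audio, and speech features with softmax weights derived from a query embedding, masking out any dropped modality so the remaining ones absorb its weight. It is a minimal sketch under our own assumptions, not the paper's actual architecture; the module and parameter names (QueryBasedConnector, score, query_proj) are hypothetical.

```python
# Hypothetical sketch of query-conditioned modality reweighting; not the paper's exact design.
from typing import Dict, Optional

import torch
import torch.nn as nn


class QueryBasedConnector(nn.Module):
    """Fuses visual, audio, and speech features with weights derived from the query.

    Missing modalities are masked before the softmax, so the weights are
    redistributed over whatever inputs are actually available (modality dropout).
    Assumes at least one modality is present per sample.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)  # projects the pooled query embedding
        self.score = nn.Linear(dim, 1)         # scores each (query, modality) pair

    def forward(
        self,
        query: torch.Tensor,                              # (B, D) pooled text-query embedding
        modalities: Dict[str, Optional[torch.Tensor]],    # name -> (B, T, D) tokens, or None if dropped
    ) -> torch.Tensor:
        q = self.query_proj(query)                        # (B, D)
        pooled, present = [], []
        for name in ("visual", "audio", "speech"):
            feats = modalities.get(name)
            if feats is None:                             # modality dropout: placeholder + mask out
                pooled.append(torch.zeros_like(q))
                present.append(torch.zeros(q.size(0), dtype=torch.bool, device=q.device))
            else:
                pooled.append(feats.mean(dim=1))          # (B, D) mean-pooled modality tokens
                present.append(torch.ones(q.size(0), dtype=torch.bool, device=q.device))
        pooled = torch.stack(pooled, dim=1)               # (B, 3, D)
        present = torch.stack(present, dim=1)             # (B, 3)
        scores = self.score(torch.tanh(pooled + q.unsqueeze(1))).squeeze(-1)  # (B, 3)
        scores = scores.masked_fill(~present, float("-inf"))                  # ignore missing modalities
        weights = scores.softmax(dim=-1)                  # query-dependent modality weights
        return (weights.unsqueeze(-1) * pooled).sum(dim=1)  # (B, D) fused representation
```

Under this reading, a query about speech content would push weight toward the speech features while a query about background music would favor audio, and dropping a modality simply removes its column from the softmax.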