Multimodal large language models (MLLMs) have recently achieved remarkable progress in video question answering (VideoQA) by jointly processing visual, textual, and audio information. However, it remains unclear which video representations are most effective for MLLMs, and how different modalities trade off task accuracy against computational efficiency. In this work, we present a comprehensive empirical study of video representation methods for VideoQA with MLLMs. We systematically evaluate single-modality inputs (question only, subtitles, visual frames, and audio signals) as well as multimodal combinations on two widely used benchmarks: VideoMME and LongVideoBench. Our results show that visual frames substantially improve accuracy but incur heavy costs in GPU memory and inference latency, whereas subtitles provide a lightweight yet effective alternative, particularly for long videos. These findings highlight clear trade-offs between effectiveness and efficiency and offer practical guidance for designing resource-aware MLLM-based VideoQA systems.