This paper presents StreamChat, a novel approach that enhances the interaction capabilities of Large Multimodal Models (LMMs) with streaming video content. In streaming interaction scenarios, existing methods rely solely on the visual information available at the moment a question is posed, so the model remains unaware of subsequent changes in the streaming video and its responses quickly become stale. StreamChat addresses this limitation by updating the visual context at each decoding step, ensuring that the model utilizes up-to-date video content throughout the decoding process. Additionally, we introduce a flexible and efficient cross-attention-based architecture to process dynamic streaming inputs while maintaining inference efficiency for streaming interactions. Furthermore, we construct a new dense instruction dataset to facilitate the training of streaming interaction models, complemented by a parallel 3D-RoPE mechanism that encodes the relative temporal information of visual and text tokens. Experimental results demonstrate that StreamChat achieves competitive performance on established image and video benchmarks and exhibits superior capabilities in streaming interaction scenarios compared to state-of-the-art video LMMs.
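To make the core idea concrete, the following is a minimal sketch, not the paper's implementation, of a decoding loop that refreshes the visual context at every step and attends to it via cross-attention. All names here (StreamingCrossAttnDecoderStep, stream_decode, get_latest_visual_tokens) are hypothetical illustrations under the assumption that frame features arrive continuously from a streaming encoder.

```python
import torch
import torch.nn as nn


class StreamingCrossAttnDecoderStep(nn.Module):
    """One decoder step: the current text state cross-attends to visual tokens.

    Hypothetical stand-in for the paper's cross-attention-based architecture;
    a real model would interleave this with self-attention and an LM head.
    """

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text_state: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # text_state: (B, 1, D) query; visual_tokens: (B, Nv, D) keys/values.
        out, _ = self.cross_attn(text_state, visual_tokens, visual_tokens)
        return out


def stream_decode(step_module, get_latest_visual_tokens, init_state, num_steps):
    """Decoding loop that re-fetches visual tokens at every step, so each
    generated token is conditioned on the most recent video content rather
    than a snapshot frozen at question time."""
    state = init_state
    outputs = []
    for t in range(num_steps):
        visual = get_latest_visual_tokens(t)  # updated per decoding step
        state = step_module(state, visual)
        outputs.append(state)
    return torch.cat(outputs, dim=1)


if __name__ == "__main__":
    B, D, Nv = 2, 256, 8
    step = StreamingCrossAttnDecoderStep(d_model=D)
    # Stand-in for a streaming encoder returning features of the newest frames.
    latest = lambda t: torch.randn(B, Nv, D)
    out = stream_decode(step, latest, torch.randn(B, 1, D), num_steps=5)
    print(out.shape)  # torch.Size([2, 5, 256])
```

The key design point illustrated is that the visual key/value set is a function of the decoding step index, not of the question timestamp; everything else (the single attention layer, random features) is simplified for brevity.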