Creating AI systems that can interact with environments over long periods, similar to human cognition, has been a longstanding research goal. Recent advancements in multimodal large language models (MLLMs) have led to significant strides in open-world understanding. However, the challenge of continuous and simultaneous streaming perception, memory, and reasoning remains largely unexplored. Current MLLMs are constrained by their sequence-to-sequence architecture, which prevents them from processing inputs and generating responses simultaneously, akin to being unable to think while perceiving. Furthermore, relying on long contexts to store historical data is impractical for long-term interactions, as retaining all information becomes costly and inefficient.

Therefore, rather than relying on a single foundation model to perform all functions, this project draws inspiration from the concept of Specialized Generalist AI and introduces disentangled streaming perception, reasoning, and memory mechanisms, enabling real-time interaction with streaming video and audio input. The proposed framework, InternLM-XComposer2.5-OmniLive (IXC2.5-OL), consists of three key modules:

(1) Streaming Perception Module: processes multimodal information in real time, stores key details in memory, and triggers reasoning in response to user queries.
(2) Multimodal Long Memory Module: integrates short-term and long-term memory, compressing short-term memories into long-term ones for efficient retrieval and improved accuracy.
(3) Reasoning Module: responds to queries and executes reasoning tasks, coordinating with the perception and memory modules.

This design simulates human-like cognition, enabling multimodal large language models to provide continuous and adaptive service over time.
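The disentangled three-module design described above can be sketched in a few lines of plain Python. This is a minimal, hypothetical illustration of the control flow only: all class and method names (`StreamingPerceptionModule`, `store_short_term`, `compress`, etc.) are invented for this sketch and are not the actual IXC2.5-OL API, and the learned encoders, memory compression, and retrieval are replaced with trivial string operations.

```python
class MultimodalLongMemoryModule:
    """Holds short-term entries and compresses them into long-term memory.

    Illustrative stand-in: real compression/retrieval would be learned,
    not string joins and keyword matching.
    """

    def __init__(self, short_term_capacity=2):
        self.short_term = []
        self.long_term = []
        self.capacity = short_term_capacity

    def store_short_term(self, item):
        self.short_term.append(item)
        # Once the short-term window fills, fold it into long-term memory.
        if len(self.short_term) >= self.capacity:
            self.compress()

    def compress(self):
        # Stand-in for learned compression: summarize the short-term
        # window into a single long-term record.
        self.long_term.append(" | ".join(self.short_term))
        self.short_term.clear()

    def retrieve(self, query):
        # Naive keyword retrieval in place of learned retrieval.
        return [m for m in self.long_term + self.short_term if query in m]


class StreamingPerceptionModule:
    """Ingests stream clips and writes key details into memory."""

    def __init__(self, memory):
        self.memory = memory

    def perceive(self, clip):
        # A real system would run audio/visual encoders here; we treat
        # the clip text itself as the extracted key detail.
        self.memory.store_short_term(clip)


class ReasoningModule:
    """Answers user queries using memory retrieved on demand."""

    def __init__(self, memory):
        self.memory = memory

    def answer(self, query):
        context = self.memory.retrieve(query)
        if not context:
            return "no relevant memory"
        return f"based on memory: {context[-1]}"


memory = MultimodalLongMemoryModule(short_term_capacity=2)
perception = StreamingPerceptionModule(memory)
reasoner = ReasoningModule(memory)

# Perception keeps ingesting the stream; reasoning can be triggered
# at any point by a user query, independent of perception.
for clip in ["cat enters room", "cat sits on sofa", "dog barks outside"]:
    perception.perceive(clip)

print(reasoner.answer("cat"))
```

In the real system the three modules run concurrently rather than in a single loop, which is what lets the model "think while perceiving"; the sketch keeps them sequential only to stay self-contained.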