Modern visual agents require representations that are general, causal, and physically structured to operate in real-time streaming environments. However, current vision foundation models remain fragmented, each specializing narrowly in semantic image perception, offline temporal modeling, or spatial geometry. This paper introduces OmniStream, a unified streaming visual backbone that perceives, reconstructs, and acts from diverse visual inputs. By combining causal spatiotemporal attention with 3D rotary positional embeddings (3D-RoPE), the model supports efficient frame-by-frame online processing of video streams via a persistent KV-cache. We pre-train OmniStream with a synergistic multi-task framework that couples static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment across 29 datasets. Extensive evaluations show that, even with a strictly frozen backbone, OmniStream performs competitively with specialized expert models across image and video probing, streaming geometric reconstruction, complex video and spatial reasoning, and robotic manipulation (unseen during training). Rather than pursuing benchmark-specific dominance, our work demonstrates the viability of a single, versatile vision backbone that generalizes across semantic, spatial, and temporal reasoning: a meaningful step toward general-purpose visual understanding for interactive and embodied agents.
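To make the streaming mechanism concrete, the sketch below shows how causal frame-by-frame attention with a persistent KV-cache and 3D-RoPE can fit together. This is a minimal, single-head illustration in PyTorch under common assumptions (a standard rotary formulation with the head dimension split evenly across the time/height/width axes); `rope_1d`, `rope_3d`, and `StreamingCausalAttention` are hypothetical names, not OmniStream's released API.

```python
# Minimal single-head sketch of causal streaming attention with 3D-RoPE
# and a persistent KV-cache. Illustrative only; not the paper's code.
import torch
import torch.nn.functional as F


def rope_1d(x: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
    """Standard rotary embedding along one axis.
    x: (..., n, d) with d even; pos: (n,) integer positions."""
    d = x.shape[-1]
    freqs = 1.0 / (10000 ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    ang = pos.float()[:, None] * freqs[None, :]          # (n, d/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


def rope_3d(x, t, h, w):
    """3D-RoPE (assumed layout): split the head dim into three equal
    chunks and rotate each by the token's time/height/width index.
    Requires head_dim divisible by 3, with each chunk even."""
    d = x.shape[-1] // 3
    return torch.cat([rope_1d(x[..., :d], t),
                      rope_1d(x[..., d:2 * d], h),
                      rope_1d(x[..., 2 * d:], w)], dim=-1)


class StreamingCausalAttention:
    """Keys/values of past frames persist in a KV-cache, so each new
    frame costs one attention pass over (past + current) tokens."""

    def __init__(self):
        self.k_cache, self.v_cache = [], []

    def step(self, q, k, v, t, h, w):
        # q, k, v: (n_tokens, head_dim) for the *current* frame only.
        q = rope_3d(q, t, h, w)
        self.k_cache.append(rope_3d(k, t, h, w))
        self.v_cache.append(v)
        K = torch.cat(self.k_cache, dim=0)   # all frames seen so far
        V = torch.cat(self.v_cache, dim=0)
        # Causality is structural: the cache holds no future tokens.
        attn = F.softmax(q @ K.T / K.shape[-1] ** 0.5, dim=-1)
        return attn @ V


# Usage: 2x2-token frames, head_dim = 6 (2 rotary dims per axis).
attn = StreamingCausalAttention()
for t_idx in range(3):
    n, dh = 4, 6
    q, k, v = (torch.randn(n, dh) for _ in range(3))
    t = torch.full((n,), t_idx)
    h = torch.tensor([0, 0, 1, 1])
    w = torch.tensor([0, 1, 0, 1])
    out = attn.step(q, k, v, t, h, w)        # (4, 6) features per frame
```

Because past keys and values are reused rather than recomputed, each incoming frame requires only one attention pass over the tokens seen so far, which is what makes frame-by-frame online processing of long streams tractable.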