The autonomous evolution of networked AI systems relies heavily on robust environmental perception. However, physical understanding remains brittle in current models because key physical signals are visually ambiguous and sparsely represented in web-scale data. To bridge the gap between data-centric learning and knowledge-based physical rules, we present OmniFysics, a compact omni-modal network that unifies signal processing and understanding across images, audio, video, and text. To enable autonomous optimization and inject explicit physical knowledge, we construct a dynamic physical data engine. Within this engine, FysicsAny is an adaptive mechanism that produces physics-grounded supervision by mapping salient objects to verified physical attributes through hierarchical retrieval and physics-law-constrained signal verification. In parallel, FysicsOmniCap distills web videos using audio-visual cross-modal signal processing, generating high-fidelity data pairs that emphasize dynamic physical cues. We optimize the OmniFysics network through staged multimodal alignment and evolutive instruction tuning, integrating latent-space flow matching for generation and an adaptive intent router for efficient execution. Experiments show that this evolutive optimization paradigm achieves competitive performance on standard multimodal benchmarks and significantly improves results on physics-oriented evaluations.