4D Synchronized Fields: Motion-Language Gaussian Splatting for Temporal Scene Understanding

Current 4D representations decouple geometry, motion, and semantics: reconstruction methods discard interpretable motion structure; language-grounded methods attach semantics after motion is learned, blind to how objects move; and motion-aware methods encode dynamics as opaque per-point residuals without object-level organization. We propose 4D Synchronized Fields, a 4D Gaussian representation that learns object-factored motion in-loop during reconstruction and synchronizes language to the resulting kinematics through a per-object conditioned field. Each Gaussian trajectory is decomposed into shared object motion plus an implicit residual, and a kinematic-conditioned ridge map predicts temporal semantic variation. The result is a single representation in which reconstruction, motion, and semantics are structurally coupled, enabling open-vocabulary temporal queries that retrieve both objects and moments. On HyperNeRF, 4D Synchronized Fields achieves 28.52 dB mean PSNR, the highest among all language-grounded and motion-aware baselines and within 1.5 dB of reconstruction-only methods. On targeted temporal-state retrieval, the kinematic-conditioned field attains 0.884 mean accuracy, 0.815 mean vIoU, and 0.733 mean tIoU, surpassing 4D LangSplat (0.620, 0.433, and 0.439, respectively) and LangSplat (0.415, 0.304, and 0.262). Ablation confirms that kinematic conditioning is the primary driver, contributing +0.45 tIoU over a static-embedding-only baseline. 4D Synchronized Fields is the only method that jointly exposes interpretable motion primitives and temporally grounded language fields from a single trained representation. Code will be released.
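
To make the two mechanisms named above concrete, the following is a minimal NumPy sketch, not the released implementation. It assumes that the shared object motion is a single rigid transform per object, that the residual can be written explicitly as the difference between observed and rigidly moved Gaussian centers (the abstract describes the residual as implicit, so this is a simplification), and that the "ridge map" is a closed-form ridge regression from kinematic features to semantic embedding offsets. The function names `decompose_trajectory` and `fit_ridge_map`, the feature layout, and the regularization weight `lam` are all hypothetical.

```python
import numpy as np


def rigid_object_motion(points, rotation, translation):
    """Apply one shared rigid transform (R, t) to every Gaussian center of an object.

    points: (N, 3) canonical centers, rotation: (3, 3), translation: (3,).
    Assumption: object motion at a given time is rigid.
    """
    return points @ rotation.T + translation


def decompose_trajectory(canonical, observed, rotation, translation):
    """Split observed centers into shared object motion plus a per-Gaussian residual.

    Here the residual is simply what the rigid transform does not explain;
    in a learned system it would more likely be predicted by a network.
    """
    shared = rigid_object_motion(canonical, rotation, translation)
    residual = observed - shared
    return shared, residual


def fit_ridge_map(kinematics, semantic_delta, lam=1e-2):
    """Closed-form ridge regression W = (X^T X + lam * I)^{-1} X^T Y.

    kinematics: (T, d) per-timestep kinematic features of one object (hypothetical layout).
    semantic_delta: (T, k) offsets of the language embedding from its static value.
    Returns W with shape (d, k), so that kinematics @ W approximates semantic_delta.
    """
    X = np.asarray(kinematics, dtype=float)
    Y = np.asarray(semantic_delta, dtype=float)
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Toy object: 4 Gaussians rotated 90 degrees about z, shifted, plus small residuals.
    canonical = rng.normal(size=(4, 3))
    R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    t = np.array([0.5, 0.0, 0.0])
    observed = canonical @ R.T + t + 0.01 * rng.normal(size=(4, 3))
    shared, residual = decompose_trajectory(canonical, observed, R, t)
    print("max residual magnitude:", np.abs(residual).max())

    # Toy kinematic features over 20 timesteps mapped to 8-dim embedding offsets.
    kin = rng.normal(size=(20, 6))
    delta = kin @ rng.normal(size=(6, 8))
    W = fit_ridge_map(kin, delta, lam=1e-2)
    print("ridge map shape:", W.shape)
```

In this sketch, a time-varying embedding for an object would be formed as its static embedding plus `kinematics @ W`, which is one plausible reading of conditioning language on kinematics; the regularizer `lam` keeps the map stable when only a few timesteps are available per object.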
