Despite recent advances, Vision Language Models (VLMs) still struggle to grasp the dynamics of the world. Reasoning about a 4D scene is challenging in itself, and we note that it is further complicated by two factors. First, VLMs observe motion only indirectly, through its projection onto 2D images. Second, existing datasets fail to disentangle object motion from camera motion. To address these challenges, we present a QA generation pipeline that focuses on motion-related scene understanding. We pay particular attention to the entanglement of camera and object motion by casting tracking both in the traditional way and in a novel fixed reference system, dubbed True-Motion Tracking, which provides an intuitive description of motion. From this pipeline, we generate a large-scale training dataset of 400K samples, 4DP-QA (4D Perception QA), and a 2.2K-sample benchmark, 4DP-QA-Bench. Training existing models on our dataset improves their performance on an external benchmark, validating the effectiveness of our method.