Humans navigate their environment by learning a mental model of the world through passive observation and active interaction. This world model allows them to anticipate what might happen next and to act accordingly with respect to an underlying objective. Such world models hold strong promise for planning in complex environments such as autonomous driving. A human driver, or a self-driving system, perceives the surroundings through eyes or cameras. From these observations they infer an internal representation of the world, which should: (i) have spatial memory (e.g. to handle occlusions), (ii) fill in partially observed or noisy inputs (e.g. when blinded by sunlight), and (iii) be able to reason probabilistically about unobservable events (e.g. predict different possible futures). They are embodied intelligent agents that can predict, plan, and act in the physical world through their world model. In this thesis we present a general framework to train a world model and a policy, parameterised by deep neural networks, from camera observations and expert demonstrations. We leverage important computer vision concepts such as geometry, semantics, and motion to scale world models to complex urban driving scenes. First, we propose a model that predicts important quantities in computer vision: depth, semantic segmentation, and optical flow. We then use 3D geometry as an inductive bias to operate in the bird's-eye view space. We present for the first time a model that can predict probabilistic future trajectories of dynamic agents in bird's-eye view from 360° surround monocular cameras only. Finally, we demonstrate the benefits of learning a world model in closed-loop driving. Our model can jointly predict static scene, dynamic scene, and ego-behaviour in an urban driving environment.