Learning 3D Robotics Perception using Inductive Priors

Recent advances in deep learning have led to data-centric intelligence, i.e., artificially intelligent models that ingest large amounts of data and perform well on digital tasks such as text-to-image generation, human-machine conversation, and image recognition. This thesis covers learning with structured inductive biases and priors in order to design approaches and algorithms that unlock the potential of principle-centric intelligence. Prior knowledge (priors for short), often available as past experience and as assumptions about how the world works, helps autonomous agents generalize better and adapt their behavior. In this thesis, I demonstrate the use of prior knowledge in three robotics perception problems: (1) object-centric 3D reconstruction, (2) vision and language for decision-making, and (3) 3D scene understanding. To solve these challenging problems, I propose various sources of prior knowledge, including (1) geometry and appearance priors from synthetic data, (2) modularity and semantic map priors, and (3) semantic, structural, and contextual priors. I study these priors for solving robotics 3D perception tasks and propose ways to encode them efficiently in deep learning models. Some priors are used to warm-start the network for transfer learning; others are used as hard constraints that restrict the action space of robotic agents. Whereas classical techniques are brittle and fail to generalize to unseen scenarios, and data-centric approaches require large amounts of labeled data, this thesis aims to build intelligent agents that require very little real-world data, or data acquired only from simulation, to generalize to highly dynamic and cluttered environments in novel simulations (i.e., sim2sim) or unseen real-world environments (i.e., sim2real), toward a holistic scene understanding of the 3D world.
