We address the problem of learning representations from observations of a scene involving an agent and an external object the agent interacts with. To this end, we propose a representation learning framework that extracts the locations in physical space of both the agent and the object from unstructured observations of arbitrary nature. Our framework relies on the actions performed by the agent as the only source of supervision, while assuming that the object is displaced by the agent via unknown dynamics. We provide a theoretical foundation and formally prove that an ideal learner is guaranteed to infer an isometric representation, disentangling the agent from the object and correctly extracting their locations. We empirically evaluate our framework on a variety of scenarios, showing that it outperforms vision-based approaches such as a state-of-the-art keypoint extractor. Moreover, we demonstrate how the extracted representations enable the agent to efficiently solve downstream tasks via reinforcement learning.