Object-centric representation is an essential abstraction for forward prediction. Most existing forward models learn this representation through extensive supervision (e.g., object class and bounding box), although such ground-truth information is not readily accessible in reality. To address this, we introduce KINet (Keypoint Interaction Network), an end-to-end unsupervised framework that reasons about object interactions based on a keypoint representation. From visual observations, our model learns to associate objects with keypoint coordinates and discovers a graph representation of the system as a set of keypoint embeddings and their relations. It then learns an action-conditioned forward model using contrastive estimation to predict future keypoint states. By learning to perform physical reasoning in the keypoint space, our model automatically generalizes to scenarios with a different number of objects, novel backgrounds, and unseen object geometries. Experiments demonstrate the effectiveness of our model in accurately performing forward prediction and learning plannable object-centric representations for downstream robotic pushing manipulation tasks.
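To make the phrase "contrastive estimation to predict future keypoint states" concrete, here is a minimal sketch of a generic InfoNCE-style loss over predicted and observed keypoint states. It is an illustration under stated assumptions, not the loss used in KINet: the function name `contrastive_forward_loss`, the tensor layout `(batch, num_keypoints, dim)`, the squared-distance similarity, and the `temperature` value are all hypothetical choices. Each predicted future state is treated as a positive match for the observed future state of the same sample and as a negative for every other sample in the batch.

```python
import numpy as np


def contrastive_forward_loss(pred_next, true_next, temperature=0.1):
    """InfoNCE-style loss between predicted and observed keypoint states.

    pred_next, true_next: arrays of shape (batch, num_keypoints, dim).
    Layout and similarity measure are assumptions made for illustration.
    """
    batch = pred_next.shape[0]
    p = pred_next.reshape(batch, -1)  # flatten keypoints per sample
    t = true_next.reshape(batch, -1)

    # Pairwise squared Euclidean distances between every prediction
    # and every observed future state: shape (batch, batch).
    sq_dist = ((p[:, None, :] - t[None, :, :]) ** 2).sum(axis=-1)

    # Closer pairs get higher logits.
    logits = -sq_dist / temperature

    # Numerically stable log-softmax over candidate future states.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positive for prediction i is the observed state i (the diagonal).
    return float(-np.mean(np.diag(log_prob)))
```

The loss is small when each prediction is closer to its own observed future state than to the states of other samples, which is the property a contrastive forward model is trained to achieve.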