Inspired by the human ability to understand and predict others, we study the applicability of Conditional Neural Processes (CNP) to self-supervised multimodal action prediction in robotics. Following recent results on the ontogeny of the Mirror Neuron System (MNS), we focus on the preliminary objective of self-action prediction. We find a good MNS-inspired model in the existing Deep Modality Blending Network (DMBN), which reconstructs the visuo-motor sensory signal of a partially observed action sequence by leveraging the probabilistic generation of CNP. Through qualitative and quantitative evaluation, we highlight its difficulty in generalizing to unseen action sequences and trace the cause to its internal representation of time. We therefore propose a revised version, termed DMBN-Positional Time Encoding (DMBN-PTE), that facilitates learning a more robust representation of temporal information, and we provide preliminary results on its effectiveness in expanding the applicability of the architecture. DMBN-PTE is a first step toward robotic systems that autonomously learn to forecast actions on longer time scales, refining their predictions with incoming observations.