In this paper, we investigate whether artificial agents can develop a shared language in an ecological setting where communication relies on a sensory-motor channel. To this end, we introduce the Graphical Referential Game (GREG), in which a speaker must produce a graphical utterance to name a visual referent object, while a listener must select the corresponding object among distractor referents given the delivered message. The utterances are drawings produced using dynamical motor primitives combined with a sketching library. To tackle GREG, we present CURVES: a multimodal contrastive deep learning mechanism that represents the energy (alignment) between named referents and utterances, with utterances generated through gradient ascent on the learned energy landscape. We demonstrate that CURVES not only succeeds at solving GREG but also enables agents to self-organize a language that generalizes to feature compositions never seen during training. In addition to evaluating the communication performance of our approach, we explore the structure of the emerging language. Specifically, we show that the resulting language forms a coherent lexicon shared between agents, and that basic compositional rules on the graphical productions cannot explain the compositional generalization.
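To make the speaker and listener roles concrete, the following is a minimal sketch of one round of a referential game driven by an energy function. It is not the CURVES implementation: the bilinear energy, the fixed orthogonal matrix `M` standing in for a learned energy landscape, the unit-norm vectors standing in for referents and for motor-primitive parameters, and all function names (`energy`, `speak`, `listen`, `play_round`) are assumptions made purely for illustration. No drawing or contrastive training is performed here.

```python
import numpy as np

def energy(M, referent, utterance):
    """Hypothetical bilinear alignment score between a referent and an utterance."""
    return referent @ M @ utterance

def speak(M, referent, steps=50, lr=0.5, seed=0):
    """Produce an utterance by projected gradient ascent on the energy."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=M.shape[1])
    u /= np.linalg.norm(u)
    for _ in range(steps):
        grad = M.T @ referent        # dE/du for the bilinear energy
        u = u + lr * grad            # ascent step
        u /= np.linalg.norm(u)       # project back onto the unit sphere
    return u

def listen(M, candidates, utterance):
    """Select the candidate referent with the highest energy for the utterance."""
    scores = candidates @ M @ utterance
    return int(np.argmax(scores))

def play_round(M, candidates, target_index):
    """Speaker names the target; listener picks among target and distractors."""
    utterance = speak(M, candidates[target_index])
    return listen(M, candidates, utterance) == target_index

# Toy setup (assumed): a random orthogonal matrix replaces the learned energy,
# and three unit-norm vectors play the role of one target plus two distractors.
rng = np.random.default_rng(42)
M, _ = np.linalg.qr(rng.normal(size=(4, 4)))
candidates = rng.normal(size=(3, 4))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)

results = [play_round(M, candidates, t) for t in range(3)]
print(results)
```

In this toy setting the listener recovers the target because, for an orthogonal `M` and unit-norm referents, the utterance that maximizes the energy for a referent aligns with that referent more strongly than with any distractor. In the setting the abstract describes, the energy is instead learned contrastively and the utterances are drawings, so this sketch only illustrates the generate-by-ascent and select-by-energy structure.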