This paper poses a novel problem: using vision-based perception to learn and predict the collective dynamics of multi-agent systems, focusing specifically on interaction strength and convergence time. We define multi-agent systems as collections of more than ten interacting agents that exhibit complex group behaviors. Unlike prior studies that assume knowledge of agent positions, we use deep learning models to predict collective dynamics directly from visual data, captured as frames or events. Because no relevant datasets exist, we create a simulated dataset using a state-of-the-art flocking simulator coupled with a vision-to-event conversion framework. We empirically demonstrate that event-based representations outperform traditional frame-based methods in predicting these collective behaviors. Based on our analysis, we present event-based vision for Multi-Agent dynamic Prediction (evMAP), a deep learning architecture designed for real-time, accurate understanding of interaction strength and collective behavior emergence in multi-agent systems.
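The vision-to-event conversion referred to above can be sketched in a minimal form: an event camera fires a signed event at a pixel whenever the log-intensity change since that pixel's last event exceeds a contrast threshold. The function name, threshold value, and event-tuple layout below are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

def frames_to_events(frames, threshold=0.2, eps=1e-6):
    """Convert grayscale frames (T, H, W) in [0, 1] into per-pixel polarity
    events, mimicking an event camera: an event fires wherever the
    log-intensity change since the pixel's last event exceeds `threshold`.
    This is a simplified sketch; real conversion frameworks also model
    timestamps, noise, and refractory periods."""
    log_frames = np.log(frames + eps)          # eps avoids log(0)
    reference = log_frames[0].copy()           # last log intensity that fired, per pixel
    events = []                                # (frame_index, y, x, polarity)
    for t in range(1, len(log_frames)):
        diff = log_frames[t] - reference
        pos = diff >= threshold                # brightness increased enough
        neg = diff <= -threshold               # brightness decreased enough
        for polarity, mask in ((+1, pos), (-1, neg)):
            ys, xs = np.nonzero(mask)
            events.extend((t, int(y), int(x), polarity) for y, x in zip(ys, xs))
            reference[mask] = log_frames[t][mask]  # reset reference where events fired
    return events
```

A static background produces no events at all, while moving agents produce dense event clusters along their trajectories, which is one intuition for why event-based representations can suit the prediction task described above.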