A Dynamic Graph CNN with Cross-Representation Distillation for Event-Based Recognition

A popular solution is to convert events into dense frame-based representations so that readily available pretrained CNNs can be used. Although this line of work performs well, it sacrifices the sparsity and temporal precision of events and usually requires heavyweight models, which largely weakens the advantages and real-world application potential of event cameras. A more application-friendly alternative is to design deep graph models that learn sparse point-based representations from events. Yet the efficacy of these graph models lags far behind that of their frame-based counterparts because of two key limitations: ($i$) simple graph construction strategies that do not carefully integrate the varied attributes of each vertex (i.e., semantics, spatial coordinates, and temporal coordinates), leading to biased graph representations; ($ii$) deficient learning caused by the lack of available pretrained models. We address the first problem by introducing a new event-based graph CNN (EDGCN) with a dynamic aggregation module that adaptively integrates all vertex attributes. To alleviate the learning difficulty, we propose leveraging the dense representation counterpart of events as a cross-representation auxiliary that supplies additional supervision and prior knowledge for the event graph. To this end, we form a frame-to-graph transfer learning framework with a customized hybrid distillation loss that accounts for the varying cross-representation gaps across layers. Extensive experiments on multiple vision tasks validate the effectiveness and strong generalization ability of the proposed model and distillation strategy. (Core components of our code are submitted with the supplementary material and will be made publicly available upon acceptance.)
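To make the idea of a hybrid, layer-aware distillation loss concrete, the following is a minimal dependency-free sketch. It is not the loss defined in this paper, whose exact form is not given in the abstract. It assumes a generic combination of per-layer feature matching (mean squared error) and temperature-softened soft-label matching (KL divergence), and it assumes the frame-based teacher features and graph-based student features have already been pooled to vectors of equal length. All function names, `layer_weights`, `logit_weight`, and `temperature` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(
        pi * math.log(pi / max(qi, 1e-12))  # guard against underflow in q
        for pi, qi in zip(p, q)
        if pi > 0.0
    )
    return kl * temperature ** 2  # conventional T^2 rescaling


def feature_loss(student_feat, teacher_feat):
    """Mean squared error between two equal-length feature vectors."""
    if len(student_feat) != len(teacher_feat):
        raise ValueError("features must be pooled to the same length")
    squared = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(squared) / len(squared)


def hybrid_distillation_loss(student_feats, teacher_feats,
                             student_logits, teacher_logits,
                             layer_weights, logit_weight=1.0,
                             temperature=4.0):
    """Weighted sum of per-layer feature losses plus a soft-label loss.

    Assumption: layer_weights holds one hypothetical weight per matched
    layer, so layers with a larger cross-representation gap can be
    weighted down.
    """
    feat_term = sum(
        w * feature_loss(s, t)
        for w, s, t in zip(layer_weights, student_feats, teacher_feats)
    )
    logit_term = soft_label_loss(student_logits, teacher_logits, temperature)
    return feat_term + logit_weight * logit_term
```

In this sketch, a smaller weight on shallow layers would relax the matching where frame and graph features are expected to differ most, while the soft-label term supervises the final prediction. How the paper actually weights or selects layers is not stated in the abstract.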
