Event cameras provide superior temporal resolution, dynamic range, energy efficiency, and pixel bandwidth. Spiking Neural Networks (SNNs) naturally complement event data through discrete spike signals, making them well suited to event-based tracking. However, current approaches that combine Artificial Neural Networks (ANNs) and SNNs suffer from suboptimal architectures that compromise energy efficiency and limit tracking performance. To address these limitations, we propose the first Transformer-based \textbf{S}pike-\textbf{D}riven \textbf{T}racking (SDTrack) pipeline. The pipeline incorporates a novel event frame aggregation method called Global Trajectory Prompt (GTP) and a Transformer-based tracker. The GTP method captures global trajectory information and aggregates it with event streams into event frames to enhance spatiotemporal representation. The Transformer-based tracker comprises a fully spike-driven SNN backbone and a simple tracking head. The SDTrack pipeline operates end-to-end without data augmentation or post-processing. Extensive experiments demonstrate that our SDTrack-Tiny pipeline achieves competitive accuracy with only 19.61\,M parameters and 8.16\,mJ energy consumption, while our Base version achieves state-of-the-art accuracy across three datasets. Our work establishes a solid foundation for future neuromorphic vision research.