Event cameras capture per-pixel brightness changes with microsecond resolution, offering continuous motion information lost between RGB frames. However, existing event-based motion estimators depend on large-scale synthetic data that often suffers from a significant sim-to-real gap. We propose TETO (Tracking Events with Teacher Observation), a teacher-student framework that learns event motion estimation from only $\sim$25 minutes of unannotated real-world recordings through knowledge distillation from a pretrained RGB tracker. Our motion-aware data curation and query sampling strategy maximizes learning from limited data by disentangling object motion from dominant ego-motion. The resulting estimator jointly predicts point trajectories and dense optical flow, which we leverage as explicit motion priors to condition a pretrained video diffusion transformer for frame interpolation. We achieve state-of-the-art point tracking on EVIMO2 and optical flow on DSEC using orders of magnitude less training data, and demonstrate that accurate motion estimation translates directly to superior frame interpolation quality on BS-ERGB and HQ-EVFI.
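The abstract does not say how the motion-aware query sampling separates object motion from ego-motion, so the following is only an illustrative sketch of one plausible reading, not the TETO implementation. It assumes that the dominant ego-motion can be crudely approximated by the per-axis median of the teacher's displacements, and that query points are then drawn with probability increasing in the residual motion. The names `residual_motion`, `sample_queries`, and the `floor` parameter are hypothetical.

```python
import random
from statistics import median


def residual_motion(flows):
    """Magnitude of each displacement after removing the dominant motion.

    ASSUMPTION: the dominant (ego) motion is approximated by the per-axis
    median displacement. Real ego-motion flow depends on depth and camera
    rotation, so this is a deliberate simplification.
    """
    med_dx = median(dx for dx, _ in flows)
    med_dy = median(dy for _, dy in flows)
    return [((dx - med_dx) ** 2 + (dy - med_dy) ** 2) ** 0.5 for dx, dy in flows]


def sample_queries(points, flows, k, floor=0.05, seed=0):
    """Draw k query points, favouring those with large residual motion.

    `floor` (hypothetical) keeps a small sampling probability for points
    that move with the background. Sampling is with replacement.
    """
    weights = [r + floor for r in residual_motion(flows)]
    rng = random.Random(seed)
    return rng.choices(points, weights=weights, k=k)


# Toy example: four background points share one displacement, one point moves independently.
points = [(10, 10), (20, 10), (30, 10), (40, 10), (25, 30)]
flows = [(1, 0), (1, 0), (1, 0), (1, 0), (5, 3)]
queries = sample_queries(points, flows, k=8)
```

In this toy case the background points have zero residual and the independently moving point has a residual of 5 pixels, so it dominates the sampled queries while the floor still lets background points appear occasionally.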