Always-on egocentric cameras are increasingly used to collect demonstrations for embodied robotics, imitation learning, and assistive AR, but the resulting video streams are dominated by redundant and low-quality frames. Under the storage and battery constraints of wearable devices, choosing which frames to keep is as important as how to learn from them. We observe that modern eye-tracking headsets provide a continuous, training-free side channel that decomposes into two complementary axes: gaze fixation captures visual stability (quality), while pupil response captures arousal-linked moments (novelty). We operationalize this insight as a Dual-Criterion Frame Curator that first gates frames by gaze quality and then ranks the survivors by pupil-derived novelty. On the Visual Experience Dataset (VEDB), curated frames at a 10% budget match the classification performance of the full stream, whereas naive fusion of the two signals consistently destroys both contributions. The benefit is task-dependent: pupil ranking improves activity recognition, while gaze-only selection already dominates for scene recognition, confirming that the two signals serve genuinely different roles. Our method requires no model inference and operates at capture time, offering a path toward efficient, always-on egocentric data curation.
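To make the two-stage selection concrete, the sketch below illustrates one plausible instantiation of the gate-then-rank idea; the signal names, the fixation threshold, and the rolling-baseline pupil-novelty proxy are our assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def curate_frames(gaze_velocity, pupil_diameter, budget=0.10,
                  fixation_thresh=30.0, baseline_win=300):
    """Hypothetical sketch of dual-criterion frame curation.

    gaze_velocity   : per-frame gaze speed (deg/s); low values indicate stable fixation
    pupil_diameter  : per-frame pupil diameter (mm)
    budget          : fraction of frames to keep (e.g. 0.10 for a 10% budget)
    fixation_thresh : assumed gaze-speed cutoff separating fixations from saccades
    baseline_win    : assumed window (frames) for the slow pupil baseline
    """
    gaze_velocity = np.asarray(gaze_velocity, dtype=float)
    pupil_diameter = np.asarray(pupil_diameter, dtype=float)

    # Stage 1: quality gate -- keep only frames captured during stable fixations.
    stable = np.where(gaze_velocity < fixation_thresh)[0]

    # Stage 2: novelty ranking -- score surviving frames by how far the pupil
    # deviates from a slow-moving baseline, a simple arousal proxy.
    win = min(baseline_win, len(pupil_diameter))
    baseline = np.convolve(pupil_diameter, np.ones(win) / win, mode="same")
    novelty = np.abs(pupil_diameter - baseline)

    # Rank gated frames by novelty and keep the top fraction under the budget.
    k = max(1, int(budget * len(gaze_velocity)))
    ranked = stable[np.argsort(-novelty[stable])]
    return np.sort(ranked[:k])
```

Because both signals come directly from the headset's eye tracker, a curator of this form runs at capture time with no model inference, which is the property the method relies on.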