In egocentric action recognition, a single population model is typically trained and then deployed on a head-mounted device, such as an augmented reality headset. Whereas this model remains static for new users and environments, we introduce a two-phase adaptive paradigm in which, after a population model is pretrained, the model adapts on-device and online to the user's experience. This setting is highly challenging because of the shift from the population domain to the user domain and the distribution shifts within the user's data stream. Coping with such in-stream distribution shifts is the focus of continual learning, where progress has been rooted in controlled benchmarks while challenges faced in real-world applications often remain unaddressed. We introduce EgoAdapt, a benchmark for real-world egocentric action recognition that facilitates our two-phase adaptive paradigm. Real-world challenges, such as long-tailed action distributions and large-scale classification over 2740 actions, arise naturally in the egocentric video streams from Ego4d. We also introduce an evaluation framework that directly exploits the user's data stream, with new metrics to measure the adaptation gain over the population model, online generalization, and hindsight performance. In contrast to the single-stream evaluation in existing works, our framework proposes a meta-evaluation that aggregates the results from 50 independent user streams. We provide an extensive empirical study of finetuning and experience replay.
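As a rough illustration of the evaluation idea described in the abstract, the following minimal sketch compares a model that adapts online against a frozen copy of the same pretrained model on each user stream, then averages the difference over streams. Everything here is an assumption made for illustration: the abstract does not give the metric definitions, so `adaptation_gain` is read as "online accuracy of the adapting model minus accuracy of the frozen population model", online accuracy is computed by predicting each sample before learning from it, and `meta_evaluate` is a plain mean over streams. `CountModel` is a hypothetical toy stand-in for a video model, and hindsight performance is not covered.

```python
from statistics import mean


class CountModel:
    """Hypothetical toy classifier: predicts the most frequent label so far."""

    def __init__(self, prior_counts):
        # Copy so each evaluation starts from the same "pretrained" state.
        self.counts = dict(prior_counts)

    def predict(self):
        # max() returns the first label with the highest count on ties.
        return max(self.counts, key=self.counts.get)

    def update(self, label):
        self.counts[label] = self.counts.get(label, 0) + 1


def online_accuracy(model, stream, adapt):
    """Predict each label first, then (optionally) learn from it."""
    correct = 0
    for label in stream:
        correct += int(model.predict() == label)
        if adapt:
            model.update(label)
    return correct / len(stream)


def adaptation_gain(prior_counts, stream):
    """Assumed definition: adapting model minus frozen population model."""
    adapted = online_accuracy(CountModel(prior_counts), stream, adapt=True)
    frozen = online_accuracy(CountModel(prior_counts), stream, adapt=False)
    return adapted - frozen


def meta_evaluate(prior_counts, user_streams):
    """Assumed aggregation: mean gain over independent user streams."""
    return mean(adaptation_gain(prior_counts, s) for s in user_streams)


# Population prior favors "wash"; the first user mostly performs "cut".
prior = {"wash": 3, "cut": 1}
streams = [["cut"] * 6, ["wash"] * 4]
print(meta_evaluate(prior, streams))  # 0.25
```

In this toy run the first user gains 0.5 from adaptation (the frozen model never predicts "cut"), the second user gains nothing because the population model already fits, and the mean over both streams is 0.25. Aggregating over many streams in this way is what keeps one atypical user from dominating the reported result.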