TAVIS: A Benchmark for Egocentric Active Vision and Anticipatory Gaze in Imitation Learning

Active vision -- where a policy controls its own gaze during manipulation -- has emerged as a key capability for imitation learning, with multiple independent systems demonstrating its benefits in the past year. Yet there is no shared benchmark for comparing approaches or for quantifying what active vision contributes, on which task types, and under what conditions. We introduce TAVIS, an evaluation infrastructure for active-vision imitation learning, with two complementary task suites -- TAVIS-Head (5 tasks, global search via pan/tilt necks) and TAVIS-Hands (3 tasks, local occlusion via wrist cameras) -- on two humanoid torso embodiments (GR1T2, Reachy2), built on IsaacLab. TAVIS provides three evaluation primitives: a paired headcam-vs-fixedcam protocol on identical demonstrations; GALT (Gaze-Action Lead Time), a novel metric grounded in cognitive science and HRI that quantifies anticipatory gaze in learned policies; and procedural ID/OOD splits. Baseline experiments with Diffusion Policy and $π_0$ reveal that (i) active vision generally helps, but its benefits are task-conditional rather than uniform; (ii) multi-task policies degrade sharply under controlled distribution shifts on both suites; and (iii) imitation alone yields anticipatory gaze, with median lead times comparable to the human teleoperator reference. Code, evaluation scripts, demonstrations (LeRobot v3.0; ~2200 episodes), and trained baselines are released at https://github.com/spiglerg/tavis and https://huggingface.co/tavis-benchmark.
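The abstract names GALT but does not spell out how it is computed. As a rough illustration only, the sketch below assumes that each manipulation event comes with two timestamps, the moment gaze first reaches the target and the moment the hand first acts on it, and takes the lead time as their difference. The function names, the event representation, and the sign convention are hypothetical and are not taken from the TAVIS code.

```python
from statistics import median

def lead_times(gaze_onsets, action_onsets):
    """Per-event lead time in seconds (hypothetical definition).

    Assumes gaze_onsets[i] is when gaze first reaches target i and
    action_onsets[i] is when the hand first acts on target i.
    Positive values mean the gaze arrived before the action.
    """
    if len(gaze_onsets) != len(action_onsets):
        raise ValueError("need exactly one gaze onset per action onset")
    return [a - g for g, a in zip(gaze_onsets, action_onsets)]

def median_lead_time(gaze_onsets, action_onsets):
    """Median lead time over all events, the summary statistic the abstract reports."""
    return median(lead_times(gaze_onsets, action_onsets))

# Made-up timestamps (seconds) for three events in one episode.
gaze = [1.0, 4.0, 9.0]
action = [1.5, 5.0, 9.25]
print(median_lead_time(gaze, action))  # 0.5
```

Under this assumed convention, a policy whose median lead time is positive and close to that of the teleoperator would count as showing anticipatory gaze, while a value near zero or negative would indicate gaze that merely follows the hand.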
