Grounding textual expressions to scene objects from first-person views is a demanding capability for developing agents that are aware of their surroundings and act on intuitive text instructions. This capability is necessary for glasses-type devices and autonomous robots to localize referred objects in the real world. In conventional image-based referring expression comprehension tasks, however, datasets are mostly built from web-crawled data and do not reflect the diversity of real-world scenes and objects in which textual expressions must be grounded. Recently, the massive-scale egocentric video dataset Ego4D was proposed. Ego4D covers diverse real-world scenes from around the world, including numerous indoor and outdoor situations such as shopping, cooking, walking, talking, and manufacturing. Based on the egocentric videos of Ego4D, we constructed RefEgo, a video-based referring expression comprehension dataset with broad coverage. Our dataset includes more than 12k video clips and 41 hours of video annotated for video-based referring expression comprehension. In experiments, we combine state-of-the-art 2D referring expression comprehension models with an object tracking algorithm, achieving video-wise tracking of the referred object even in difficult conditions, such as when the referred object goes out of frame in the middle of the video or when multiple similar objects appear in the video.
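The idea of pairing a per-frame 2D referring expression comprehension model with tracking can be illustrated with a minimal sketch. This is not the method used in the experiments; it swaps in a simple IoU-based association purely for illustration. The `FramePrediction` type, the `track_referred_object` function, and both thresholds are hypothetical, and the per-frame scores are assumed to come from some 2D model that is not shown here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class FramePrediction:
    """Hypothetical output of a 2D referring expression model for one candidate."""
    box: Box
    score: float  # assumed confidence that this box is the referred object


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def track_referred_object(
    frames: List[List[FramePrediction]],
    score_thresh: float = 0.5,  # hypothetical threshold, not from the paper
    iou_thresh: float = 0.3,    # hypothetical threshold, not from the paper
) -> List[Optional[Box]]:
    """Return one box per frame, or None when the object is judged absent."""
    track: List[Optional[Box]] = []
    last_box: Optional[Box] = None
    for candidates in frames:
        confident = [c for c in candidates if c.score >= score_thresh]
        if not confident:
            # No confident candidate: treat the object as out of frame.
            track.append(None)
            continue
        if last_box is not None:
            # Prefer candidates that overlap the last tracked box, which
            # helps when several similar objects are visible at once.
            overlapping = [c for c in confident if iou(c.box, last_box) >= iou_thresh]
            pool = overlapping or confident
        else:
            pool = confident
        chosen = max(pool, key=lambda c: c.score)
        track.append(chosen.box)
        last_box = chosen.box
    return track
```

In this sketch, a frame with no confident candidate yields `None`, which stands in for the out-of-frame case, and the overlap preference is one simple way to avoid switching to a similar-looking distractor that happens to score higher in a single frame.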