Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge

The ability to actively ground task instructions from an egocentric view is crucial for AI agents that accomplish tasks or assist humans virtually. One important step towards this goal is to localize and track key active objects that undergo major state changes as a consequence of human actions on, or interactions with, the environment, without being told exactly what or where to ground (e.g., localizing and tracking the `sponge` in a video given the instruction "Dip the `sponge` into the bucket."). While existing works approach this problem from a purely visual perspective, we investigate to what extent the textual modality (i.e., task instructions) and its interaction with the visual modality can be beneficial. Specifically, we propose to improve phrase grounding models' ability to localize active objects by: (1) learning the role of `objects undergoing change` and extracting them accurately from the instructions, (2) leveraging pre- and post-conditions of the objects during actions, and (3) recognizing the objects more robustly with descriptive knowledge. We leverage large language models (LLMs) to extract the aforementioned action-object knowledge, and design a per-object aggregation masking technique to effectively perform joint inference on object phrases and symbolic knowledge. We evaluate our framework on the Ego4D and Epic-Kitchens datasets. Extensive experiments demonstrate the effectiveness of the proposed framework, which yields >54% improvements in all standard metrics on the TREK-150-OPE-Det localization + tracking task, >7% improvements in all standard metrics on the TREK-150-OPE tracking task, and >3% improvements in average precision (AP) on the Ego4D SCOD task.
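
The abstract does not spell out how the per-object aggregation masking works, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a phrase grounding model has already produced a score for every (candidate box, prompt token) pair, and that each object owns a set of token spans covering its phrase plus its LLM-extracted knowledge (pre-condition, post-condition, description). All names here (`aggregate_per_object`, `best_box`, `box_token_scores`, `object_spans`) and the choice of mean pooling are hypothetical.

```python
from typing import Dict, List, Tuple


def aggregate_per_object(
    box_token_scores: List[List[float]],
    object_spans: Dict[str, List[Tuple[int, int]]],
) -> Dict[str, List[float]]:
    """Average each box's token scores over the tokens owned by each object.

    box_token_scores[b][t] is an assumed grounding score between box b and
    prompt token t. object_spans maps an object name to half-open token
    ranges (start, end) covering its phrase and its symbolic knowledge.
    """
    num_tokens = len(box_token_scores[0]) if box_token_scores else 0
    aggregated: Dict[str, List[float]] = {}
    for name, spans in object_spans.items():
        # Binary mask: 1 for tokens that belong to this object, 0 otherwise.
        mask = [0] * num_tokens
        for start, end in spans:
            for t in range(start, end):
                mask[t] = 1
        count = sum(mask)
        if count == 0:
            raise ValueError(f"object {name!r} has no tokens")
        # Mean pooling over masked tokens (an assumption; other pooling is possible).
        aggregated[name] = [
            sum(s * m for s, m in zip(row, mask)) / count
            for row in box_token_scores
        ]
    return aggregated


def best_box(scores: List[float]) -> int:
    """Return the index of the highest-scoring box."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy prompt tokens (hypothetical): 0="sponge", 1="dry", 2="wet", 3="bucket".
# "dry" and "wet" stand in for pre- and post-condition knowledge of the sponge.
scores = [
    [0.9, 0.2, 0.7, 0.1],  # box 0
    [0.1, 0.1, 0.1, 0.8],  # box 1
    [0.6, 0.7, 0.1, 0.2],  # box 2
]
spans = {"sponge": [(0, 3)], "bucket": [(3, 4)]}
per_object = aggregate_per_object(scores, spans)
sponge_box = best_box(per_object["sponge"])
bucket_box = best_box(per_object["bucket"])
```

In this toy setup the mask keeps the knowledge tokens of one object from contributing to another object's score, which is the intuition behind aggregating per object rather than over the whole prompt.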
