Egocentric interaction perception is an essential branch of research on human-environment interaction and lays the foundation for next-generation intelligent systems. However, existing egocentric interaction understanding methods cannot produce coherent textual and pixel-level responses simultaneously in answer to user queries, which limits their flexibility for varying downstream application requirements. To comprehend egocentric interactions exhaustively, this paper presents a novel task named Egocentric Interaction Reasoning and pixel Grounding (Ego-IRG). Taking an egocentric image and a query as input, Ego-IRG is the first task that resolves interactions through three crucial steps: analyzing, answering, and pixel grounding, yielding both fluent textual responses and fine-grained pixel-level masks. A further challenge is that no existing dataset meets the requirements of the Ego-IRG task. To address this limitation, this paper constructs the Ego-IRGBench dataset through extensive manual annotation; it comprises over 20k egocentric images with 1.6 million queries and corresponding multimodal responses about interactions. Moreover, we design a unified ANNEXE model that leverages multimodal large language models to generate text- and pixel-level outputs, enabling a comprehensive interpretation of egocentric interactions. Experiments on Ego-IRGBench demonstrate the effectiveness of our ANNEXE model compared with prior work.