The rapid advancement of Large Vision-Language Models (LVLMs) has demonstrated a spectrum of emergent capabilities. Nevertheless, current models focus only on the visual content of a single scenario, and their ability to associate instances across different scenes remains unexplored, even though it is essential for understanding complex visual content such as movies with multiple characters and intricate plots. Toward movie understanding, a critical first step for LVLMs is to unleash their potential for memorizing and recognizing character identities across multiple visual scenarios. To achieve this goal, we propose visual instruction tuning with ID reference and develop an ID-Aware Large Vision-Language Model, IDA-VLM. Furthermore, we introduce a novel benchmark, MM-ID, to examine LVLMs on instance ID memory and recognition across four dimensions: matching, localization, question answering, and captioning. Our findings highlight the limitations of existing LVLMs in recognizing and associating instance identities given ID references. This paper paves the way for future artificial intelligence systems to handle multi-identity visual inputs, thereby facilitating the comprehension of complex visual narratives such as movies.