Recent Vision-Language Pre-training (VLP) models have demonstrated significant advancements. Nevertheless, these models heavily rely on image-text pairs that capture only coarse and global information about an image, limiting their regional understanding ability. In this work, we introduce \textbf{RegionVLM}, a model equipped with explicit regional modeling capability, allowing it to understand user-indicated image regions. To achieve this, we design a simple yet innovative approach that requires no modifications to the model architecture or objective function. Additionally, we leverage a dataset containing a novel source of information, namely Localized Narratives, which has been overlooked in previous VLP research. Our experiments demonstrate that our single generalist model not only enables an interactive dialogue system but also achieves superior performance on various zero-shot region understanding tasks, all without compromising its global image understanding ability.