SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition

Grounded Multimodal Named Entity Recognition (GMNER) aims to extract named entities from image-text pairs and localize their visual regions, serving as a pivotal capability for various downstream applications. On open-world social media platforms, GMNER remains challenging due to the prevalence of long-tailed, rapidly evolving, and unseen entities. To handle such entities, existing approaches typically rely on either external knowledge exploration through heuristic retrieval or internal knowledge exploitation via iterative refinement in Multimodal Large Language Models (MLLMs). However, heuristic retrieval often introduces noisy or conflicting evidence that degrades precision on known entities, while purely internal exploitation is constrained by the knowledge boundaries of MLLMs and is prone to hallucination. To address these limitations, we propose SAKE, an end-to-end agentic framework that harmonizes internal knowledge exploitation and external knowledge exploration via self-aware reasoning and adaptive search tool invocation. We implement this framework with a two-stage training paradigm. First, we propose Difficulty-aware Search Tag Generation, which quantifies the model's entity-level uncertainty by sampling multiple forward passes and turns that uncertainty into explicit knowledge-gap signals. Based on these signals, we construct SAKE-SeCoT, a high-quality Chain-of-Thought dataset that equips the model with basic self-awareness and tool-use capabilities through supervised fine-tuning. Second, we employ agentic reinforcement learning with a hybrid reward function that penalizes unnecessary retrieval, enabling the model to evolve from rigid imitation of search behavior to genuine self-aware decisions about when retrieval is truly necessary. Extensive experiments on two widely used social media benchmarks demonstrate SAKE's effectiveness.
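
To make the two training signals concrete, the sketch below illustrates one plausible reading of them; it is not the authors' implementation. The abstract does not specify how uncertainty is scored or how the reward is composed, so everything here is an assumption: the function names `search_tags` and `hybrid_reward`, the use of annotated (gold) entities as the reference, the agreement `threshold`, and the linear `penalty` term are all hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple

# An entity is represented as (mention text, entity type). Hypothetical format.
Entity = Tuple[str, str]


def search_tags(
    sample_fn: Callable[[], Iterable[Entity]],
    n_samples: int,
    gold: List[Entity],
    threshold: float = 0.5,
) -> Dict[Entity, bool]:
    """Flag entities the model seems unsure about.

    Assumption: an entity counts as "difficult" when fewer than `threshold`
    of the sampled predictions recover it. `sample_fn` stands in for one
    stochastic forward pass of the model and returns its predicted entities.
    """
    hits: Counter = Counter()
    for _ in range(n_samples):
        predicted = set(sample_fn())
        for entity in gold:
            if entity in predicted:
                hits[entity] += 1
    # True means "emit a search tag for this entity" (a knowledge-gap signal).
    return {entity: hits[entity] / n_samples < threshold for entity in gold}


def hybrid_reward(
    task_score: float,
    n_searches: int,
    n_needed: int,
    penalty: float = 0.1,
) -> float:
    """Task reward minus a penalty for each search beyond those needed.

    Assumption: the penalty is linear in the number of unnecessary searches;
    the actual reward design in the paper may differ.
    """
    unnecessary = max(0, n_searches - n_needed)
    return task_score - penalty * unnecessary


if __name__ == "__main__":
    # Canned samples standing in for four stochastic model outputs.
    canned = iter([
        [("Kobe", "PER")],
        [("Kobe", "PER"), ("Lakers", "ORG")],
        [("Kobe", "PER")],
        [("Kobe", "PER")],
    ])
    gold_entities = [("Kobe", "PER"), ("Lakers", "ORG")]
    print(search_tags(lambda: next(canned), 4, gold_entities))
    print(hybrid_reward(task_score=0.8, n_searches=2, n_needed=1))
```

In this toy run the entity recovered in every sample receives no search tag, while the one recovered in only one of four samples is tagged for retrieval; the reward then subtracts a fixed amount for the single search that was not needed.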