Generative Search Engines (GSEs) leverage Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs) to integrate multi-source information and provide users with accurate and comprehensive responses. Unlike traditional search engines, which present results as ranked lists, GSEs shift users' attention from sequential browsing to content-driven subjective perception. This drives a paradigm shift in information retrieval and highlights the importance of enhancing the subjective visibility of content in generative search. In this context, Generative Search Engine Optimization (G-SEO) methods have emerged as a new research focus. With the rapid advancement of Multimodal Retrieval-Augmented Generation (MRAG) techniques, GSEs can now efficiently integrate text, images, audio, and video, producing richer responses that better satisfy complex information needs. Existing G-SEO methods, however, remain limited to text-based optimization and fail to fully exploit multimodal data. To address this gap, we propose Caption Injection, the first multimodal G-SEO approach, which extracts captions from images and injects them into textual content, integrating visual semantics to enhance subjective visibility in generative search. We systematically evaluate Caption Injection on MRAMG, a benchmark for MRAG, under both unimodal and multimodal settings. Experimental results show that Caption Injection significantly outperforms text-only G-SEO baselines under the G-EVAL metric, improving the subjective visibility of content as perceived by users and demonstrating the practical benefits of multimodal information in G-SEO.
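The core idea of Caption Injection, taking captions extracted from a document's images and injecting them into its text, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the captions have already been produced by some captioning step (not shown), and it uses plain appending as a stand-in for whatever injection strategy the method actually applies. The names `Document`, `inject_captions`, `max_captions`, and the `[Image: ...]` marker format are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical container: source text plus captions of its images."""
    text: str
    image_captions: list[str] = field(default_factory=list)


def inject_captions(doc: Document, max_captions: int = 3) -> str:
    """Return the text with up to `max_captions` distinct captions appended.

    Assumption: captions are appended after the text as bracketed lines.
    The real method may instead rewrite or fuse them into the text.
    """
    seen: set[str] = set()
    selected: list[str] = []
    for caption in doc.image_captions:
        if len(selected) >= max_captions:
            break
        # Normalize whitespace so trivially different duplicates collapse.
        cleaned = " ".join(caption.split())
        key = cleaned.lower()
        if not cleaned or key in seen:
            continue
        seen.add(key)
        selected.append(cleaned)

    if not selected:
        # Nothing to inject: leave the text unchanged.
        return doc.text

    caption_block = "\n".join(f"[Image: {c}]" for c in selected)
    return f"{doc.text.rstrip()}\n\n{caption_block}"


if __name__ == "__main__":
    example = Document(
        text="Red pandas live in the Himalayas.",
        image_captions=["A red panda on a branch", "Bamboo forest"],
    )
    print(inject_captions(example))
```

The enriched text would then be what the generative engine retrieves in place of the original. Capping and deduplicating the captions is a design choice of this sketch, intended to keep the injected visual semantics from crowding out the source text.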