The advent of LLMs has given rise to generative search, a new search paradigm in which LLMs retrieve information from the web related to a query and synthesize it into a single, coherent response. This paradigm differs fundamentally from traditional web search, where results are returned as a ranked list of independent web pages. In this paper, we ask: along what dimensions does generative search differ from traditional search? We conduct a systematic comparison between Google organic search and five generative search systems from three providers: Google, OpenAI, and Perplexity. Our analysis reveals substantial variation among engines in their reliance on internal versus external knowledge, in source diversity, and in stability. While generative systems often achieve topical coverage comparable to traditional search, they do so using markedly different retrieval footprints and synthesis strategies. We further show that the outputs of generative search can vary over time and across repeated executions, raising new challenges for robustness. Our findings demonstrate that generative search introduces new dimensions that are not captured by existing evaluation paradigms, motivating the development of evaluations that explicitly account for retrieval behavior, synthesis, and stability in generative search systems.