Generative search engines directly generate responses to user queries, along with in-line citations. A prerequisite trait of a trustworthy generative search engine is verifiability, i.e., systems should cite comprehensively (high citation recall: every statement is fully supported by citations) and accurately (high citation precision: every citation supports its associated statement). We conduct a human evaluation to audit four popular generative search engines (Bing Chat, NeevaAI, perplexity.ai, and YouChat) across a diverse set of queries from a variety of sources (e.g., historical Google user queries and dynamically collected open-ended questions on Reddit). We find that responses from existing generative search engines are fluent and appear informative, but frequently contain unsupported statements and inaccurate citations: on average, only 51.5% of generated sentences are fully supported by citations, and only 74.5% of citations support their associated sentence. We believe that these results are concerningly low for systems that may serve as a primary tool for information-seeking users, especially given their facade of trustworthiness. We hope that our results further motivate the development of trustworthy generative search engines and help researchers and users better understand the shortcomings of existing commercial systems.
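To make the two verifiability metrics concrete, the following is a minimal sketch of how citation recall and citation precision could be computed from per-sentence human judgments. It is a simplification under stated assumptions, not the evaluation protocol used in the audit: the `AnnotatedSentence` record, its field names, and the toy annotations are all hypothetical, and every generated sentence is assumed to count toward recall.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AnnotatedSentence:
    """One generated sentence with hypothetical human judgments (assumed schema)."""
    # True if the sentence's citations, taken together, fully support it.
    fully_supported: bool
    # One flag per citation: True if that citation supports the sentence.
    citation_supports: List[bool] = field(default_factory=list)


def citation_recall(sentences: List[AnnotatedSentence]) -> float:
    """Fraction of sentences that are fully supported by their citations."""
    if not sentences:
        return 0.0
    return sum(s.fully_supported for s in sentences) / len(sentences)


def citation_precision(sentences: List[AnnotatedSentence]) -> float:
    """Fraction of citations that support their associated sentence."""
    flags = [flag for s in sentences for flag in s.citation_supports]
    if not flags:
        return 0.0
    return sum(flags) / len(flags)


# Toy annotations, invented purely for illustration.
example = [
    AnnotatedSentence(True, [True, True]),
    AnnotatedSentence(False, [False]),
    AnnotatedSentence(True, [True]),
    AnnotatedSentence(False, []),  # an uncited, unsupported sentence
]

print(citation_recall(example))     # 2 of 4 sentences are fully supported
print(citation_precision(example))  # 3 of 4 citations support their sentence
```

Note that an uncited sentence lowers recall without affecting precision, which is why the two metrics are reported separately.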