Generative search engines directly generate responses to user queries, along with in-line citations. A prerequisite trait of a trustworthy generative search engine is verifiability, i.e., systems should cite comprehensively (high citation recall: all statements are fully supported by citations) and accurately (high citation precision: every citation supports its associated statement). We conduct a human evaluation to audit four popular generative search engines (Bing Chat, NeevaAI, perplexity.ai, and YouChat) across a diverse set of queries from a variety of sources, such as historical Google user queries and dynamically collected open-ended questions on Reddit. We find that responses from existing generative search engines are fluent and appear informative, but frequently contain unsupported statements and inaccurate citations: on average, a mere 51.5% of generated sentences are fully supported by citations and only 74.5% of citations support their associated sentence. We believe that these results are concerningly low for systems that may serve as a primary tool for information-seeking users, especially given their facade of trustworthiness. We hope that our results further motivate the development of trustworthy generative search engines and help researchers and users better understand the shortcomings of existing commercial systems.
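To make the two metrics concrete, the following is a minimal sketch of how citation recall and citation precision could be computed from per-sentence human judgments, following the simplified definitions given in the abstract. The `Sentence` class, its field names, and the toy annotations are hypothetical and exist only for illustration; they are not the annotation schema or the data used in the study.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sentence:
    """One generated sentence with hypothetical human judgments."""
    text: str
    # Annotator judgment: do the sentence's citations, taken together,
    # fully support it? Sentences without citations count as unsupported.
    fully_supported: bool
    # One judgment per attached citation: does that citation support
    # this sentence?
    citation_supports: List[bool] = field(default_factory=list)


def citation_recall(sentences: List[Sentence]) -> float:
    """Fraction of sentences that are fully supported by their citations."""
    if not sentences:
        return 0.0
    return sum(s.fully_supported for s in sentences) / len(sentences)


def citation_precision(sentences: List[Sentence]) -> float:
    """Fraction of citations that support their associated sentence."""
    judgments = [j for s in sentences for j in s.citation_supports]
    if not judgments:
        return 0.0
    return sum(judgments) / len(judgments)


# Toy response with made-up annotations (assumption, not study data).
response = [
    Sentence("Claim A.", True, [True, True]),
    Sentence("Claim B.", False, []),        # no citation at all
    Sentence("Claim C.", True, [True]),
    Sentence("Claim D.", False, [False]),   # cited, but citation is wrong
]

print(f"citation recall:    {citation_recall(response):.2f}")
print(f"citation precision: {citation_precision(response):.2f}")
```

In this toy response, two of four sentences are fully supported and three of four citations support their sentence. The example also shows why the two metrics can diverge: an uncited sentence lowers recall without affecting precision, while an incorrect citation lowers precision.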