Retrieval-augmented language models are being increasingly tasked with subjective, contentious, and conflicting queries such as "is aspartame linked to cancer". To resolve these ambiguous queries, one must search through a large range of websites and consider "which, if any, of this evidence do I find convincing?". In this work, we study how LLMs answer this question. In particular, we construct ConflictingQA, a dataset that pairs controversial queries with a series of real-world evidence documents that contain different facts (e.g., quantitative results), argument styles (e.g., appeals to authority), and answers (Yes or No). We use this dataset to perform sensitivity and counterfactual analyses to explore which text features most affect LLM predictions. Overall, we find that current models rely heavily on the relevance of a website to the query, while largely ignoring stylistic features that humans find important such as whether a text contains scientific references or is written with a neutral tone. Taken together, these results highlight the importance of RAG corpus quality (e.g., the need to filter misinformation), and possibly even a shift in how LLMs are trained to better align with human judgements.