Large language models (LLMs) are increasingly used in scholarly question-answering (QA) systems to help researchers synthesize vast bodies of literature. However, these systems often produce subtle errors (e.g., unsupported claims, errors of omission), and current provenance mechanisms such as source citations are not granular enough for the rigorous verification that the scholarly domain requires. To address this, we introduce PaperTrail, a novel interface that decomposes both LLM answers and source documents into discrete claims and evidence and maps them to one another, revealing supported assertions, unsupported claims, and information omitted from the source texts. We evaluated PaperTrail in a within-subjects study in which 26 researchers performed two scholarly editing tasks, once with PaperTrail and once with a baseline interface. Our results show that PaperTrail significantly lowered participants' trust in the generated content compared to the baseline. However, this increased caution did not translate into behavioral change: participants continued to rely on LLM-generated scholarly edits to avoid a cognitively burdensome task. We discuss the value of claim-evidence matching for understanding LLM trustworthiness in scholarly settings, and present design implications for cognition-friendly communication of provenance information.