As Large Language Models become integral to decision-making, optimism about their power is tempered by concern over their errors. Users may over-rely on LLM advice that is confidently stated but wrong, or under-rely on it due to mistrust. Reliance interventions have been developed to help LLM users, but they have not been rigorously evaluated for their effect on appropriate reliance. We benchmark three such interventions in a randomized online experiment with 400 participants attempting two challenging tasks: LSAT logical reasoning and image-based numerical estimation. For each question, participants first answered independently, then received LLM advice modified by one of the three reliance interventions and answered the question again. Our findings indicate that while the interventions reduce over-reliance, they generally fail to improve appropriate reliance. Furthermore, in certain contexts participants became more confident after making incorrect reliance decisions, demonstrating poor calibration. Based on these findings, we discuss implications for designing effective reliance interventions in human-LLM collaboration.