LLMs in social services: How does chatbot accuracy affect human accuracy?

Social service programs like the Supplemental Nutrition Assistance Program (SNAP, or food stamps) have eligibility rules that can be challenging to understand. For nonprofit caseworkers, who often support clients in navigating a dozen or more complex programs, LLM-based chatbots may offer a means to provide better, faster help to clients whose situations are less common. In this paper, we measure the potential effects of LLM-based chatbot suggestions on caseworkers' ability to provide accurate guidance. We first created a 770-question multiple-choice benchmark dataset of difficult but realistic questions that a caseworker might receive. Next, using these benchmark questions and corresponding expert-verified answers, we conducted a randomized experiment with caseworkers recruited from nonprofit outreach organizations in Los Angeles. Caseworkers in the control condition did not see chatbot suggestions and had a mean accuracy of 49%. Caseworkers in the treatment condition saw chatbot suggestions that we artificially varied to range in aggregate accuracy from low (53%) to high (100%). Caseworker performance improves significantly as chatbot quality improves: high-quality chatbots (96-100% accurate) improved caseworker accuracy by 27 percentage points. At the question level, incorrect chatbot suggestions substantially reduce caseworker accuracy, with a two-thirds reduction on easy questions, where the control group (without chatbot suggestions) performed best. Finally, improvements in caseworker accuracy level off as chatbot accuracy increases, a phenomenon that we call the "AI underreliance plateau," which is a concern for real-world deployment and highlights the importance of evaluating human-in-the-loop tools with their users.
