Building Trust in Mental Health Chatbots: Safety Metrics and LLM-Based Evaluation Tools

Jung In Park,Mahyar Abbasian,Iman Azimi,Dawn T. Bounds,Angela Jun,Jaesu Han,Robert M. McCarron,Jessica Borelli,Parmida Safavi,Sanaz Mirbaha,Jia Li,Mona Mahmoudi,Carmen Wiedenhoeft,Amir M. Rahmani

Objective: This study aims to develop and validate an evaluation framework to ensure the safety and reliability of mental health chatbots, which are increasingly popular due to their accessibility, human-like interactions, and context-aware support. Materials and Methods: We created an evaluation framework with 100 benchmark questions and ideal responses, and five guideline questions for chatbot responses. This framework, validated by mental health experts, was tested on a GPT-3.5-turbo-based chatbot. Automated evaluation methods explored included large language model (LLM)-based scoring, an agentic approach using real-time data, and embedding models to compare chatbot responses against ground truth standards. Results: The results highlight the importance of guidelines and ground truth for improving LLM evaluation accuracy. The agentic method, dynamically accessing reliable information, demonstrated the best alignment with human assessments. Adherence to a standardized, expert-validated framework significantly enhanced chatbot response safety and reliability. Discussion: Our findings emphasize the need for comprehensive, expert-tailored safety evaluation metrics for mental health chatbots. While LLMs have significant potential, careful implementation is necessary to mitigate risks. The superior performance of the agentic approach underscores the importance of real-time data access in enhancing chatbot reliability. Conclusion: The study validated an evaluation framework for mental health chatbots, proving its effectiveness in improving safety and reliability. Future work should extend evaluations to accuracy, bias, empathy, and privacy to ensure holistic assessment and responsible integration into healthcare. Standardized evaluations will build trust among users and professionals, facilitating broader adoption and improved mental health support through technology.

翻译：目的：本研究旨在开发并验证一个评估框架，以确保心理健康聊天机器人的安全性与可靠性。这类聊天机器人因其可及性、类人交互和情境感知支持而日益普及。材料与方法：我们创建了一个包含100个基准问题及理想答案、以及五个用于指导聊天机器人回答的准则问题的评估框架。该框架经心理健康专家验证，并在一个基于GPT-3.5-turbo的聊天机器人上进行了测试。探索的自动化评估方法包括基于大语言模型（LLM）的评分、利用实时数据的智能体方法，以及通过嵌入模型将聊天机器人回答与真实标准进行比对。结果：结果表明，准则和真实标准对于提高LLM评估准确性至关重要。能够动态获取可靠信息的智能体方法，显示出与人工评估最佳的一致性。遵循一个经过专家验证的标准化框架，显著提升了聊天机器人回答的安全性与可靠性。讨论：我们的发现强调，需要为心理健康聊天机器人制定全面、由专家定制的安全评估指标。尽管大语言模型潜力巨大，但仍需谨慎实施以降低风险。智能体方法的优越性能凸显了实时数据访问对于提升聊天机器人可靠性的重要性。结论：本研究验证了一个针对心理健康聊天机器人的评估框架，证明了其在提升安全性与可靠性方面的有效性。未来的工作应将评估范围扩展至准确性、偏见、共情和隐私等方面，以确保全面评估并负责任地将其整合到医疗保健中。标准化评估将有助于在用户和专业人士间建立信任，从而推动更广泛的应用，并通过技术改善心理健康支持。