Response Generation for Cognitive Behavioral Therapy with Large Language Models: Comparative Study with Socratic Questioning

Dialogue systems controlled by predefined or rule-based scenarios derived from counseling techniques, such as cognitive behavioral therapy (CBT), play an important role in mental health apps. Although responses in this setting must be responsible, using newly emerging large language models (LLMs) to generate contextually relevant utterances could plausibly enhance these apps. In this study, we construct dialogue modules based on a CBT scenario centered on conventional Socratic questioning, using two kinds of LLMs: a Transformer-based dialogue model further trained on a social media empathetic counseling dataset provided by Osaka Prefecture (OsakaED), and GPT-4, a state-of-the-art LLM created by OpenAI. By comparing systems that use LLM-generated responses with those that do not, we investigate the impact of generated responses on subjective evaluations such as mood change, cognitive change, and dialogue quality (e.g., empathy). No notable improvements are observed with the OsakaED model. With GPT-4, the amount of mood change, empathy, and other dialogue qualities improve significantly. These results suggest that GPT-4 possesses a high counseling ability. However, they also indicate that a dialogue model trained on a human counseling dataset does not necessarily yield better outcomes than scenario-based dialogues. Although presenting LLM-generated responses, including those of GPT-4, directly to users in real-life mental health care services may raise ethical issues, human professionals can still use LLMs in advance to produce example responses or response templates for systems that rely on rules, scenarios, or example responses.
