Recent incidents involving LLMs used for mental-health support reveal a critical evaluation gap: surface-level safety scores do not capture how models behave across realistic, emotionally sensitive interactions over time. Existing benchmarks measure knowledge, safety, or static response quality, but miss whether LLM interactions help users keep reflecting, coping, and making decisions themselves. We formalize this missing dimension as COGNITIVE ATROPHY, a process-level behavioural measure in AI-mediated mental-health support distinct from safety and helpfulness. To measure it, we introduce COGNITIVE ATROPHY BENCH, a clinically grounded benchmark built from 1,576 fully human-generated counseling conversations, 15,680 turns, and 42,230 responses from five LLMs. Three clinical and neuropsychology experts developed a 20-attribute schema spanning user context, response behaviour, and global risk flags; six trained clinical reviewers applied it with span-grounded evidence, producing 5,324 reviewer judgments. We further introduce the User-Input Risk Index (UIRI), the Cognitive Atrophy Risk Index (ARI), and trajectory summaries. Across the five LLMs, models show consistently moderate-to-high levels of atrophy-aligned behaviour in both single-turn and multi-turn settings. While models generally respond to overt safety cues, they adapt less reliably when users seek solutions or decisions. The dominant recurring patterns are directive advice, problem-solving, recommendation responses, topic shifts, and forms of validation that may reinforce dependence rather than reflection. Our work makes COGNITIVE ATROPHY measurable and provides a foundation for auditing model behaviour in sensitive LLM conversations.