As LLMs are increasingly integrated into clinical workflows, their tendency toward sycophancy (prioritizing user agreement over factual accuracy) poses significant risks to patient safety. Whereas existing evaluations often rely on subjective datasets, we introduce a robust framework grounded in medical multiple-choice question answering (MCQA) with verifiable ground truths. We propose the Adjusted Sycophancy Score, a novel metric that isolates alignment bias by accounting for stochastic model instability, or "confusability". Through an extensive scaling analysis of the Qwen-3 and Llama-3 model families, we identify a clear scaling trajectory for sycophancy resilience. Furthermore, we reveal a counter-intuitive vulnerability in reasoning-optimized "Thinking" models: despite their high vanilla accuracy, their internal reasoning traces frequently rationalize incorrect user suggestions under authoritative pressure. Our results across frontier models suggest that benchmark performance is not a proxy for clinical reliability, and that simpler reasoning structures may offer superior robustness against expert-driven sycophancy.