Artificial intelligence has significantly impacted medical applications, particularly with the advent of Medical Large Vision Language Models (Med-LVLMs), sparking optimism for the future of automated and personalized healthcare. However, the trustworthiness of Med-LVLMs remains unverified, posing significant risks for future model deployment. In this paper, we introduce CARES, a benchmark aimed at comprehensively evaluating the trustworthiness of Med-LVLMs across the medical domain. We assess the trustworthiness of Med-LVLMs across five dimensions: trustfulness, fairness, safety, privacy, and robustness. CARES comprises about 41K question-answer pairs in both closed-ended and open-ended formats, covering 16 medical image modalities and 27 anatomical regions. Our analysis reveals that the models consistently exhibit trustworthiness concerns, often producing factual inaccuracies and failing to maintain fairness across demographic groups. Furthermore, they are vulnerable to attacks and demonstrate a lack of privacy awareness. We publicly release our benchmark and code at https://cares-ai.github.io/.