This study investigates GPT-4's assessment of its own performance in healthcare applications. A simple prompting technique was used to present the LLM with questions taken from the United States Medical Licensing Examination (USMLE), and the model was tasked with reporting a confidence score both before and after answering each question. The questions were divided into two groups: questions with post-question feedback (WF) and questions with no feedback (NF). The model was asked to provide absolute and relative confidence scores before and after each question. The experimental findings were analyzed with statistical tools to study the variability of confidence in the WF and NF groups. Additionally, a sequential analysis was conducted to observe how performance varied across the WF and NF groups. Results indicate that feedback influences relative confidence but does not consistently increase or decrease it. Understanding the performance of LLMs is paramount to exploring their utility in sensitive areas such as healthcare. This study contributes to the ongoing discourse on the reliability of AI, particularly of LLMs like GPT-4, within healthcare, offering insights into how feedback mechanisms might be optimized to enhance AI-assisted medical education and decision support.
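The elicitation protocol described above can be outlined as a simple loop: ask the model for its confidence before a question, collect its answer, optionally reveal correctness (the WF condition), and ask for confidence again. The sketch below is illustrative only, not the authors' implementation; `ask_model`, the prompt wording, and the record fields are hypothetical stand-ins for whatever LLM interface and phrasing were actually used.

```python
def run_protocol(questions, ask_model, give_feedback):
    """Hypothetical sketch of the pre/post confidence-elicitation loop.

    questions: list of dicts with "stem" (question text) and "answer" (key).
    ask_model: callable(prompt: str) -> str, a stand-in for an LLM call.
    give_feedback: True for the WF group, False for the NF group.
    """
    records = []
    for q in questions:
        # Elicit confidence before the model sees it must answer.
        pre = ask_model(
            f"Rate your confidence (0-100) that you can answer:\n{q['stem']}"
        )
        answer = ask_model(q["stem"])
        correct = answer.strip().lower() == q["answer"].strip().lower()
        if give_feedback:
            # WF group: reveal correctness before the post-question rating.
            ask_model("Your answer was " + ("correct." if correct else "incorrect."))
        # Elicit confidence again after answering (and feedback, if any).
        post = ask_model("Rate your confidence (0-100) in your last answer.")
        records.append({"pre": pre, "post": post, "correct": correct})
    return records
```

Running the same question set through the loop twice, once with `give_feedback=True` and once with `False`, yields the paired WF/NF records that the statistical and sequential analyses would then compare.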