Although large language models (LLMs) often produce impressive outputs, it remains unclear how they perform in real-world scenarios that require strong reasoning skills and expert domain knowledge. We set out to investigate whether closed- and open-source models (GPT-3.5, Llama-2, etc.) can be applied to answer and reason about difficult real-world questions. We focus on three popular medical benchmarks (MedQA-USMLE, MedMCQA, and PubMedQA) and multiple prompting scenarios: Chain-of-Thought (CoT, i.e., think step-by-step), few-shot prompting, and retrieval augmentation. Based on expert annotation of the generated CoTs, we found that InstructGPT can often read, reason, and recall expert knowledge. Finally, by leveraging advances in prompt engineering (few-shot and ensemble methods), we demonstrated that GPT-3.5 not only yields calibrated predictive distributions but also reaches the passing score on all three datasets: MedQA-USMLE (60.2%), MedMCQA (62.7%), and PubMedQA (78.2%). Open-source models are closing the gap: Llama-2 70B also passed the MedQA-USMLE with 62.5% accuracy.