Patients are increasingly using large language models (LLMs) to seek answers to their healthcare-related questions. However, benchmarking efforts for LLM question answering often focus on medical exam questions, which differ significantly in style and content from the questions patients raise in real life. To bridge this gap, we sourced data from Google's People Also Ask feature by querying the 200 most prescribed medications in the United States, curating a dataset of medical questions people commonly ask. A considerable portion of the collected questions contains incorrect assumptions and dangerous intentions. We demonstrate that the emergence of these corrupted questions is not uniformly random and depends heavily on the degree of incorrectness in the history of questions that led to their appearance. Current LLMs that perform strongly on other benchmarks struggle to identify incorrect assumptions in such everyday questions.
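As a rough illustration of the evaluation setup described above, the minimal sketch below prompts an LLM to flag incorrect assumptions in a patient-style question before answering. The model name, prompt wording, and example question are assumptions chosen for illustration; they are not the paper's actual protocol or data.

```python
# Minimal sketch of checking whether an LLM flags an incorrect assumption.
# Assumptions: model name, prompt wording, and the example question are
# illustrative only and not taken from the paper's protocol or dataset.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are reviewing a patient's medication question. "
    "If the question contains an incorrect assumption or a dangerous intention, "
    "point it out explicitly before answering; otherwise answer normally."
)

# Hypothetical patient question containing an incorrect assumption
# (that the medication cures, rather than manages, the condition).
question = (
    "Since lisinopril cures high blood pressure, can I stop taking it "
    "once my readings are normal?"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ],
)

# Inspect whether the reply identifies the flawed premise.
print(response.choices[0].message.content)
```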