Introduction: The COVID-19 pandemic highlighted the importance of making epidemiological data and scientific insights easily accessible and explorable for public health agencies, the general public, and researchers. State-of-the-art approaches for sharing data and insights included regularly updated reports and web dashboards. However, these approaches face a trade-off between simplicity and flexibility of data exploration. With the capabilities of recent large language models (LLMs) such as GPT-4, this trade-off can be overcome. Results: We developed the chatbot "GenSpectrum Chat" (https://cov-spectrum.org/chat), which uses GPT-4 as the underlying LLM to explore SARS-CoV-2 genomic sequencing data. Out of 500 inputs from real-world users, the chatbot provided a correct answer for 453 prompts (90.6%), an incorrect answer for 13 prompts (2.6%), and no answer for 34 prompts (6.8%) even though the question was within scope. We also tested the chatbot with inputs in 10 different languages; despite being provided solely with English instructions and examples, it successfully processed prompts in all tested languages. Conclusion: LLMs enable new ways of interacting with information systems. In the field of public health, GenSpectrum Chat can facilitate the analysis of real-time pathogen genomic data. With our chatbot supporting interactive exploration in different languages, we envision quick and direct access to the latest evidence for policymakers around the world.