We assessed the performance of the commercial Large Language Models (LLMs) GPT-3.5-Turbo and GPT-4 on tasks from the 2023 BioASQ challenge. In Task 11b Phase B, which focuses on answer generation, both models were competitive with the leading systems. Remarkably, they achieved this with simple zero-shot learning, grounded with relevant snippets. Even without relevant snippets, their performance was decent, though not on par with the best systems. Interestingly, the older and cheaper GPT-3.5-Turbo was able to compete with GPT-4 in the grounded Q&A setting on factoid and list answers. In Task 11b Phase A, which focuses on retrieval, query expansion through zero-shot learning improved performance, but the models still fell short of other systems. The code needed to rerun these experiments is available on GitHub.
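To make the two Phase B settings concrete, the following is a minimal sketch of how a zero-shot prompt might be assembled with and without grounding snippets. The function name `build_zero_shot_prompt` and the instruction wording are assumptions made for illustration; they are not taken from the released code, and the prompts actually used in the experiments may differ.

```python
from typing import List, Optional


def build_zero_shot_prompt(question: str, snippets: Optional[List[str]] = None) -> str:
    """Assemble a zero-shot prompt, optionally grounded with snippets.

    Hypothetical wording for illustration only; not the prompt used
    in the experiments described above.
    """
    parts = []
    if snippets:
        # Grounded setting: number each snippet so they stay distinguishable.
        numbered = "\n".join(
            f"[{i}] {s.strip()}" for i, s in enumerate(snippets, start=1)
        )
        parts.append("Answer the question using only the snippets below.")
        parts.append("Snippets:\n" + numbered)
    else:
        # Ungrounded setting: the model has no supporting context.
        parts.append("Answer the question from your own knowledge.")
    parts.append("Question: " + question.strip())
    parts.append("Answer:")
    return "\n\n".join(parts)
```

No in-context examples are included in either branch, which is what makes the prompt zero-shot; the only difference between the two settings is whether the snippets are supplied as context.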