covLLM: Large Language Models for COVID-19 Biomedical Literature

The COVID-19 pandemic led to 1.1 million deaths in the United States, despite an explosion of coronavirus research. New findings are slow to translate into clinical interventions, leading to poorer patient outcomes and unnecessary deaths. One reason is that clinicians, overwhelmed by patients, struggle to keep pace with the rate of new coronavirus literature. A potential solution is a tool for evaluating coronavirus literature built on large language models (LLMs), neural networks deployed for natural language processing. LLMs can summarize text and extract user-specified information. The growing availability and capability of LLMs, together with pre-processed coronavirus literature databases, provide an opportunity to assist clinicians through a coronavirus-literature-specific LLM (covLLM), a tool that takes a research article and a user query as input and returns an answer. Using the COVID-19 Open Research Dataset (CORD-19), we produced two datasets: (1) synCovid, which combines handwritten prompts with synthetic prompts generated using OpenAI, and (2) real abstracts, which contains abstract and title pairs. Starting from LLaMA 7B as the baseline model, we trained three covLLM models on (1) the Alpaca and synCovid datasets, (2) the synCovid dataset alone, and (3) the synCovid and real abstracts datasets. These models were evaluated by two human evaluators and by ChatGPT. The results show that covLLM trained on the synCovid and real abstracts datasets performs competitively with ChatGPT and outperforms covLLM trained primarily on the Alpaca dataset.

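As a rough illustration of the "real abstracts" dataset described above, the sketch below turns abstract and title pairs into instruction-style training records. It is a minimal sketch under stated assumptions, not the authors' pipeline: the `instruction`/`input`/`output` keys follow the public Alpaca record layout, while the instruction wording, the function names `make_title_record` and `build_dataset`, and the `title`/`abstract` keys on the input dictionaries are hypothetical choices made for this example.

```python
import json

# Hypothetical instruction text; the wording used to train covLLM is not given in the abstract.
TITLE_INSTRUCTION = "Write a title for the following abstract."


def make_title_record(title: str, abstract: str) -> dict:
    """Build one Alpaca-style record from an abstract/title pair."""
    return {
        "instruction": TITLE_INSTRUCTION,
        "input": abstract.strip(),
        "output": title.strip(),
    }


def build_dataset(papers: list) -> list:
    """Keep only papers that have both a non-empty title and abstract."""
    records = []
    for paper in papers:
        # Assumed input keys; missing or None values are treated as empty.
        title = (paper.get("title") or "").strip()
        abstract = (paper.get("abstract") or "").strip()
        if title and abstract:
            records.append(make_title_record(title, abstract))
    return records


if __name__ == "__main__":
    # Placeholder rows for illustration only, not real CORD-19 entries.
    sample_papers = [
        {"title": "Example title A", "abstract": "  Example abstract A.  "},
        {"title": "Example title B", "abstract": None},  # dropped: no abstract
    ]
    print(json.dumps(build_dataset(sample_papers), indent=2))
```

Records in this shape can be concatenated with other instruction-style data before fine-tuning, which is one way the dataset combinations listed above could be assembled.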