Aloe: A Family of Fine-tuned Open Healthcare LLMs

Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Jordi Bayarri-Planas, Adrian Tormos, Daniel Hinjos, Pablo Bernabeu-Perez, Anna Arias-Duart, Pablo Agustin Martin-Torres, Lucia Urcelay-Ganzabal, Marta Gonzalez-Mallo, Sergio Alvarez-Napagao, Eduard Ayguadé-Parra, Ulises Cortés, Dario Garcia-Gasulla

Source: arXiv. Includes five appendices.

As the capabilities of Large Language Models (LLMs) in healthcare and medicine continue to advance, there is a growing need for competitive open-source models that can safeguard the public interest. With the increasing availability of highly competitive open base models, the impact of continued pre-training is increasingly uncertain. In this work, we explore the role of instruction tuning, model merging, alignment, red teaming, and advanced inference schemes as means to improve current open models. To that end, we introduce the Aloe family, a set of open medical LLMs that are highly competitive within their scale range. Aloe models are trained on the current best base models (Mistral, LLaMA 3), using a new custom dataset that combines public data sources improved with synthetic Chain of Thought (CoT). Aloe models undergo an alignment phase using Direct Preference Optimization, making them among the first policy-aligned open healthcare LLMs and setting a new standard for ethical performance in healthcare LLMs. Model evaluation expands to include various bias and toxicity datasets, a dedicated red teaming effort, and a much-needed risk assessment for healthcare LLMs. Finally, to explore the limits of current LLMs in inference, we study several advanced prompt engineering strategies to boost performance across benchmarks, yielding state-of-the-art results for open healthcare 7B LLMs, unprecedented at this scale.

