As the capabilities of Large Language Models (LLMs) in healthcare and medicine continue to advance, there is a growing need for competitive open-source models that can safeguard the public interest. With the increasing availability of highly competitive open base models, the impact of continued pre-training is increasingly uncertain. In this work, we explore the role of instruction tuning, model merging, alignment, red teaming, and advanced inference schemes as means to improve current open models. To that end, we introduce the Aloe family, a set of open medical LLMs that are highly competitive within their scale range. Aloe models are trained on the current best base models (Mistral, LLaMA 3), using a new custom dataset that combines public data sources improved with synthetic Chain of Thought (CoT). Aloe models then undergo an alignment phase using Direct Preference Optimization, making them among the first policy-aligned open healthcare LLMs and setting a new standard for ethical performance in healthcare LLMs. Model evaluation expands to include various bias and toxicity datasets, a dedicated red teaming effort, and a much-needed risk assessment for healthcare LLMs. Finally, to explore the limits of current LLMs in inference, we study several advanced prompt engineering strategies to boost performance across benchmarks, yielding state-of-the-art results for open healthcare 7B LLMs, unprecedented at this scale.