The quality and capabilities of large language models cannot currently be fully assessed with automated benchmark evaluations. Instead, human evaluations that extend traditional qualitative techniques from the natural language generation literature are required. One recent best practice consists of using A/B-testing frameworks, which capture the preferences of human evaluators for specific models. In this paper we describe a human evaluation experiment focused on the biomedical domain (health, biology, chemistry/pharmacology) carried out at Elsevier. In it, a large but not massive (8.8B parameter) decoder-only foundational transformer trained on a relatively small (135B tokens) but highly curated collection of Elsevier datasets is compared against OpenAI's GPT-3.5-turbo and Meta's foundational 7B parameter Llama 2 model on multiple criteria. The results indicate, even though inter-rater reliability (IRR) scores were generally low, a preference for GPT-3.5-turbo, and hence for models that possess conversational abilities, are very large, and were trained on very large datasets. At the same time, the results indicate that for less massive models, training on smaller but well-curated datasets can potentially give rise to viable alternatives in the biomedical domain.