The quality and capabilities of large language models cannot currently be fully assessed with automated benchmark evaluations. Instead, human evaluations that extend traditional qualitative techniques from the natural language generation literature are required. One recent best practice consists of using A/B-testing frameworks, which capture the preferences of human evaluators for specific models. In this paper we describe a human evaluation experiment focused on the biomedical domain (health, biology, chemistry/pharmacology) carried out at Elsevier. In it, a large but not massive (8.8B parameter) decoder-only foundational transformer trained on a relatively small (135B tokens) but highly curated collection of Elsevier datasets is compared against OpenAI's GPT-3.5-turbo and Meta's foundational 7B parameter Llama 2 model on multiple criteria. The results indicate, even though inter-rater reliability (IRR) scores were generally low, a preference for GPT-3.5-turbo, and hence for models that possess conversational abilities, are very large, and were trained on very large datasets. At the same time, the results indicate that for less massive models, training on smaller but well-curated datasets can potentially give rise to viable alternatives in the biomedical domain.