With the release of ChatGPT and other large language models (LLMs), the discussion about the intelligence, possibilities, and risks of current and future models has received considerable attention. This discussion has included much-debated scenarios about the imminent rise of so-called "super-human" AI, i.e., AI systems that are orders of magnitude smarter than humans. In the spirit of Alan Turing, there is little doubt that current state-of-the-art language models already pass his famous test. Moreover, current models outperform humans in several benchmark tests, and publicly available LLMs have already become versatile companions that connect everyday life, industry, and science. Despite their impressive capabilities, LLMs sometimes fail completely at tasks that are considered trivial for humans. In other cases, the trustworthiness of LLMs is far more elusive and difficult to evaluate. Taking academia as an example, language models are capable of writing convincing research articles on a given topic with only little input. Yet the lack of factual consistency and the persistence of hallucinations in AI-generated text have led many scientific journals to restrict AI-based content. In view of these observations, the question arises as to whether the same metrics that apply to human intelligence can also be applied to computational methods, and this question has been discussed extensively. In fact, the choice of metrics has already been shown to dramatically influence assessments of potential intelligence emergence. Here, we argue that the intelligence of LLMs should not be assessed solely by task-specific statistical metrics, but separately in terms of qualitative and quantitative measures.