With the release of ChatGPT and other large language models (LLMs), the discussion about the intelligence, possibilities, and risks of current and future models has received considerable attention. This discussion has included much-debated scenarios about the imminent rise of so-called "super-human" AI, i.e., AI systems that are orders of magnitude smarter than humans. In the spirit of Alan Turing, there is little doubt that current state-of-the-art language models already pass his famous test. Moreover, current models outperform humans in several benchmark tests, and publicly available LLMs have already become versatile companions that connect everyday life, industry, and science. Despite their impressive capabilities, LLMs sometimes fail completely at tasks that are considered trivial for humans. In other cases, the trustworthiness of LLMs is far more elusive and difficult to evaluate. Taking academia as an example, language models are capable of writing convincing research articles on a given topic with only little input. Yet the lack of factual consistency and the persistence of hallucinations in AI-generated text have led many scientific journals to restrict AI-based content. In view of these observations, the question arises as to whether the same metrics that apply to human intelligence can also be applied to computational methods, and this question has been discussed extensively. In fact, the choice of metrics has already been shown to dramatically influence assessments of potential intelligence emergence. Here, we argue that the intelligence of LLMs should not be assessed solely by task-specific statistical metrics, but separately in terms of qualitative and quantitative measures.