We introduce an open-ended test grounded in algorithmic probability that can avoid benchmark contamination in the quantitative evaluation of frontier models with respect to their Artificial General Intelligence (AGI) and Superintelligence (ASI) claims. Unlike other tests, this test does not rely on statistical compression methods (such as GZIP or LZW), which are more closely related to Shannon entropy than to Kolmogorov complexity. The test challenges fundamental features of intelligence, such as synthesis and model creation, in the context of inverse problems (generating new knowledge from observation). We argue that metrics based on model abstraction and optimal Bayesian inference for planning can provide a robust framework for testing intelligence, including natural intelligence (human and animal), narrow AI, AGI, and ASI. Our results show no clear evidence of LLM convergence towards a defined level of intelligence, particularly AGI or ASI. We found that successive LLM versions tend to be fragile and only incrementally better, as a new version may perform worse than an older one, with progress largely driven by the size of training data. We compared these results with a hybrid neurosymbolic approach that theoretically guarantees model convergence through optimal inference grounded in the principles of algorithmic probability and Kolmogorov complexity. The method outperforms LLMs in a proof-of-concept on short binary sequences. Our findings confirm suspicions regarding the fundamental limitations of LLMs, exposing them as systems optimised to create the perception of mastery over human language. Progress among different LLM versions from the same developers was found to be inconsistent and limited, particularly in the absence of a solid symbolic counterpart.
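As a minimal illustration (not the paper's method), the Python sketch below shows why LZ-based statistical compressors such as GZIP behave as estimators of Shannon-type redundancy rather than of Kolmogorov complexity: a sequence generated by a few lines of deterministic code (hence of low algorithmic complexity) is left essentially uncompressed because it exhibits no statistical regularity the compressor can exploit, whereas a trivially repetitive sequence of equally low algorithmic complexity compresses almost completely. The choice of generator, sequence length, and compression level here are illustrative assumptions only.

```python
# Minimal sketch: statistical compressors track entropy-like redundancy,
# not algorithmic (Kolmogorov) complexity.
import random
import zlib

n = 100_000

# Algorithmically simple: fully determined by a short program and a fixed seed,
# yet statistically patternless from the compressor's point of view.
rng = random.Random(0)
pseudo_random = bytes(rng.getrandbits(8) for _ in range(n))

# Also algorithmically simple, but statistically redundant (long repetitions),
# which LZ-style coders exploit directly.
repetitive = bytes([0, 1]) * (n // 2)

for label, data in [("pseudo-random (low K, high entropy rate)", pseudo_random),
                    ("repetitive   (low K, low entropy rate)", repetitive)]:
    ratio = len(zlib.compress(data, 9)) / len(data)
    print(f"{label}: compressed/original = {ratio:.3f}")
```

On a typical run the pseudo-random sequence stays near ratio 1.0 while the repetitive one drops close to 0, even though both are outputs of comparably short programs; this is the sense in which GZIP/LZW-style baselines measure statistical rather than algorithmic structure.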