We introduce an open-ended test grounded in algorithmic probability that can avoid benchmark contamination in the quantitative evaluation of frontier models and of their claims to Artificial General Intelligence (AGI) and Superintelligence (ASI). Unlike other tests, ours does not rely on statistical compression methods (such as GZIP or LZW), which are more closely related to Shannon entropy than to Kolmogorov complexity and cannot test beyond simple pattern matching. The test challenges aspects of AI, in particular LLMs, tied to fundamental features of intelligence such as synthesis and model creation in the context of inverse problems (generating new knowledge from observation). We argue that metrics based on model abstraction and abduction (optimal Bayesian `inference') for predictive `planning' can provide a robust framework for testing intelligence, including natural intelligence (human and animal), narrow AI, AGI, and ASI. We find that LLM versions tend to be fragile and incremental, a result of memorisation alone, with progress likely driven by the size of the training data. We compare these results with a hybrid neurosymbolic approach that theoretically guarantees universal intelligence, based on the principles of algorithmic probability and Kolmogorov complexity; in a proof of concept on short binary sequences, this method outperforms LLMs. We prove that compression is equivalent, and directly proportional, to a system's predictive power, and vice versa: if a system can better predict, it can better compress, and if it can better compress, it can better predict. Our findings strengthen the suspicion that LLMs suffer from fundamental limitations, exposing them as systems optimised to create the perception of mastery over human language.
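As a toy illustration (not the paper's method), the compression-to-prediction direction can be operationalised with any general-purpose compressor: predict the next bit as whichever continuation compresses better. The sketch below uses Python's standard zlib module (a DEFLATE compressor of the GZIP family); precisely because such compressors capture only Shannon-level statistical regularities, this predictor succeeds on simple repetitive patterns and fails beyond them, which is the limitation the proposed test is designed to expose.

```python
import zlib

def compressed_len(bits: str) -> int:
    """Byte length of the zlib-compressed bit string."""
    return len(zlib.compress(bits.encode(), level=9))

def predict_next_bit(history: str) -> str:
    """Predict the continuation whose compressed form is shorter.

    This is the compression-to-prediction direction: the bit that
    makes the sequence more compressible is the one the
    compressor-as-model assigns higher probability.
    """
    if compressed_len(history + "0") <= compressed_len(history + "1"):
        return "0"
    return "1"

# A highly regular sequence: DEFLATE's LZ77 stage captures the repetition,
# so the statistical predictor succeeds here; on sequences whose regularity
# is algorithmic rather than statistical, it does not.
print(predict_next_bit("01" * 32))  # -> "0"
```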