We introduce an open-ended test grounded in algorithmic probability that can avoid benchmark contamination in the quantitative evaluation of frontier models with respect to their Artificial General Intelligence (AGI) and Superintelligence (ASI) claims. Unlike other tests, this test does not rely on statistical compression methods (such as GZIP or LZW), which are more closely related to Shannon entropy than to Kolmogorov complexity. The test challenges fundamental features of intelligence, such as synthesis and model creation, in the context of inverse problems (generating new knowledge from observation). We argue that metrics based on model abstraction and optimal Bayesian inference for planning can provide a robust framework for testing intelligence, including natural intelligence (human and animal), narrow AI, AGI, and ASI. Our results show no clear evidence of LLM convergence towards a defined level of intelligence, particularly AGI or ASI. We found that successive LLM versions tend to be fragile and only incrementally better, as a new version may perform worse than an older one, with progress largely driven by the size of training data. We compared these results with a hybrid neurosymbolic approach that theoretically guarantees model convergence through optimal inference grounded in the principles of algorithmic probability and Kolmogorov complexity. The method outperforms LLMs in a proof-of-concept on short binary sequences. Our findings confirm suspicions regarding the fundamental limitations of LLMs, exposing them as systems optimised to create the perception of mastery over human language. Progress among different LLM versions from the same developers was found to be inconsistent and limited, particularly in the absence of a solid symbolic counterpart.
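As a minimal illustration (not the paper's method), the Python sketch below shows why LZ-based statistical compressors such as GZIP behave as estimators of Shannon-type redundancy rather than of Kolmogorov complexity: a sequence generated by a few lines of deterministic code (hence of low algorithmic complexity) is left essentially uncompressed because it exhibits no statistical regularity the compressor can exploit, whereas a trivially repetitive sequence of equally low algorithmic complexity compresses almost completely. The choice of generator, sequence length, and compression level here are illustrative assumptions only.

```python
# Minimal sketch: statistical compressors track entropy-like redundancy,
# not algorithmic (Kolmogorov) complexity.
import random
import zlib

n = 100_000

# Algorithmically simple: fully determined by a short program and a fixed seed,
# yet statistically patternless from the compressor's point of view.
rng = random.Random(0)
pseudo_random = bytes(rng.getrandbits(8) for _ in range(n))

# Also algorithmically simple, but statistically redundant (long repetitions),
# which LZ-style coders exploit directly.
repetitive = bytes([0, 1]) * (n // 2)

for label, data in [("pseudo-random (low K, high entropy rate)", pseudo_random),
                    ("repetitive   (low K, low entropy rate)", repetitive)]:
    ratio = len(zlib.compress(data, 9)) / len(data)
    print(f"{label}: compressed/original = {ratio:.3f}")
```

On a typical run the pseudo-random sequence stays near ratio 1.0 while the repetitive one drops close to 0, even though both are outputs of comparably short programs; this is the sense in which GZIP/LZW-style baselines measure statistical rather than algorithmic structure.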