We introduce an open-ended test grounded in algorithmic probability that can avoid benchmark contamination in the quantitative evaluation of frontier models in the context of their Artificial General Intelligence (AGI) and Superintelligence (ASI) claims. Unlike other tests, this test does not rely on statistical compression methods (such as GZIP or LZW), which are more closely related to Shannon entropy than to Kolmogorov complexity and cannot test beyond simple pattern matching. The test challenges aspects of AI, in particular LLMs, related to fundamental features of intelligence, such as synthesis and model creation in the context of inverse problems (generating new knowledge from observation). We argue that metrics based on model abstraction and abduction (optimal Bayesian `inference') for predictive `planning' can provide a robust framework for testing intelligence, including natural intelligence (human and animal), narrow AI, AGI, and ASI. We found that successive LLM versions tend to be fragile and incremental, relying chiefly on memorisation, with progress likely driven by the size of the training data. The results were compared with a hybrid neurosymbolic approach that theoretically guarantees universal intelligence based on the principles of algorithmic probability and Kolmogorov complexity. The method outperforms LLMs in a proof-of-concept on short binary sequences. We prove that compression is equivalent to, and directly proportional to, a system's predictive power: if a system can better predict, it can better compress, and if it can better compress, it can better predict. Our findings strengthen the suspicion that LLMs have fundamental limitations, exposing them as systems optimised to create the perception of mastery over human language.
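The compression–prediction equivalence can be illustrated with a minimal sketch (an illustration only, not the paper's neurosymbolic method): under arithmetic coding, a predictor that assigns probability p to each observed symbol contributes -log2(p) bits to the code, so lower predictive log-loss is literally shorter compression. Here a learning predictor (Laplace's rule of succession) is compared with a non-learning uniform predictor on a biased binary sequence; the sequence and both predictors are illustrative choices.

```python
import math

def code_length(bits, predict):
    """Arithmetic-coding length in bits: sum of -log2 p(observed bit)."""
    total, history = 0.0, []
    for b in bits:
        p = predict(history)              # predicted probability that the next bit is 1
        total += -math.log2(p if b == 1 else 1 - p)
        history.append(b)
    return total

def uniform(history):
    return 0.5                            # no learning: exactly 1 bit per symbol

def laplace(history):
    # Laplace rule of succession: (ones + 1) / (n + 2)
    return (sum(history) + 1) / (len(history) + 2)

seq = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1]    # sequence biased toward 1
print(code_length(seq, uniform))              # 12.0 bits
print(code_length(seq, laplace))              # fewer bits: better prediction, better compression
```

The better predictor yields a strictly shorter code on this sequence, which is the direction of the equivalence the abstract states: improving prediction (lower log-loss) and improving compression (shorter code) are the same quantity.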