We introduce an open-ended test grounded in algorithmic probability that can avoid benchmark contamination in the quantitative evaluation of frontier models and their Artificial General Intelligence (AGI) and Superintelligence (ASI) claims. Unlike other tests, this test does not rely on statistical compression methods (such as GZIP or LZW), which are more closely related to Shannon entropy than to Kolmogorov complexity and cannot test beyond simple pattern matching. The test challenges aspects of AI, in particular LLMs, related to fundamental features of intelligence, such as synthesis and model creation in the context of inverse problems (generating new knowledge from observation). We argue that metrics based on model abstraction and abduction (optimal Bayesian `inference') for predictive `planning' can provide a robust framework for testing intelligence, including natural intelligence (human and animal), narrow AI, AGI, and ASI. We found that successive LLM versions tend to be fragile and incremental, as a result of memorisation alone, with progress likely driven by the size of the training data. We compared these results with a hybrid neurosymbolic approach that theoretically guarantees universal intelligence based on the principles of algorithmic probability and Kolmogorov complexity; in a proof-of-concept on short binary sequences, this method outperforms LLMs. We prove that compression is equivalent to, and directly proportional to, a system's predictive power, and vice versa: if a system can predict better, it can compress better, and if it can compress better, it can predict better. Our findings strengthen the suspicion that LLMs have fundamental limitations, exposing them as systems optimised for the perception of mastery of human language.
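The compression-prediction equivalence asserted above has a standard formal grounding in algorithmic information theory, sketched here in our own notation (a universal prefix machine $U$; this is the textbook formulation, not necessarily the paper's): Levin's universal semimeasure and coding theorem, together with Solomonoff-style next-bit prediction,
\[
m(x) \;=\; \sum_{p \,:\, U(p) = x} 2^{-|p|},
\qquad
-\log_2 m(x) \;=\; K(x) + O(1),
\qquad
\Pr(b \mid x) \;=\; \frac{M(xb)}{M(x)},
\]
where $K$ is prefix Kolmogorov complexity and $M$ is the monotone analogue of $m$ used for sequence prediction. The coding theorem says the shortest description length of $x$ and its algorithmic probability determine each other up to a constant, which is the sense in which better compression and better prediction coincide.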
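As a minimal, self-contained sketch of that prediction-compression duality (not the paper's test or its neurosymbolic method), the following Python toy uses the standard Krichevsky-Trofimov estimator: a sequential predictor's probabilities translate, via arithmetic coding, into a code length of $-\log_2 p$ bits per symbol, so a source the predictor models well is exactly a source it compresses well. The function name and the biased/random example data are illustrative.

```python
import math
import random

def kt_code_length(bits):
    """Code length (in bits) an arithmetic coder driven by the
    Krichevsky-Trofimov estimator would assign to `bits`.
    The estimator predicts p(next = 1) = (#ones + 1/2) / (#bits seen + 1);
    each observed bit then costs -log2(probability assigned to it),
    so better sequential prediction means a shorter code."""
    ones = zeros = 0
    total = 0.0
    for b in bits:
        p_one = (ones + 0.5) / (ones + zeros + 1.0)
        p = p_one if b == 1 else 1.0 - p_one
        total += -math.log2(p)
        ones += b
        zeros += 1 - b
    return total

random.seed(0)
predictable = [1] * 56 + [0] * 8                             # heavily biased source
random.shuffle(predictable)
incompressible = [random.getrandbits(1) for _ in range(64)]  # fair coin flips

print(f"biased source: {kt_code_length(predictable):5.1f} bits for 64 input bits")
print(f"random source: {kt_code_length(incompressible):5.1f} bits for 64 input bits")
```

On the biased source the estimator's predictions concentrate and the code length falls well below 64 bits; on the fair-coin source it cannot do better than about one bit per bit, illustrating in miniature the equivalence the abstract states.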