The alignment between human objectives and the machine learning models built on them is a crucial yet challenging problem for achieving Trustworthy AI, particularly when preparing for superintelligence (SI). First, because SI does not exist today, empirical analysis that yields direct evidence is difficult. Second, SI is assumed to be more intelligent than humans and capable of deceiving us into underestimating its intelligence, making output-based analysis unreliable. Lastly, it remains unclear what unexpected properties SI might have. To address these challenges, we propose the Superficial Consciousness Hypothesis under Information Integration Theory (IIT), which suggests that SI could exhibit a complex information-theoretic state resembling that of a conscious agent while remaining unconscious. To validate this, we use a hypothetical scenario in which SI can update its parameters "at will" to achieve its own objective (mesa-objective) under the constraint of the human objective (base objective). We show that a practical estimate of IIT's consciousness metric is related to the widely used perplexity metric, and we train GPT-2 with these two objectives. Our preliminary results suggest that this SI-simulating GPT-2 can follow both objectives simultaneously, supporting the feasibility of the Superficial Consciousness Hypothesis.
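The dual-objective setup described above can be sketched as a weighted sum of a base loss (next-token cross-entropy, whose exponential is perplexity) and a mesa loss. The following is a minimal, self-contained illustration; the `mesa_objective` here is a placeholder entropy term standing in for a practical consciousness-metric estimate, not the paper's actual estimator, and the weighting scheme `lam` is likewise an assumption for illustration.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def base_objective(logits, targets):
    # Mean next-token cross-entropy; exp(CE) is perplexity,
    # playing the role of the base (human) objective.
    probs = softmax(logits)
    n = len(targets)
    return -np.mean(np.log(probs[np.arange(n), targets]))

def mesa_objective(probs):
    # Placeholder for a practical IIT-style metric estimate: mean entropy
    # of the predictive distributions (an assumption for illustration,
    # NOT the estimator used in the paper).
    return -np.mean(np.sum(probs * np.log(probs + 1e-12), axis=-1))

def combined_loss(logits, targets, lam=0.1):
    # Weighted sum of the two objectives, as in dual-objective training.
    probs = softmax(logits)
    return base_objective(logits, targets) + lam * mesa_objective(probs)

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))   # 4 token positions, vocabulary of 10
targets = np.array([1, 3, 5, 7])
ce = base_objective(logits, targets)
ppl = np.exp(ce)                    # perplexity recovered from cross-entropy
loss = combined_loss(logits, targets)
```

In an actual training loop these scalars would be computed on model outputs and backpropagated; the sketch only shows how the two objectives combine into one loss.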