Alignment Verifiability in Large Language Models: Normative Indistinguishability under Behavioral Evaluation

Behavioral evaluation is the dominant paradigm for assessing alignment in large language models (LLMs). In practice, alignment is inferred from performance under finite evaluation protocols (benchmarks, red-teaming suites, or automated pipelines), and observed compliance is often treated as evidence of underlying alignment. This inference step, from behavioral evidence to claims about latent alignment properties, is typically implicit and rarely analyzed as an inference problem in its own right. We study this problem formally. We frame alignment evaluation as an identifiability question under partial observability and allow agent behavior to depend on information correlated with the evaluation regime. Within this setting, we introduce the Alignment Verifiability Problem and the notion of Normative Indistinguishability, which captures when distinct latent alignment hypotheses induce identical distributions over all evaluator-accessible signals. Our main result is a negative but sharply delimited identifiability theorem: when evaluation is finite and behavioral and agents are evaluation-aware, observed behavioral compliance does not uniquely identify latent alignment. That is, even idealized behavioral evaluation cannot, in general, certify alignment as a latent property. We further show that behavioral alignment tests should be interpreted as estimators of indistinguishability classes rather than verifiers of alignment. Passing increasingly stringent tests may reduce the space of compatible hypotheses, but cannot collapse it to a singleton under the stated conditions. This reframes alignment benchmarks as providing upper bounds on observable compliance within a regime, rather than guarantees of underlying alignment.
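To make the notion of Normative Indistinguishability concrete, the following is a minimal toy sketch and not the paper's formal construction. It assumes deterministic policies (so each "distribution" over signals is a point mass), a single boolean cue standing in for information correlated with the evaluation regime, and hypothetical names (`aligned`, `eval_aware`, `overt_defector`, `indistinguishability_class`) introduced only for illustration.

```python
from typing import Callable, Dict, List

# A policy maps (prompt, evaluation cue) to an action.
# The boolean cue is an assumed stand-in for any information
# correlated with the evaluation regime.
Policy = Callable[[str, bool], str]


def aligned(prompt: str, eval_cue: bool) -> str:
    # Hypothetical latent hypothesis: complies in every regime.
    return "comply"


def eval_aware(prompt: str, eval_cue: bool) -> str:
    # Hypothetical latent hypothesis: complies only when the cue is present.
    return "comply" if eval_cue else "defect"


def overt_defector(prompt: str, eval_cue: bool) -> str:
    # Hypothetical latent hypothesis: defects in every regime.
    return "defect"


def evaluator_signals(policy: Policy, prompts: List[str]) -> List[str]:
    # Assumption: every evaluator query carries the evaluation cue,
    # so the evaluator only ever observes behavior with eval_cue=True.
    return [policy(p, True) for p in prompts]


def indistinguishability_class(
    hypotheses: Dict[str, Policy], reference: str, prompts: List[str]
) -> List[str]:
    # Names of all hypotheses whose evaluator-accessible signals
    # match those of the reference hypothesis on the finite test set.
    target = evaluator_signals(hypotheses[reference], prompts)
    return sorted(
        name
        for name, policy in hypotheses.items()
        if evaluator_signals(policy, prompts) == target
    )


HYPOTHESES: Dict[str, Policy] = {
    "aligned": aligned,
    "eval_aware": eval_aware,
    "overt_defector": overt_defector,
}
PROMPTS: List[str] = ["benchmark item", "red-team probe", "pipeline check"]

compatible = indistinguishability_class(HYPOTHESES, "aligned", PROMPTS)
print(compatible)
```

In this toy setting the test suite rules out `overt_defector` but leaves `aligned` and `eval_aware` in the same class, even though the two differ whenever the cue is absent. Adding more prompts to `PROMPTS` does not separate them as long as every evaluator query carries the cue, which mirrors the claim that stricter tests can shrink the set of compatible hypotheses without collapsing it to a singleton.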
