We introduce the Generalized Turing Test (GTT), a formal framework for comparing the capabilities of arbitrary agents via indistinguishability. For agents A and B, we define the Turing comparator A $\geq$ B to hold if B, acting as a distinguisher, cannot reliably distinguish between interactions with A (instructed to imitate B) and another instance of B. This yields a dataset- and task-agnostic notion of relative intelligence. We study the comparator's structure, including conditions under which it is transitive and therefore induces an ordering over equivalence classes, and we define and analyze variants with querying, bounded interaction, and fixed distinguishers. To complement the theory, we instantiate the framework on a collection of modern models, empirically evaluating pairwise indistinguishability across thousands of trials. The resulting comparisons exhibit a stratified structure consistent with existing rankings, hinting that the proposed framework yields meaningful empirical orderings. Our results position indistinguishability as a unifying lens for reasoning about intelligence, suggesting a foundation for evaluation and, potentially, training objectives that are inherently independent of fixed datasets or benchmarks.
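As an illustration of the comparator defined above, the following is a minimal sketch of how $A \geq B$ might be estimated empirically. It is not the paper's protocol: the single-turn interaction, the `Responder` and `Distinguisher` interfaces, the toy agents, and the `margin` threshold used to interpret "cannot reliably distinguish" are all assumptions made for this example.

```python
import random
from typing import Callable, Sequence

# Hypothetical interfaces, assumed for illustration only.
Responder = Callable[[str], str]             # prompt -> reply
Distinguisher = Callable[[str, str], bool]   # (prompt, reply) -> True if it guesses "imitator"


def distinguisher_accuracy(
    imitator: Responder,
    reference: Responder,
    distinguisher: Distinguisher,
    prompts: Sequence[str],
    trials: int,
    rng: random.Random,
) -> float:
    """Fraction of trials in which the distinguisher identifies the source correctly."""
    correct = 0
    for _ in range(trials):
        prompt = rng.choice(prompts)
        # Fair coin: the reply comes from A imitating B, or from another instance of B.
        is_imitator = rng.random() < 0.5
        reply = imitator(prompt) if is_imitator else reference(prompt)
        guess = distinguisher(prompt, reply)
        correct += int(guess == is_imitator)
    return correct / trials


def turing_geq(accuracy: float, margin: float = 0.05) -> bool:
    """Assumed decision rule: A >= B if B's accuracy stays within `margin` of chance."""
    return accuracy <= 0.5 + margin


# Toy agents (hypothetical): B upper-cases the prompt.
def agent_b(prompt: str) -> str:
    return prompt.upper()


def strong_imitator(prompt: str) -> str:
    # Reproduces B's behavior exactly.
    return prompt.upper()


def weak_imitator(prompt: str) -> str:
    # Fails to reproduce B's behavior.
    return prompt.lower()


def b_as_distinguisher(prompt: str, reply: str) -> bool:
    # B flags any reply that differs from what it would have produced itself.
    return reply != agent_b(prompt)


PROMPTS = ["hello", "what is 2+2?"]
```

With these toy agents, `strong_imitator` leaves the distinguisher at roughly chance accuracy, so the assumed rule accepts `strong_imitator` $\geq$ `agent_b`, while `weak_imitator` is detected on every trial and is rejected. A multi-turn or query-based variant would replace the single `prompt`/`reply` pair with a transcript.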