Hosted-LLM providers have a silent-substitution incentive: advertise a stronger model while serving replies from a cheaper one. Probe-after-return schemes such as SVIP leave a parallel-serve side channel, because a dishonest provider can route the verifier's probe to the advertised model while serving ordinary users from a substitute. We propose a commit-open protocol that closes this gap. Before any opening request, the provider commits, via a Merkle tree, to a per-position sparse-autoencoder (SAE) feature-trace sketch of its served output at a published probe layer. A verifier opens random positions, scores them against a public named-circuit probe library calibrated with cross-backend noise, and decides with a fixed-threshold joint-consistency z-score rule. We instantiate the protocol on three backbones: Qwen3-1.7B, Gemma-2-2B, and a 4.5x scale-up to Gemma-2-9B with a 131k-feature SAE. All 17 attackers, spanning same-family lifts, cross-family substitutes, and adaptive LoRA of rank at most 128, are rejected at a shared, scale-stable threshold; the same attackers all evade a matched SVIP-style parallel-serve baseline. A white-box end-to-end attack that backpropagates through the frozen SAE encoder does not close the margin, and a feature-forgery attacker that never runs M_hon is bounded in closed form by an intrinsic-dimension argument. Commitment adds at most 2.1% to forward-only wall-clock time at batch size 32.
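The commit-open step described above can be illustrated with a minimal Merkle-tree sketch. This is an assumption-laden toy, not the protocol's actual implementation: the hash function (SHA-256), the leaf encoding, the odd-level padding rule, and all function names (`leaf_hash`, `build_tree`, `open_position`, `verify_opening`) are hypothetical, and the byte strings in `sketches` merely stand in for per-position SAE feature-trace sketches. Probe scoring and the z-score decision rule are omitted.

```python
import hashlib
import secrets


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(position: int, sketch: bytes) -> bytes:
    # Bind the position into the leaf and domain-separate leaves (0x00)
    # from internal nodes (0x01).
    return _h(b"\x00" + position.to_bytes(8, "big") + sketch)


def node_hash(left: bytes, right: bytes) -> bytes:
    return _h(b"\x01" + left + right)


def build_tree(leaves):
    """Return every tree level, leaves first and the root level last.

    Assumption: a level with an odd number of nodes pairs its last node
    with itself.
    """
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:
            cur = cur + [cur[-1]]
        levels.append(
            [node_hash(cur[i], cur[i + 1]) for i in range(0, len(cur), 2)]
        )
    return levels


def open_position(levels, index):
    """Provider side: sibling path that authenticates the leaf at `index`."""
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling >= len(level):  # last node of an odd level pairs with itself
            sibling = index
        path.append(level[sibling])
        index //= 2
    return path


def verify_opening(root, position, sketch, path):
    """Verifier side: recompute the root from one opened position."""
    node = leaf_hash(position, sketch)
    index = position
    for sibling in path:
        if index % 2 == 0:
            node = node_hash(node, sibling)
        else:
            node = node_hash(sibling, node)
        index //= 2
    return node == root


# Provider: commit to stand-in sketches before seeing any opening request.
sketches = [bytes([i]) * 4 for i in range(5)]
levels = build_tree([leaf_hash(i, s) for i, s in enumerate(sketches)])
root = levels[-1][0]

# Verifier: pick random positions only after the root has been published.
challenge = secrets.SystemRandom().sample(range(len(sketches)), 2)
openings_ok = all(
    verify_opening(root, pos, sketches[pos], open_position(levels, pos))
    for pos in challenge
)
```

Because the root is fixed before the verifier chooses positions, the provider cannot tailor the opened sketches to the challenge; an opening with an altered sketch or a mismatched position fails `verify_opening`.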