Language model parameters are known to impose geometric constraints, unique to each model, on the model's logit outputs. These constraints serve as a signature that identifies the model, but they also leak the model's final-layer parameters when an API exposes logits. We investigate more restrictive APIs that expose only token rankings (i.e., the ordering of tokens by probability, without the probability values) and find that rankings also constitute a signature: every model has a unique set of feasible top-$k$ rankings for sufficiently large $k$. Furthermore, the ranking signature is the first known (polynomially) unforgeable signature, since finding a model with the same set of feasible rankings is NP-hard. On the security front, we find that token rankings, like logits, are already sufficient to approximately steal the final layer of the model. However, the approximation is too coarse to forge the signature, and the attack can be effectively countered by restricting the API to the top-$k$ tokens for sufficiently small $k$. Since the $k$ required to present the model signature is generally smaller than the largest $k$ that still prevents stealing, an API can present an unforgeable signature without leaking model parameters.
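To make the notion of a "set of feasible rankings" concrete, the following is a minimal toy sketch, not the procedure used in this work. It assumes a hypothetical final layer `W` with a vocabulary of 4 tokens and a hidden size of 2, and enumerates which full rankings some hidden state can produce. Because the logits are confined to a 2-dimensional subspace, the 6 pairwise tie lines through the origin cut the plane into at most 12 sectors, so at most 12 of the 24 possible orderings are reachable; which ones are reachable depends on `W`.

```python
from itertools import permutations

import numpy as np

# Hypothetical toy "final layer": 4 tokens, hidden size 2 (values are arbitrary).
W = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [-1.0, 0.5],
    [0.3, -1.0],
])


def feasible_rankings(W, n_samples=3600):
    """Collect every full token ranking reachable by some 2-D hidden state.

    Scaling a hidden state by a positive factor does not change the ranking,
    so sweeping directions on the unit circle covers the whole plane.
    Softmax is monotone, so ranking by logit equals ranking by probability.
    """
    angles = np.linspace(0.0, 2.0 * np.pi, n_samples, endpoint=False)
    hidden = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (n, 2)
    logits = hidden @ W.T                                        # (n, 4)
    order = np.argsort(-logits, axis=1)  # most to least likely token
    return {tuple(int(t) for t in row) for row in order}


feasible = feasible_rankings(W)
total = len(list(permutations(range(W.shape[0]))))
print(f"{len(feasible)} of {total} rankings are feasible")
```

A different `W` generally yields a different feasible set, which is the intuition behind treating that set as a signature. Real models have far larger vocabularies and hidden sizes, where exhaustive sweeps of this kind are not practical.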