Large language models (LLMs) are increasingly deployed through hosted APIs, making model extraction a practical threat to model ownership and service security. However, individual extraction queries often resemble benign requests, and existing evaluations typically focus on single-query anomaly scoring or on settings that contrast purely benign users with purely attacking users. We formulate model extraction monitoring as benign-calibrated distribution testing over traffic windows and show that an embarrassingly simple detector is effective: embed incoming queries into a semantic space and test whether their aggregate distribution deviates from historical benign traffic. We instantiate the detector with maximum mean discrepancy (MMD), using only benign-vs-benign comparisons to set the decision threshold. We evaluate on fourteen attacker-normal query pairs from four extraction scenarios and compare against adapted PRADA, SEAT, CAP, DATE, and marginal Mahalanobis baselines. Across three random seeds, MMD achieves 0.3% benign FPR, 100.0% pure-attacker TPR, 90.5% TPR averaged over attacker fractions, and 95.1% balanced accuracy. These results show that benign-calibrated distribution testing is a strong empirical baseline for model extraction detection in both user-level and mixed multi-user LLM API traffic. Code is released at: https://github.com/LabRAI/mmd-llm-mea-detection.
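To make the detection idea concrete, the following is a minimal sketch of window-level MMD testing with a threshold calibrated only on benign-vs-benign comparisons. It is not the released implementation: the RBF kernel, the median-heuristic bandwidth, the biased MMD estimator, the quantile-based threshold, and all function and parameter names (`mmd2`, `calibrate_threshold`, `is_suspicious`, `fpr`, `n_trials`) are assumptions chosen for illustration, and the query embeddings are assumed to be precomputed NumPy arrays.

```python
import numpy as np


def sq_dists(a, b):
    """Pairwise squared Euclidean distances between rows of a and rows of b."""
    d = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * (a @ b.T)
    return np.maximum(d, 0.0)  # clip tiny negatives caused by rounding


def median_bandwidth(x):
    """Median-heuristic RBF bandwidth (an assumed, common default)."""
    d = sq_dists(x, x)
    med = np.median(d[np.triu_indices_from(d, k=1)])
    return float(np.sqrt(0.5 * med)) if med > 0 else 1.0


def mmd2(x, y, sigma):
    """Biased (V-statistic) estimate of squared MMD under an RBF kernel."""
    g = 1.0 / (2.0 * sigma ** 2)
    kxx = np.exp(-g * sq_dists(x, x)).mean()
    kyy = np.exp(-g * sq_dists(y, y)).mean()
    kxy = np.exp(-g * sq_dists(x, y)).mean()
    return float(kxx + kyy - 2.0 * kxy)


def calibrate_threshold(reference, benign_pool, window, sigma, rng,
                        n_trials=200, fpr=0.01):
    """Set the threshold from benign-vs-benign scores only.

    `fpr` is a hypothetical target false-positive rate: the threshold is
    the (1 - fpr) quantile of scores for random benign windows.
    """
    scores = []
    for _ in range(n_trials):
        idx = rng.choice(len(benign_pool), size=window, replace=False)
        scores.append(mmd2(reference, benign_pool[idx], sigma))
    return float(np.quantile(scores, 1.0 - fpr))


def is_suspicious(window_emb, reference, sigma, threshold):
    """Flag a traffic window whose MMD to the benign reference exceeds the threshold."""
    return mmd2(reference, window_emb, sigma) > threshold
```

In use, one would hold out part of the historical benign embeddings as `reference`, calibrate the threshold on the rest, and then score each incoming window (from a single user or from mixed multi-user traffic) with `is_suspicious`. No attacker data enters the calibration step, which mirrors the benign-only thresholding described above.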