We present a Bayesian calibration layer for slopsquat detectors, that is, detectors that flag hallucinated package imports in code produced by large language models (LLMs). Where existing pipelines emit binary decisions (flag / do-not-flag), our layer emits a Beta-posterior probability per detection, derived from a three-category epistemic taxonomy that explicitly classifies each prior as empirically calibrated, constructively argued, or engineering-judgement-traced. Beyond the primary 200/404 registry channel, the calibrated layer exploits PyPI metadata signals (package age, release count, author descriptor, summary) to surface registered-but-suspicious packages that a binary registry detector misses, which is the realistic post-LLM-emission attacker regime. The resulting risk-aware primitive is directly consumable by downstream CI gates and supports principled threshold decisions across detection rules. We evaluate the calibration on a merged corpus of 1,734 Python snippets, drawn from a stratified 189-prompt BigCodeBench slice plus a 100-prompt niche-library stress-test set and generated across a six-model panel spanning four cloud models (Claude-Sonnet-4.6, Mistral-Large, DeepSeek-v4-pro, DeepSeek-R1) and two local open-weight code models (Mistral Codestral, Meta CodeLlama). We compare against a re-implemented binary baseline inspired by Mahmud et al.; because this baseline shares its registry oracle with our ground truth, it serves as a degenerate upper bound rather than a genuine competitor. The calibrated layer reproduces the baseline's strict-registry detections and introduces well-calibrated additional flags on the metadata channel. We assess detector asymmetry with a McNemar paired test and calibration with both a flagged-subset Expected Calibration Error and a strictly proper full-corpus Brier score.
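To make the quantities named above concrete, the following is a minimal Python sketch of a per-rule Beta-posterior probability and of the three evaluation statistics (Brier score, Expected Calibration Error, exact McNemar test). It is not the authors' implementation: the class `BetaPrior`, the category labels, the function names, the equal-width binning, and the choice of the exact binomial form of McNemar's test are all assumptions made for illustration.

```python
from dataclasses import dataclass
from math import comb


@dataclass(frozen=True)
class BetaPrior:
    """Beta(alpha, beta) prior for one detection rule (hypothetical structure)."""
    alpha: float
    beta: float
    category: str  # assumed labels: "empirical", "constructive", "engineering"

    def posterior_mean(self, hits: int, misses: int) -> float:
        # Conjugate update: the posterior is Beta(alpha + hits, beta + misses).
        return (self.alpha + hits) / (self.alpha + self.beta + hits + misses)


def brier_score(probs, labels):
    """Mean squared error between probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE; pass only flagged items for a flagged-subset ECE."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for b in bins:
        if b:
            mean_conf = sum(p for p, _ in b) / len(b)
            mean_acc = sum(y for _, y in b) / len(b)
            ece += len(b) / total * abs(mean_conf - mean_acc)
    return ece


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of asymmetry
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

As a usage illustration with made-up numbers, `BetaPrior(1.0, 1.0, "engineering").posterior_mean(hits=3, misses=1)` updates a uniform prior with three confirmed and one refuted detection, and `mcnemar_exact_p(b, c)` takes the counts of snippets flagged by only one of the two detectors being compared.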