CRC-Screen: Certified DNA-Synthesis Hazard Screening Under Taxonomic Shift

DNA-synthesis providers screen incoming orders by searching the requested sequence against curated hazard lists. We show that this baseline collapses to a 100% false-flag rate when the hazardous sequence comes from a taxonomic family absent from the reference set: under Conformal Risk Control's certified miss-rate constraint, a low-discrimination signal forces the threshold below the entire test-benign mass. We compose three signals derived from a synthesis order's public annotation: $k$-mer Jaccard similarity to known toxins, the trimmed-mean score of a five-LLM judge panel, and cosine similarity to clustered embedding centroids. Fused under a monotone logistic aggregator and calibrated by Conformal Risk Control, the resulting screener certifies $\mathbb{E}[\mathrm{FNR}] \le \alpha + \mathrm{TV}$, where the additive term $\mathrm{TV}$ is the calibration-to-test distribution shift under family holdout (a certified ceiling of 24-49% across folds). Across ten leave-one-taxonomic-family-out folds at $\alpha = 0.05$ on UniProt KW-0800 reviewed toxins, the calibrated screener achieves 0% empirical test miss rate on every fold and 0% test false-flag rate on nine of ten folds. The bound's finite-sample slack $1/(n_{\mathrm{cal}}+1)$ caps the certifiable miss rate at 1.77% on our 200-hazard subsample; reaching procurement-grade $\alpha = 10^{-3}$ requires an $18\times$ larger calibration set, which the full reviewed UniProt KW-0800 corpus is large enough to deliver. The binding constraint on certifiable DNA-synthesis screening is calibration data, not algorithms. Code: https://github.com/najmulhasan-code/crc-screen
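The finite-sample slack $1/(n_{\mathrm{cal}}+1)$ can be illustrated with a minimal sketch of the standard Conformal Risk Control threshold rule for a 0/1 miss loss. This is not the code from the linked repository: the function name `crc_fnr_threshold`, the convention "flag an order when its fused score is at least the threshold", and the calibration size of 56 are assumptions made for illustration only.

```python
import math

def crc_fnr_threshold(hazard_scores, alpha):
    """Largest threshold t such that (misses(t) + 1) / (n + 1) <= alpha.

    Assumed convention (hypothetical): an order is flagged when its fused
    score is >= t, so a calibration hazard is missed when its score is < t.
    Returns None when alpha < 1 / (n + 1), i.e. when no threshold can be
    certified with this many calibration hazards.
    """
    n = len(hazard_scores)
    # Number of calibration misses the CRC bound can tolerate.
    allowed_misses = math.floor(alpha * (n + 1)) - 1
    if allowed_misses < 0:
        return None  # finite-sample slack 1/(n+1) already exceeds alpha
    ordered = sorted(hazard_scores)
    # At most `allowed_misses` scores lie strictly below this order statistic.
    return ordered[min(allowed_misses, n - 1)]

# Illustrative calibration set of 56 hazard scores (size is an assumption).
scores = [i / 56 for i in range(56)]
t_loose = crc_fnr_threshold(scores, 0.05)    # certifiable: 1/57 <= 0.05
t_strict = crc_fnr_threshold(scores, 0.001)  # not certifiable: 1/57 > 0.001
```

With these illustrative scores, the looser target yields a usable threshold, while the stricter target returns `None` because no choice of threshold can push the bound below $1/(n_{\mathrm{cal}}+1)$; only a larger calibration set helps, which is the sense in which calibration data is the binding constraint.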
