Chinese Legal Case Retrieval (LCR) benchmarks grade a reference judgment as relevant when its legal characterization matches that of the query, and strong systems now reach an NDCG@10 of 0.85-0.88. Most of the gap between BM25 and the best trained system can be recovered with no retrieval model at all: ranking candidates solely by whether they share the query's primary charge, with ties broken by BM25, closes 99.2% of that gap on LeCaRDv2, with no detectable difference from the best trained system. This reflects benchmark design: LeCaRDv2 defines top relevance via the crime's key constitutive elements, which encode the charge, so same-charge cases are relevant by construction (relevance lift 4.49; charge-to-relevance macro-AUC 0.871). With charge held fixed, the trained reranker's advantage over BM25 collapses to a small within-charge residual (+0.026 NDCG@10, cluster-bootstrap CI excluding zero, about a quarter of the original effect), which is the only non-definitional positive result. The effect is not uniform: the same rule recovers 84.3% of the gap on LeCaRDv1 and is out of spec on CAIL2022, with the charge-to-relevance signal weakening in step (macro-AUC 0.871/0.759/0.728 on LeCaRDv2/LeCaRDv1/CAIL2022); a predicted-charge cascade reproduces 76.6% on LeCaRDv2 but does not transfer. The construct can also be exploited at the first stage: an exploratory zero-training charge-pool channel lifts LeCaRDv2 recall (R@100 +0.025, while wrong-charge controls hurt). We report this as a positive control for the confound, not as a retrieval method or a novelty claim. Charge is thus a high-leverage construct-validity factor at the benchmark level; it is neither a uniform explanation of NDCG@10 nor evidence that any system relies on charge. We package established construct-validity and partial-input checks as a reusable charge-controlled protocol (CCE); on all three benchmarks its triggers come back null or descriptive, behaving as designed. We release the scripts, schema, and protocol so that future benchmarks can be screened before their NDCG@10 is read as legal-reasoning ability.
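The charge-only rule and the gap-recovery figure described above can be illustrated with a minimal sketch. This is not the released script: the field names (`id`, `charge`, `bm25`), the function names, and all numbers in the toy data are hypothetical. It assumes each candidate already carries a primary-charge label and a BM25 score for the query.

```python
from typing import List, Sequence


def charge_rule_rank(query_charge: str, candidates: Sequence[dict]) -> List[str]:
    """Order candidates by charge match first, then by BM25 score.

    Candidates sharing the query's primary charge come before all others;
    within each group, a higher BM25 score ranks higher.
    """
    ordered = sorted(
        candidates,
        key=lambda c: (c["charge"] == query_charge, c["bm25"]),
        reverse=True,
    )
    return [c["id"] for c in ordered]


def gap_closed(baseline: float, rule: float, best: float) -> float:
    """Fraction of the baseline-to-best metric gap recovered by the rule."""
    if best == baseline:
        raise ValueError("baseline and best are equal; there is no gap to close")
    return (rule - baseline) / (best - baseline)


# Toy candidate pool (hypothetical labels and scores, not benchmark data).
candidates = [
    {"id": "c1", "charge": "theft", "bm25": 3.2},
    {"id": "c2", "charge": "fraud", "bm25": 9.1},
    {"id": "c3", "charge": "theft", "bm25": 7.5},
    {"id": "c4", "charge": "robbery", "bm25": 4.0},
]

ranking = charge_rule_rank("theft", candidates)
print(ranking)  # same-charge cases first: ['c3', 'c1', 'c2', 'c4']

# Illustrative NDCG@10 values only; these are not the paper's numbers.
print(round(gap_closed(baseline=0.70, rule=0.85, best=0.90), 2))
```

The sort key is a `(bool, float)` tuple, so BM25 only decides the order among candidates with the same charge-match status; a high-scoring wrong-charge case such as `c2` can never outrank a same-charge case.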