Charge as a Construct-Validity Factor in Chinese Legal Case Retrieval: A Cross-Benchmark Audit

Chinese Legal Case Retrieval (LCR) benchmarks grade a reference judgment as relevant when its legal characterization matches that of the query, and strong systems now reach NDCG@10 of 0.85-0.88. Most of the gap between BM25 and the best trained system can be recovered without any retrieval model: ranking candidates only by shared primary charge, with ties broken by BM25, closes 99.2% of the gap on LeCaRDv2, with no detectable difference from the best trained system. This reflects benchmark design: LeCaRDv2 defines top relevance via the crime's key constitutive elements, which encode the charge, so same-charge cases are relevant by construction (relevance lift 4.49; charge-to-relevance macro-AUC 0.871). With charge held fixed, the trained reranker's advantage over BM25 collapses to a small within-charge residual (+0.026 NDCG@10, cluster-bootstrap CI excluding zero, about a quarter of the original advantage), the only non-definitional positive result. The effect is not uniform: the same rule recovers 84.3% of the gap on LeCaRDv1 and is out of spec on CAIL2022, with the charge-to-relevance signal weakening in step (macro-AUC 0.871/0.759/0.728); a predicted-charge cascade reproduces 76.6% on LeCaRDv2 but does not transfer. The construct can also be exploited at the first stage: an exploratory zero-training charge-pool channel lifts LeCaRDv2 recall (R@100 +0.025, while wrong-charge controls hurt), which we report as a positive control for the confound, not as a retrieval method or novelty claim. Charge is thus a high-leverage construct-validity factor at the benchmark level, but it is neither a uniform explanation of NDCG@10 nor evidence that any system relies on charge. We package established construct-validity and partial-input checks as a reusable charge-controlled protocol (CCE); on all three benchmarks its triggers come back null or descriptive, behaving as designed. We release the scripts, schema, and protocol so that future benchmarks can be screened before their NDCG@10 is read as legal-reasoning ability.
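
The charge-only ranking rule and the gap-recovery figure described above can be illustrated with a minimal sketch. It is not the released implementation: the `Candidate` record, the function names, the example charges, and all numeric values below are hypothetical, and the gap-recovery formula (rule minus baseline, divided by best minus baseline) is an assumed reading of "closes X% of the gap".

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical candidate record; field names are illustrative only."""
    case_id: str
    primary_charge: str
    bm25: float


def charge_then_bm25(query_charge: str, candidates: List[Candidate]) -> List[Candidate]:
    """Rank same-charge candidates first; break ties by descending BM25 score."""
    return sorted(
        candidates,
        # False sorts before True, so charge matches come first;
        # negating BM25 puts higher scores earlier within each group.
        key=lambda c: (c.primary_charge != query_charge, -c.bm25),
    )


def gap_closed(rule_score: float, baseline_score: float, best_score: float) -> float:
    """Fraction of the baseline-to-best gap recovered by the rule (assumed definition)."""
    return (rule_score - baseline_score) / (best_score - baseline_score)


# Made-up candidates for a query whose primary charge is "theft".
pool = [
    Candidate("a", "theft", 9.1),
    Candidate("b", "fraud", 12.4),
    Candidate("c", "theft", 3.0),
    Candidate("d", "robbery", 7.7),
]
ranked_ids = [c.case_id for c in charge_then_bm25("theft", pool)]
print(ranked_ids)

# Made-up NDCG@10 values, chosen only to exercise the formula.
print(round(gap_closed(rule_score=0.80, baseline_score=0.60, best_score=0.85), 3))
```

Note that the rule uses no learned parameters: the only inputs are a charge label per case and the BM25 score, which is why a high gap-recovery value points at the benchmark's relevance definition rather than at model capability.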
