Reproducibility studies must validate architectural robustness, not just numerical accuracy. We evaluate ColBERT-v2 and ConstBERT across five dimensions. ConstBERT reproduces within 0.05% MRR@10 on MS-MARCO, yet both models show a performance drop of 86-97% on long, narrative queries (TREC ToT 2025). Ablations show that this failure is architectural: performance plateaus at 20 words because the MaxSim operator weights every query token uniformly and therefore cannot distinguish signal from filler noise. Furthermore, undocumented backend parameters create an 8-point gap due to ConstBERT's sparse centroid coverage, and fine-tuning with 3x more data degrades performance by up to 29%. We conclude that architectural constraints in multi-vector retrieval cannot be overcome by adaptation alone. Code: https://github.com/utshabkg/multi-vector-reproducibility.
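To make the uniform-weighting argument concrete, the following is a minimal sketch of MaxSim late-interaction scoring in plain Python. It is not taken from the linked repository: the 2-d embeddings, the document names, and the helper functions (`dot`, `maxsim_per_token`, `maxsim_score`) are illustrative assumptions, and no trained encoder is involved. It only shows that every query token adds its best match to the sum with equal weight, so filler tokens can outvote a single informative token.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_per_token(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> List[float]:
    """For each query token, the similarity of its best-matching document token."""
    return [max(dot(q, d) for d in doc_vecs) for q in query_vecs]


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """MaxSim score: an unweighted sum over query tokens of the per-token maxima."""
    return sum(maxsim_per_token(query_vecs, doc_vecs))


# Hypothetical unit-length toy embeddings (assumed values, not model outputs).
s = math.sqrt(0.5)
doc_relevant = [[1.0, 0.0], [0.0, 1.0]]  # contains an exact match for the signal token
doc_distractor = [[s, s]]                # matches only generic filler direction

signal = [[1.0, 0.0]]                    # one informative query token
filler = [[s, s]] * 3                    # three uninformative query tokens

short_query = signal
long_query = signal + filler

for name, query in [("short", short_query), ("long", long_query)]:
    rel = maxsim_score(query, doc_relevant)
    dis = maxsim_score(query, doc_distractor)
    print(f"{name}: relevant={rel:.3f} distractor={dis:.3f}")
```

In this toy setup the relevant document wins for the short query, but the ranking flips once the three filler tokens are appended, because each filler token contributes as much to the sum as the signal token does. Any per-token weighting scheme would be a departure from the MaxSim operator as sketched here.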