Graph-augmented retrieval combines dense similarity with graph-based relevance signals such as Personalized PageRank (PPR), but these scores follow different distributions and are not directly comparable. We study this as a score calibration problem for heterogeneous retrieval fusion in multi-hop question answering. Our method, PhaseGraph, maps vector and graph scores to a common unit-free scale using percentile-rank normalization (the probability integral transform, PIT) before fusion, enabling stable combination without discarding magnitude information. Across MuSiQue and 2WikiMultiHopQA, calibrated fusion improves held-out last-hop retrieval on HippoRAG2-style benchmarks: LastHop@5 increases from 75.1% to 76.5% on MuSiQue (8W/1L, p=0.039) and from 51.7% to 53.6% on 2WikiMultiHopQA (11W/2L, p=0.023), both on independent held-out test splits. A theory-driven ablation shows that percentile-based calibration is directionally more robust than min-max normalization on both tune and test splits (1W/6L, p=0.125), while Boltzmann weighting performs comparably to linear fusion after calibration (0W/3L, p=0.25). These results suggest that score commensuration is a robust design choice, whereas the exact post-calibration operator appears to matter less on these benchmarks.
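The calibrate-then-fuse idea can be illustrated with a minimal sketch. It is not the PhaseGraph implementation: the function names, the mid-rank treatment of ties, the mixing weight `alpha`, and the choice to rank within a single query's candidate pool are all assumptions made for illustration, and the example scores are invented.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(scores):
    """Map raw scores into (0, 1) via their empirical CDF.

    Ties receive the mid-rank, so identical raw scores get identical
    calibrated scores. (Assumed convention, not taken from the paper.)
    """
    ordered = sorted(scores)
    n = len(ordered)
    calibrated = []
    for s in scores:
        below = bisect_left(ordered, s)          # candidates strictly below s
        at_or_below = bisect_right(ordered, s)   # candidates at or below s
        calibrated.append((below + at_or_below) / (2 * n))
    return calibrated


def fuse(dense_scores, ppr_scores, alpha=0.5):
    """Linear fusion of two score lists after percentile-rank calibration.

    `alpha` is a hypothetical mixing weight; both lists must score the
    same candidates in the same order.
    """
    dense_cal = percentile_rank(dense_scores)
    ppr_cal = percentile_rank(ppr_scores)
    return [alpha * d + (1 - alpha) * g for d, g in zip(dense_cal, ppr_cal)]


# Illustrative values only: cosine similarities cluster near 0.8, while
# PPR mass spans several orders of magnitude.
dense = [0.82, 0.80, 0.79, 0.40]
ppr = [1e-4, 3e-2, 5e-3, 2e-5]
fused = fuse(dense, ppr)
best = max(range(len(fused)), key=fused.__getitem__)
```

Because both inputs are reduced to ranks within the candidate pool, the fused score does not depend on the raw scale of either signal, which is the commensuration property the abstract describes.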