Training-Free Lexical-Dense Fusion for Conversational-Memory Retrieval

Retrieving the few past turns that answer a new query across long multi-session histories is the bottleneck behind long-term conversational memory (LoCoMo, LongMemEval). Recent concurrent work, Nano-Memory, shows that scoring a session by the maximum query-turn similarity (late interaction, "Turn Isolation Retrieval") beats mean-pooled session embeddings. We do not claim that effect as our own; we replicate it and ask what a training-free, CPU-only retrieval stage should add around it. We report four findings. (1) Fusion: score-level fusion of the late-interaction dense score with BM25, under a single leave-one-conversation-out weight, adds +8.8 to +17.2 pp of LoCoMo Hit@1 over late interaction alone across six encoders (all p<1e-4), reaching Hit@1 0.752 / NDCG@5 0.829 (e5-large-v2), +11.2 pp over BM25. (2) An off-the-shelf web-search cross-encoder reranker applied to the fused top-10 hurts here, lowering Hit@1 by 6.9 pp (one reranker, one configuration). (3) A pooling-operator ablation shows that top-k late interaction matches max-similarity, but a naive smooth-max (log-sum-exp) collapses for half the encoders. (4) The late-minus-early gap is large for all six encoders and tends to be larger for larger ones, while the marginal fusion gain shrinks; on LongMemEval-S, a lexical regime in which BM25 saturates, the net fusion gain over BM25 is small and not significant. A per-category analysis frames the gain as a division of labor: dense late interaction helps most on multi-hop and temporal questions but trails BM25 on adversarial ones. The contribution is a controlled, reproducible account of a strong training-free retrieval recipe, not the late-interaction retriever itself, which is Nano-Memory's. We make no claim to a complete memory architecture; this is a retrieval-stage study.
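
The scoring recipe summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (pool_session, minmax, fuse), the min-max normalization applied before fusion, and the defaults k=2 and tau=0.05 are all assumptions made for illustration, since the abstract does not specify how scores are normalized or how the pooling operators are parameterized. The sketch takes per-turn query similarities as given (for example, cosine similarities from any encoder) and shows max-similarity, top-k, and log-sum-exp pooling, followed by a weighted sum with BM25 scores.

```python
import math

def pool_session(sims, mode="max", k=2, tau=0.05):
    """Collapse one session's query-turn similarities into a single score.

    mode="max"  : max-similarity late interaction
    mode="topk" : mean of the k most similar turns (k is an assumed default)
    mode="lse"  : smooth-max, tau * log(sum(exp(s / tau))) (tau is assumed)
    """
    sims = list(sims)
    if not sims:
        raise ValueError("session has no turns")
    if mode == "max":
        return max(sims)
    if mode == "topk":
        top = sorted(sims, reverse=True)[:k]
        return sum(top) / len(top)
    if mode == "lse":
        m = max(sims)  # subtract the max for numerical stability
        return m + tau * math.log(sum(math.exp((s - m) / tau) for s in sims))
    raise ValueError(f"unknown mode: {mode}")

def minmax(scores):
    """Rescale scores to [0, 1]; an assumed normalization, not from the paper."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(dense, bm25, w):
    """Score-level fusion: w * dense + (1 - w) * BM25 on normalized scores."""
    d, b = minmax(dense), minmax(bm25)
    return [w * x + (1 - w) * y for x, y in zip(d, b)]

# Hypothetical example: three candidate sessions for one query.
turn_sims = [[0.2, 0.9, 0.1], [0.5, 0.4], [0.1, 0.1, 0.1]]
dense = [pool_session(s, mode="max") for s in turn_sims]
bm25 = [1.0, 4.0, 2.0]          # made-up lexical scores
fused = fuse(dense, bm25, w=0.5)
ranking = sorted(range(len(fused)), key=fused.__getitem__, reverse=True)
```

In this sketch the weight w is a plain parameter; the abstract selects a single weight by leave-one-conversation-out validation, which would wrap fuse in a loop over held-out conversations. Note also that the unnormalized log-sum-exp score is always at least the maximum similarity and can exceed it by up to tau * log(n) for a session of n turns, so it depends on session length in a way max-similarity does not.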
