Retrieving relevant past interactions from long-term conversational memory typically relies on large dense retrieval models (110M to 1.5B parameters) or LLM-augmented indexing. We introduce SelRoute, a framework that routes each query to one of four specialized retrieval pipelines (lexical, semantic, hybrid, or vocabulary-enriched) according to its query type. On LongMemEval_M (Wu et al., 2024), SelRoute achieves Recall@5 of 0.800 with bge-base-en-v1.5 (109M parameters) and 0.786 with bge-small-en-v1.5 (33M parameters), compared to 0.762 for Contriever with LLM-generated fact keys. A zero-ML baseline using SQLite FTS5 alone achieves NDCG@5 of 0.692, already exceeding all published baselines on ranking quality; we attribute part of this gap to implementation differences in lexical retrieval. Five-fold stratified cross-validation confirms routing stability (CV gap of 1.3 to 2.4 Recall@5 points; routes stable for 4 of 6 query types across folds). A regex-based query-type classifier achieves 83% effective routing accuracy, and end-to-end retrieval with predicted types (Recall@5 = 0.689) still outperforms uniform baselines. Cross-benchmark evaluation on 8 additional benchmarks spanning 62,000+ instances, including MSDialog, LoCoMo, QReCC, and PerLTQA, confirms generalization without benchmark-specific tuning, while exposing a clear failure mode on reasoning-intensive retrieval (RECOR Recall@5 = 0.149) that bounds the generalization claim. We also identify an enrichment-embedding asymmetry: vocabulary expansion at storage time improves lexical search but degrades embedding search, which motivates deciding on enrichment separately for each pipeline. The full system requires no GPU and no LLM inference at query time.
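As a rough illustration of the routing idea, the following minimal sketch classifies a query with regular expressions, looks up a pipeline for the predicted type, and runs a lexical search over an in-memory SQLite FTS5 index. The query-type labels, regex patterns, routing table, and function names are all hypothetical placeholders, not the taxonomy or rules used by SelRoute, and only the lexical pipeline is implemented here. It also assumes the local SQLite build has FTS5 enabled.

```python
import re
import sqlite3

# Hypothetical query types and patterns (illustrative only, not SelRoute's rules).
PATTERNS = [
    ("temporal", re.compile(r"\b(when|how long|what date|before|after)\b", re.I)),
    ("preference", re.compile(r"\b(recommend|suggest|favorite|prefer)\b", re.I)),
]
DEFAULT_TYPE = "factual"

# Hypothetical routing table: query type -> pipeline name.
ROUTES = {"temporal": "hybrid", "preference": "semantic", "factual": "lexical"}


def classify(query: str) -> str:
    """Return the first query type whose pattern matches, else the default."""
    for label, pattern in PATTERNS:
        if pattern.search(query):
            return label
    return DEFAULT_TYPE


def build_index(turns):
    """Store conversation turns in an in-memory FTS5 table (requires FTS5)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE memory USING fts5(text)")
    conn.executemany("INSERT INTO memory(text) VALUES (?)", [(t,) for t in turns])
    return conn


def lexical_search(conn, query, k=5):
    """BM25-ranked FTS5 search; tokens are quoted and joined with OR."""
    tokens = re.findall(r"\w+", query.lower())
    if not tokens:
        return []
    match = " OR ".join(f'"{t}"' for t in tokens)
    # bm25() returns lower values for better matches, so ascending order is best-first.
    return conn.execute(
        "SELECT rowid, text FROM memory WHERE memory MATCH ? "
        "ORDER BY bm25(memory) LIMIT ?",
        (match, k),
    ).fetchall()


# Only the lexical pipeline exists in this sketch; other routes fall back to it.
PIPELINES = {"lexical": lexical_search}


def route(conn, query, k=5):
    """Classify the query, pick its pipeline, and return (type, pipeline, hits)."""
    qtype = classify(query)
    pipeline = ROUTES[qtype]
    search = PIPELINES.get(pipeline, lexical_search)
    return qtype, pipeline, search(conn, query, k)


if __name__ == "__main__":
    conn = build_index([
        "I adopted a cat named Miso in March.",
        "My favorite pizza topping is mushrooms.",
        "We discussed the quarterly budget.",
    ])
    print(route(conn, "quarterly budget"))
```

In a fuller version, the `PIPELINES` table would also hold semantic, hybrid, and vocabulary-enriched search functions, so the classifier's output alone decides which retrieval path a query takes.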