Memory-augmented language agents rely on embedding models for effective memory retrieval. However, existing training-data construction overlooks a critical factor: the hierarchical difficulty of negative samples and their natural distribution in human-agent interactions. In practice, some negatives are semantically close distractors while others are trivially irrelevant, and natural dialogue exhibits structured proportions of these types. Current approaches that use synthetic or uniformly sampled negatives fail to reflect this diversity, limiting embedding models' ability to learn the nuanced discrimination essential for robust memory retrieval. In this work, we propose HiNS, a principled data-construction framework that explicitly models negative-sample difficulty tiers and incorporates empirically grounded negative ratios derived from conversational data, enabling the training of embedding models with substantially improved retrieval fidelity and generalization on memory-intensive tasks. Experiments show significant improvements: on LoCoMo, F1/BLEU-1 gains of 3.27%/3.30% (MemoryOS) and 1.95%/1.78% (Mem0); on PERSONAMEM, total-score improvements of 1.19% (MemoryOS) and 2.55% (Mem0).
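To make the idea of difficulty-tiered negative sampling concrete, the following is a minimal sketch, not the paper's actual HiNS implementation: it assumes candidate negatives have already been scored for similarity to the query (e.g., by an off-the-shelf embedding model), partitions them into hard/medium/easy tiers by hypothetical thresholds, and samples each tier according to a fixed ratio. The tier names, thresholds, and ratios here are illustrative placeholders, not values from the paper.

```python
import random

def sample_negatives(candidates, k=8, ratios=(0.5, 0.3, 0.2), thresholds=(0.7, 0.4)):
    """Sample k negatives across difficulty tiers.

    candidates: list of (text, similarity_to_query) pairs.
    ratios: target proportion of hard / medium / easy negatives
            (illustrative; HiNS derives its ratios empirically
            from conversational data).
    thresholds: similarity cut-offs separating the tiers (hypothetical).
    """
    # Partition candidates by similarity to the query.
    hard = [c for c in candidates if c[1] >= thresholds[0]]
    medium = [c for c in candidates if thresholds[1] <= c[1] < thresholds[0]]
    easy = [c for c in candidates if c[1] < thresholds[1]]

    # Draw from each tier in proportion to its target ratio.
    negatives = []
    for tier, ratio in zip((hard, medium, easy), ratios):
        n = min(round(k * ratio), len(tier))
        negatives.extend(random.sample(tier, n))
    return [text for text, _ in negatives]

# Toy example: scores stand in for embedding-model similarities.
cands = [("close distractor", 0.85), ("related topic", 0.55),
         ("irrelevant", 0.10), ("another distractor", 0.75),
         ("noise", 0.05), ("loosely related", 0.45)]
negs = sample_negatives(cands, k=4)
```

The resulting negatives would then be paired with the query and its positive memory to form contrastive training examples; the key design choice is that hard distractors are deliberately over-represented relative to uniform sampling, forcing the embedding model to learn fine-grained discrimination.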