Cost and Accuracy of Long-Term Memory in Distributed Multi-Agent Systems Based on Large Language Models

Long-term memory (LTM) is fundamental to large language model (LLM)-based agents in the emerging Internet of Agents (IoA), where distributed multi-agent systems (DMAS) span cloud and edge networks. Existing evaluations are typically published by framework providers and focus on token usage and latency, rarely accounting for system-level cost or deployment in DMAS. These gaps are addressed with an independent, reproducible testbed that evaluates accuracy, latency, CPU time, peak RAM, disk I/O, and network usage in a simulated cloud-edge environment. Three venture-capital-funded frameworks spanning vector, graph, and hybrid architectures, namely mem0, Graphiti, and cognee, are compared alongside retrieval-augmented generation (RAG) and full-context baselines on the LoCoMo benchmark under unconstrained and constrained network scenarios. Two clusters emerge: mem0, RAG, and full-context reach 77% to 81% accuracy, while Graphiti and cognee reach only 55% to 56%, a gap driven by retrieval incompleteness rather than reasoning failure. The RAG baseline matches the accuracy of the upper cluster at a total cost of ownership (TCO) 8.4 times lower than that of mem0, and these two are the only non-dominated backends on the Pareto frontier. Latency constraints, bandwidth constraints, and jitter leave retrieval quality unchanged for every backend, while vector-based LTM incurs a modest latency penalty of 4% to 5% under edge-cloud constraints. Compression precision rather than context volume determines LTM accuracy, as full-context forwarding underperforms mem0 despite supplying the entire conversation for each question.
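The notion of a non-dominated backend on an accuracy-versus-cost Pareto frontier can be illustrated with a minimal sketch. It assumes two objectives only (accuracy to maximize, cost to minimize); the `Backend` type, the helper names, and all numeric values are hypothetical placeholders chosen for illustration, not measurements from this study.

```python
from typing import List, NamedTuple


class Backend(NamedTuple):
    """Hypothetical record for one memory backend (illustrative only)."""
    name: str
    accuracy: float  # fraction of correct answers, higher is better
    cost: float      # total cost in arbitrary units, lower is better


def dominates(a: Backend, b: Backend) -> bool:
    """Return True if `a` is at least as good as `b` on both objectives
    and strictly better on at least one of them."""
    no_worse = a.accuracy >= b.accuracy and a.cost <= b.cost
    strictly_better = a.accuracy > b.accuracy or a.cost < b.cost
    return no_worse and strictly_better


def pareto_front(backends: List[Backend]) -> List[Backend]:
    """Keep every backend that no other backend dominates."""
    return [
        b for b in backends
        if not any(dominates(other, b) for other in backends)
    ]


# Placeholder values, NOT results from the paper.
candidates = [
    Backend("A", accuracy=0.80, cost=10.0),
    Backend("B", accuracy=0.78, cost=1.0),
    Backend("C", accuracy=0.77, cost=12.0),  # dominated by A
    Backend("D", accuracy=0.55, cost=5.0),   # dominated by B
]

front = pareto_front(candidates)
print([b.name for b in front])  # prints ['A', 'B']
```

In this toy setup, A stays on the frontier because it has the highest accuracy and B because it has the lowest cost; neither is better than the other on both objectives at once, which is the sense in which two backends can both be non-dominated.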