Cache replacement remains a challenging problem in CPU microarchitecture and is often addressed with hand-crafted heuristics that limit cache performance. Analyzing cache behavior requires parsing millions of trace entries with manual filtering, making the process slow and non-interactive. To address this, we introduce CacheMind, a conversational tool that uses Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs) to enable semantic reasoning over cache traces. For the first time, architects can ask natural-language questions such as "Why is the memory access associated with PC X causing more evictions?" and receive trace-grounded, human-readable answers linked to program semantics. To evaluate CacheMind, we present CacheMindBench, the first verified benchmark suite for LLM-based reasoning about the cache replacement problem. With the SIEVE retriever, CacheMind achieves 66.67% accuracy on 75 unseen trace-grounded questions and 84.80% on 25 unseen policy-specific reasoning tasks; with RANGER, it achieves 89.33% and 64.80% on the same evaluations. Moreover, with RANGER, CacheMind reaches 100% accuracy on 4 of the 6 categories in the trace-grounded tier of CacheMindBench. Compared to LlamaIndex (10% retrieval success), SIEVE achieves 60% and RANGER 90%, demonstrating that off-the-shelf RAG pipelines are insufficient for precise, trace-grounded microarchitectural reasoning. Finally, we present four concrete, actionable insights derived using CacheMind: a bypassing use case improves cache hit rate by 7.66% and speedup by 2.04%, a software-fix use case yields a 76% speedup, and a Mockingjay replacement-policy use case yields a 0.7% speedup, demonstrating CacheMind's utility on non-trivial queries that require a natural-language interface.