We present ConvoCache, a conversational caching system that addresses the slow and expensive generative AI models used in spoken chatbots. ConvoCache finds a semantically similar prompt from past dialogue and reuses its response. In this paper we evaluate ConvoCache on the DailyDialog dataset. We find that ConvoCache can apply a UniEval coherence threshold of 90% and respond to 89% of prompts from the cache with an average latency of 214 ms, replacing LLM generation and voice synthesis that can take over 1 s. To further reduce latency we test prefetching and find limited usefulness: prefetching with 80% of a request leads to a 63% hit rate and a drop in overall coherence. ConvoCache can be used with any chatbot to reduce costs by cutting generative AI usage by up to 89%.
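The core idea described above can be sketched as a semantic cache: embed each incoming prompt, search stored entries for the most similar past prompt, and reuse its response when the similarity clears a threshold, otherwise fall back to generation. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: it uses a toy bag-of-words embedding and cosine similarity (a real system would use a neural sentence encoder and a UniEval-style coherence check), and the class and method names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a
    # neural sentence encoder rather than word counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Standard cosine similarity between sparse count vectors.
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class SemanticCacheSketch:
    """Hypothetical sketch of a ConvoCache-style lookup, not the paper's code."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries = []  # list of (prompt embedding, cached response)

    def store(self, prompt: str, response: str) -> None:
        # Record a generated response so similar future prompts can reuse it.
        self.entries.append((embed(prompt), response))

    def lookup(self, prompt: str):
        # Return the cached response for the most similar past prompt,
        # or None (cache miss) so the caller falls back to LLM + TTS.
        emb = embed(prompt)
        best_response, best_sim = None, 0.0
        for cached_emb, response in self.entries:
            sim = cosine(emb, cached_emb)
            if sim > best_sim:
                best_response, best_sim = response, sim
        if best_response is not None and best_sim >= self.threshold:
            return best_response  # cache hit: skip generation entirely
        return None
```

On a miss the chatbot would call its generative pipeline as usual and `store` the result, so the hit rate grows as the dialogue history accumulates.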