Large language models (LLMs) are widely used to tackle complex tasks with autonomous workflows. Recently, reusable natural language skills have emerged as a popular paradigm for injecting procedural knowledge into LLM applications. Since popular skills are often invoked repeatedly, placing their full text in every context significantly increases prefill cost and latency. While text compression techniques have the potential to solve this problem, most existing methods are designed to compress factual knowledge in documents rather than procedural knowledge, making them insufficient for skill compression. In this paper, we argue that an effective skill compression method should: 1) preserve logical dependencies among workflows and tool protocols, 2) enable lightweight, offline compression for frequently updated community skills, and 3) be adaptable to varying complexities across skills. To meet these requirements, we present SKIM (SKIll coMpression), an adaptive multi-resolution soft token compression framework for procedural skills. Depending on the complexity of each skill, SKIM creates a different number of soft tokens, improving the efficiency of LLM inference while preserving the effectiveness of skill usage. Experiments indicate that SKIM compresses skills to 30 to 60 percent of their original token length while preserving task performance better than existing compression methods. We have released our code at https://github.com/bebr2/SKIM.