The Agent Skill framework, now widely and officially supported by major players such as GitHub Copilot, LangChain, and OpenAI, performs especially well with proprietary models by improving context engineering, reducing hallucinations, and boosting task accuracy. Based on these observations, this work investigates whether the Agent Skill paradigm provides similar benefits to small language models (SLMs). This question matters in industrial settings where continuous reliance on public APIs is infeasible due to data-security and budget constraints, and where SLMs often show limited generalization in highly customized scenarios. This work introduces a formal mathematical definition of the Agent Skill process, followed by a systematic evaluation of language models of varying sizes across multiple use cases. The evaluation covers two open-source tasks and a real-world insurance-claims dataset. The results show that tiny models struggle with reliable skill selection, whereas moderately sized SLMs (approximately 12B-30B parameters) benefit substantially from the Agent Skill approach. Moreover, code-specialized variants at around 80B parameters achieve performance comparable to closed-source baselines while improving GPU efficiency. Collectively, these findings offer a comprehensive and nuanced characterization of the framework's capabilities and constraints, along with actionable insights for the effective deployment of Agent Skills in SLM-centered environments.