The Agent Skill framework, now widely and officially supported by major players such as GitHub Copilot, LangChain, and OpenAI, performs especially well with proprietary models by improving context engineering, reducing hallucinations, and boosting task accuracy. Motivated by these observations, this work investigates whether the Agent Skill paradigm provides similar benefits to small language models (SLMs). The question matters in industrial scenarios where continuous reliance on public APIs is infeasible due to data-security and budget constraints, and where SLMs often show limited generalization in highly customized settings. This work introduces a formal mathematical definition of the Agent Skill process, followed by a systematic evaluation of language models of varying sizes across multiple use cases. The evaluation covers two open-source tasks and a real-world insurance claims dataset. The results show that tiny models struggle with reliable skill selection, while moderately sized SLMs (approximately 12B-30B parameters) benefit substantially from the Agent Skill approach. Moreover, code-specialized variants at around 80B parameters achieve performance comparable to closed-source baselines while improving GPU efficiency. Collectively, these findings characterize the capabilities and constraints of the framework and offer actionable insights for deploying Agent Skills effectively in SLM-centered environments.