Industry practitioners and academic researchers regularly use multi-agent systems to accelerate their work, but the applications through which users operate these systems provide no simple, unified mechanism for managing critical components of the agent harness at scale. This lack of control both degrades the quality of individual human-agent interactions and reduces the capacity of practitioners to coordinate context engineering efforts. The behavioral specifications that define what agents in such systems can do remain fragmented across prose instruction files, for which compliance cannot be guaranteed, and framework-internal configurations, making these specifications difficult to share, version, or collaboratively maintain across teams and projects. Applying the ALARA principle from radiation safety (exposures kept as low as reasonably achievable) to context, we introduce a context-agent-tool (CAT) data layer expressed through interrelated plain-text files, allowing users to declare tool access directly for each agent and to modify the tools that the agents use. We demonstrate that this CAT data layer supports real agentic usage with \texttt{npcsh}, a command-line shell that loads the team and executes agent runs, by evaluating 22 locally hosted models from 0.6B to 35B parameters on 115 practical tasks spanning file operations, web search, multi-step scripting, tool chaining, and multi-agent delegation. Across $\sim$2500 total executions, we characterize which model families succeed in which task categories and where they break down.