CLI-Anything: Towards Agent-Native Computer Use

As large language models advance in reasoning and tool use, researchers increasingly seek to build computer use agents that can operate existing software. The dominant approach develops GUI agents that control applications through visual interfaces: interpreting screenshots, locating UI elements, and executing mouse clicks to mimic human interaction. This GUI-centric paradigm is fundamentally misaligned with agent capabilities. Current GUI agents struggle with brittle pixel-level interactions, timing dependencies, and coordinate-based actions that break when the interface changes. The paradigm forces agents to emulate human perceptual limitations rather than leverage their computational strengths in structured data processing and programmatic control. CLI-Anything argues for agent-native computer use design. Instead of forcing agents to navigate visual layouts, we create interfaces aligned with how agents naturally operate: through structured commands, explicit state representations, and deterministic feedback. We transform existing applications into command-line harnesses that preserve functionality while exposing machine-readable protocols optimized for AI-native interaction, which eliminates the lossy visual-to-computational translation that plagues GUI agents. Rather than building sophisticated screen readers and click simulators, we should redesign interaction paradigms around agent strengths: precise programmatic control and deterministic execution. We examine the methodology, architecture, evidence, and future directions for this agent-native transformation of computer use. We have also built CLI-Hub, a platform that operationalizes this vision by providing the corresponding methodology, architecture, and infrastructure.
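
To make the harness idea concrete, the following is a minimal sketch of what a command-line harness with structured commands, explicit state, and deterministic feedback could look like. It is not the CLI-Anything or CLI-Hub implementation: the program name `demo-harness`, the `add-layer`, `remove-layer`, and `status` commands, the in-memory "layers" state, and the JSON result fields are all hypothetical, chosen only for illustration.

```python
import argparse
import json


def new_state():
    # Hypothetical in-memory application state. A real harness would call
    # into the wrapped application's own API or scripting layer instead.
    return {"layers": [], "revision": 0}


def apply_command(state, command, name=None):
    """Apply one structured command and return a machine-readable result."""
    if command == "add-layer":
        if name in state["layers"]:
            return {"ok": False, "error": f"layer exists: {name}", "state": state}
        state["layers"].append(name)
        state["revision"] += 1
    elif command == "remove-layer":
        if name not in state["layers"]:
            return {"ok": False, "error": f"no such layer: {name}", "state": state}
        state["layers"].remove(name)
        state["revision"] += 1
    elif command != "status":
        return {"ok": False, "error": f"unknown command: {command}", "state": state}
    # Every successful command reports the full state explicitly.
    return {"ok": True, "error": None, "state": state}


def build_parser():
    # Each operation is a named subcommand with typed arguments, so the agent
    # never has to locate anything on a screen.
    parser = argparse.ArgumentParser(prog="demo-harness")
    sub = parser.add_subparsers(dest="command", required=True)
    for cmd in ("add-layer", "remove-layer"):
        sub.add_parser(cmd).add_argument("name")
    sub.add_parser("status")
    return parser


def run(argv, state):
    """Parse argv, apply the command, and return the result as a JSON string."""
    args = build_parser().parse_args(argv)
    result = apply_command(state, args.command, getattr(args, "name", None))
    # sort_keys keeps the serialized output stable across runs.
    return json.dumps(result, sort_keys=True)


if __name__ == "__main__":
    state = new_state()
    print(run(["add-layer", "background"], state))
    print(run(["status"], state))
```

In this sketch the agent receives the complete state and an explicit error field after every command, so it can verify the effect of an action by reading JSON rather than by re-inspecting a screenshot.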
