This paper presents a conceptual and operational framework for developing and operating safe and trustworthy AI agents, based on a Three-Pillar Model grounded in transparency, accountability, and trustworthiness. Building on prior work in human-in-the-loop systems, reinforcement learning, and collaborative AI, the framework defines an evolutionary path toward autonomous agents that balances increasing automation with appropriate human oversight. The paper argues that safe agent autonomy must be achieved through progressive validation, analogous to the staged development of autonomous driving, rather than through immediate full automation. Transparency and accountability are identified as foundational requirements both for establishing user trust and for mitigating known risks in generative AI systems, including hallucinations, data bias, and goal misalignment (e.g., the inversion problem). The paper further describes three ongoing work streams supporting this framework: public deliberation on AI agents conducted by the Stanford Deliberative Democracy Lab, cross-industry collaboration through the Safe AI Agent Consortium, and the development of open tooling for an agent operating environment aligned with the Three-Pillar Model. Together, these contributions provide conceptual clarity and practical guidance for the responsible evolution of AI agents that operate transparently, remain aligned with human values, and sustain societal trust.