AI systems are increasingly deployed in real-world settings where their behavior is shaped by dynamic environments, evolving data distributions, and complex interactions with users and infrastructure. Traditional machine learning evaluation focuses on benchmarks and operates within sandboxed environments, providing only a limited view of the true system behavior in the wild. We argue for the development of principled auditing frameworks that monitor deployed AI systems throughout their lifecycle. We further propose framing auditing as a statistical problem of monitoring constraint violations under uncertainty, where desired properties (e.g., fairness and safety) are treated as risk-controlled constraints that must be continuously evaluated as systems evolve through iterative feedback. This perspective highlights the need for uncertainty-aware monitoring methods, socio-technical specifications of audit criteria, and auditing infrastructures that enable ongoing oversight of AI systems in the wild.
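One way to make the statistical framing concrete is the following minimal sketch. It rests on assumptions that go beyond the text above: each audited interaction yields a bounded loss in [0, 1] (for example, 1 if a safety or fairness check fails and 0 otherwise), losses within an audit window are treated as i.i.d., and the constraint is "expected loss at most alpha". The names `hoeffding_radius`, `audit_window`, `alpha`, and `delta` are hypothetical and chosen for illustration only. The sketch uses a fixed-sample Hoeffding interval, so it covers a single audit window. Truly continuous monitoring with repeated looks at the data would need an anytime-valid bound or a correction for multiple testing.

```python
import math
from typing import Sequence


def hoeffding_radius(n: int, delta: float) -> float:
    """Half-width of a two-sided Hoeffding interval for the mean of
    n i.i.d. values in [0, 1], valid with probability at least 1 - delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def audit_window(losses: Sequence[float], alpha: float, delta: float = 0.05) -> str:
    """Audit one window of observed losses against the constraint
    'expected loss <= alpha' (illustrative assumption, see lead-in).

    Returns:
      'violation'    if the lower confidence bound exceeds alpha,
      'compliant'    if the upper confidence bound is at most alpha,
      'inconclusive' if the interval straddles alpha or there is no data.
    """
    n = len(losses)
    if n == 0:
        return "inconclusive"
    mean = sum(losses) / n
    radius = hoeffding_radius(n, delta)
    if mean - radius > alpha:
        return "violation"
    if mean + radius <= alpha:
        return "compliant"
    return "inconclusive"


if __name__ == "__main__":
    # Hypothetical audit window: 1000 interactions, 20 of which failed the check.
    window = [1.0] * 20 + [0.0] * 980
    print(audit_window(window, alpha=0.1))
```

The three-way outcome is a deliberate design choice in this sketch: under uncertainty, an audit can report that the evidence is insufficient instead of forcing a pass or fail decision, which is one reading of what uncertainty-aware monitoring asks for.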