Language model deployments in consumer-facing applications introduce numerous risks. While existing research on the harms and hazards of such applications follows top-down approaches derived from regulatory frameworks and theoretical analyses, empirical evidence of real-world failure modes remains scarce. In this work, we introduce RealHarm, a dataset of annotated problematic interactions with AI agents, built from a systematic review of publicly reported incidents. Analyzing harms, causes, and hazards specifically from the deployer's perspective, we find that reputational damage constitutes the predominant organizational harm, while misinformation emerges as the most common hazard category. We empirically evaluate state-of-the-art guardrails and content moderation systems to probe whether such systems would have prevented the incidents, revealing a significant gap in the protection of AI applications.