Assessing Language Model Deployment with Risk Cards

This paper introduces RiskCards, a framework for structured assessment and documentation of risks associated with an application of language models. As with all language, text generated by language models can be harmful, or used to bring about harm. Automating language generation adds both an element of scale and more subtle or emergent undesirable tendencies to the generated text. Prior work establishes a wide variety of language model harms to many different actors: existing taxonomies identify categories of harms posed by language models; benchmarks establish automated tests of these harms; and documentation standards for models, tasks and datasets encourage transparent reporting. However, there is no risk-centric framework for documenting the complexity of a landscape in which some risks are shared across models and contexts, while others are specific, and where certain conditions may be required for risks to manifest as harms. RiskCards address this methodological gap by providing a generic framework for assessing the use of a given language model in a given scenario. Each RiskCard makes clear the routes by which the risk can manifest as harm, its placement in harm taxonomies, and example prompt-output pairs. While RiskCards are designed to be open-source, dynamic and participatory, we present a "starter set" of RiskCards taken from a broad literature survey, each of which details a concrete risk presentation. Language model RiskCards initiate a community knowledge base which permits the mapping of risks and harms to a specific model or its application scenario, ultimately contributing to a better, safer and shared understanding of the risk landscape.
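As an illustration only, the three elements each RiskCard is said to contain (routes to harm, placement in harm taxonomies, and example prompt-output pairs) could be represented as a simple record. The sketch below is not the authors' format: every class, field, and function name is a hypothetical choice made for this example, and the card contents are placeholders rather than entries from the starter set.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PromptOutputPair:
    """One illustrative prompt and the model output it produced."""
    prompt: str
    output: str


@dataclass
class RiskCard:
    """Hypothetical record for a single documented risk."""
    title: str
    description: str
    # Routes by which the risk could manifest as harm (free text).
    harm_routes: list[str] = field(default_factory=list)
    # Taxonomy name -> category within that taxonomy.
    taxonomy_placements: dict[str, str] = field(default_factory=dict)
    # Example prompt-output pairs demonstrating the risk.
    examples: list[PromptOutputPair] = field(default_factory=list)

    def is_complete(self) -> bool:
        """True when all three documented elements are present."""
        return bool(self.harm_routes and self.taxonomy_placements and self.examples)


def cards_for_scenario(cards: list[RiskCard], relevant_titles: set[str]) -> list[RiskCard]:
    """Select the cards an assessor judged relevant to one model or scenario."""
    return [card for card in cards if card.title in relevant_titles]


# Placeholder card: the strings stand in for real content.
card = RiskCard(
    title="example-risk",
    description="<short description of the risk>",
    harm_routes=["<who is harmed, and how>"],
    taxonomy_placements={"<taxonomy name>": "<category>"},
    examples=[PromptOutputPair(prompt="<prompt text>", output="<model output>")],
)
draft = RiskCard(title="draft-risk", description="<not yet documented>")

selected = cards_for_scenario([card, draft], {"example-risk"})
print([c.title for c in selected], card.is_complete(), draft.is_complete())
```

Keeping the assessment step as a plain selection over a shared list mirrors the idea that some risks recur across models and contexts while others apply only to a specific scenario.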
