Few principles or guidelines exist to ensure that evaluations of generative AI (GenAI) models and systems are effective. To help address this gap, we propose a set of general dimensions that capture critical choices involved in GenAI evaluation design. These dimensions are the evaluation setting, the task type, the input source, the interaction style, the duration, the metric type, and the scoring method. By situating GenAI evaluations within these dimensions, we aim to guide decision-making during GenAI evaluation design and to provide a structure for comparing different evaluations. We illustrate the utility of the proposed dimensions with two cases: a hypothetical evaluation of the fairness of a GenAI system, and a set of three real-world GenAI evaluations concerning biological threats.
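To make the proposed structure concrete, the seven dimensions could be recorded as fields of a small data structure, so that any two evaluations become directly comparable field by field. The class and all field values below are illustrative assumptions, not taken from the paper (a minimal Python sketch):

```python
from dataclasses import dataclass, fields

# Hypothetical record of one GenAI evaluation, with one field per
# dimension named in the abstract. Field values are free-form strings
# here; a real instrument would likely constrain them to taxonomies.
@dataclass(frozen=True)
class GenAIEvaluation:
    setting: str            # evaluation setting (e.g., controlled study vs. deployment)
    task_type: str          # what the model/system is asked to do
    input_source: str       # where evaluation inputs come from
    interaction_style: str  # e.g., single-turn vs. multi-turn
    duration: str           # how long the evaluation runs
    metric_type: str        # what kind of quantity is measured
    scoring_method: str     # how outputs are scored (e.g., automated, human)

def differing_dimensions(a: GenAIEvaluation, b: GenAIEvaluation) -> list[str]:
    """List the dimensions on which two evaluations differ."""
    return [f.name for f in fields(GenAIEvaluation)
            if getattr(a, f.name) != getattr(b, f.name)]

# Two illustrative (invented) evaluations compared along the dimensions.
eval_a = GenAIEvaluation("controlled study", "question answering",
                         "curated prompts", "single-turn", "one session",
                         "accuracy", "automated")
eval_b = GenAIEvaluation("controlled study", "question answering",
                         "curated prompts", "multi-turn", "one week",
                         "accuracy", "human-rated")
print(differing_dimensions(eval_a, eval_b))
```

Situating each evaluation in a shared schema like this is one way to operationalize the comparison the abstract describes: shared fields make agreements and divergences between evaluation designs explicit.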