Evaluating generative AI (GenAI) systems is challenging because many targets of evaluation are broad, contested concepts such as "reasoning," "fairness," or "creativity." When these concepts are left underspecified, it becomes unclear what should be measured or how evaluation results should be interpreted. This problem reflects a missing step: systematization, that is, moving from a broad background concept to an explicit, structured account of the concept in measurable terms. Because systematization is cognitively demanding and resource-intensive, we investigate whether AI assistance can support this process. To enable AI-assisted systematization and assess its quality, we introduce a concept spec, a structured representation of a systematized concept, together with a validation worksheet. We then develop two AI-assisted systematizers: a direct, zero-shot approach and a multi-agent approach that more closely mirrors manual systematization approaches in the existing literature. We use these systematizers to produce concept specs for two concepts, hate-based rhetoric and digital empathy, and evaluate the resulting concept specs on content validity and information recoverability.