Cultural AI benchmarks often rely on implicit assumptions about measured constructs, leading to vague formulations with poor validity and unclear interrelations. We propose exposing these assumptions using explicit cognitive models formulated as Structural Equation Models. Using cross-lingual alignment transfer as an example, we show how this approach can answer key research questions and identify missing datasets. This framework grounds benchmark construction theoretically and guides dataset development to improve construct measurement. By embracing transparency, we move towards more rigorous, cumulative AI evaluation science, challenging researchers to critically examine their assessment foundations.
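The idea of making measurement assumptions explicit can be sketched in miniature. The snippet below is purely illustrative and not the paper's actual model: it posits a hypothetical latent "cultural alignment" construct measured by noisy benchmark scores in a source and a target language, then estimates a structural path (the cross-lingual transfer coefficient) with a simple OLS stand-in for a full SEM fit. All variable names, loadings, and noise levels are assumptions made up for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical latent construct: a model's "cultural alignment"
# (not directly observable; simulated here for illustration).
alignment = rng.normal(size=n)

# Observed indicators: benchmark scores in a source and a target language,
# each a noisy reflection of the latent construct (loadings are invented).
score_src = 0.8 * alignment + rng.normal(scale=0.3, size=n)
score_tgt = 0.6 * alignment + rng.normal(scale=0.4, size=n)

# Structural path: estimate the cross-lingual transfer coefficient by
# regressing the target-language score on the source-language score.
beta = np.cov(score_src, score_tgt)[0, 1] / np.var(score_src)
print(f"estimated transfer coefficient: {beta:.2f}")
```

Writing the model down this way forces each assumption (which construct each score measures, how the languages relate) into the open, which is the transparency the abstract argues for; a real analysis would fit the full latent-variable model with dedicated SEM software rather than a single regression.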