We present SimpleQA, a benchmark that evaluates the ability of language models to answer short, fact-seeking questions. We prioritized two properties in designing this eval. First, SimpleQA is challenging, as it is adversarially collected against GPT-4 responses. Second, responses are easy to grade, because questions are created such that there exists only a single, indisputable answer. Each answer in SimpleQA is graded as either correct, incorrect, or not attempted. A model with ideal behavior would get as many questions correct as possible while not attempting the questions for which it is not confident it knows the correct answer. SimpleQA is a simple, targeted evaluation for whether models "know what they know," and our hope is that this benchmark will remain relevant for the next few generations of frontier models. SimpleQA can be found at https://github.com/openai/simple-evals.