In this report, we pose the following question: Which is the most intelligent AI model to date, as measured by OlympicArena (an Olympic-level, multi-discipline, multi-modal benchmark for superintelligent AI)? We focus specifically on the most recently released models: Claude-3.5-Sonnet, Gemini-1.5-Pro, and GPT-4o. For the first time, we propose using an Olympic medal table approach to rank AI models by their comprehensive performance across disciplines. Empirical results reveal that: (1) Claude-3.5-Sonnet is highly competitive with GPT-4o in overall performance, and even surpasses GPT-4o on a few subjects (namely Physics, Chemistry, and Biology). (2) Gemini-1.5-Pro and GPT-4V rank consecutively just behind GPT-4o and Claude-3.5-Sonnet, but with a clear performance gap between them. (3) AI models from the open-source community lag significantly behind these proprietary models. (4) The performance of all these models on this benchmark remains unsatisfactory, indicating that we still have a long way to go before achieving superintelligence. We remain committed to continuously tracking and evaluating the latest powerful models on this benchmark (available at https://github.com/GAIR-NLP/OlympicArena).
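The medal-table ranking can be sketched as follows: for each discipline, the top three models receive gold, silver, and bronze medals, and models are then ordered lexicographically by their medal counts, as in an Olympic medal table. This is a minimal illustrative sketch, assuming hypothetical per-discipline accuracy scores; the model names and numbers below are placeholders, not the paper's actual data.

```python
# A minimal sketch of an Olympic medal-table ranking.
# Scores below are hypothetical placeholders, not OlympicArena results.
from collections import Counter

def medal_table(scores):
    """Rank models Olympic-style: award gold/silver/bronze per
    discipline, then sort by (golds, silvers, bronzes) descending."""
    medals = {model: Counter() for model in next(iter(scores.values()))}
    for discipline, results in scores.items():
        ranked = sorted(results, key=results.get, reverse=True)
        for medal, model in zip(("gold", "silver", "bronze"), ranked):
            medals[model][medal] += 1
    key = lambda m: (medals[m]["gold"], medals[m]["silver"], medals[m]["bronze"])
    return sorted(medals, key=key, reverse=True)

# Hypothetical accuracy per discipline for three placeholder models.
scores = {
    "Math":      {"A": 0.41, "B": 0.39, "C": 0.30},
    "Physics":   {"A": 0.28, "B": 0.33, "C": 0.25},
    "Chemistry": {"A": 0.35, "B": 0.37, "C": 0.31},
}
print(medal_table(scores))  # → ['B', 'A', 'C']: B wins 2 golds, A wins 1
```

Lexicographic ordering by (gold, silver, bronze) mirrors the convention of real Olympic medal tables: a model with more golds outranks one with more total medals.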