The recently released Google Gemini class of models is the first to comprehensively report results that rival the OpenAI GPT series across a wide variety of tasks. In this paper, we conduct an in-depth exploration of Gemini's language abilities, making two contributions. First, we provide a third-party, objective comparison of the abilities of the OpenAI GPT and Google Gemini models, with reproducible code and fully transparent results. Second, we take a closer look at the results, identifying areas where one of the two model classes excels. We perform this analysis over 10 datasets that test a variety of language abilities, including reasoning, answering knowledge-based questions, solving math problems, translating between languages, generating code, and acting as instruction-following agents. From this analysis, we find that Gemini Pro achieves accuracy that is close to, but slightly below, that of the corresponding GPT 3.5 Turbo on all tasks that we benchmarked. We further provide explanations for some of this under-performance, including failures in mathematical reasoning with many digits, sensitivity to multiple-choice answer ordering, and aggressive content filtering, among others. We also identify areas where Gemini demonstrates comparably high performance, including generation into non-English languages and handling longer and more complex reasoning chains. Code and data for reproduction can be found at https://github.com/neulab/gemini-benchmark