An In-depth Look at Gemini's Language Abilities

The recently released Google Gemini class of models is the first to comprehensively report results that rival the OpenAI GPT series across a wide variety of tasks. In this paper, we conduct an in-depth exploration of Gemini's language abilities, making two contributions. First, we provide a third-party, objective comparison of the abilities of the OpenAI GPT and Google Gemini models with reproducible code and fully transparent results. Second, we take a closer look at the results, identifying areas where one of the two model classes excels. We perform this analysis over 10 datasets testing a variety of language abilities, including reasoning, answering knowledge-based questions, solving math problems, translating between languages, generating code, and acting as instruction-following agents. From this analysis, we find that Gemini Pro achieves accuracy that is close to, but slightly lower than, that of the corresponding GPT 3.5 Turbo on all tasks that we benchmarked. We further provide explanations for some of this under-performance, including failures in mathematical reasoning with many digits, sensitivity to multiple-choice answer ordering, and aggressive content filtering, among others. We also identify areas where Gemini demonstrates comparably high performance, including generation into non-English languages and handling longer and more complex reasoning chains. Code and data for reproduction can be found at https://github.com/neulab/gemini-benchmark

Gemini

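On December 6, 2023, Google CEO Sundar Pichai announced the official launch of Gemini 1.0. Gemini is a natively multimodal large model, presented as the first step into a new era of Google's models. It comes in three sizes: Gemini Ultra, the most capable; Gemini Pro, suited to a wide range of tasks; and Gemini Nano, intended for specific tasks and on-device use.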
