In this report, we present the latest model of the Gemini family, Gemini 1.5 Pro, a highly compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. Gemini 1.5 Pro achieves near-perfect recall on long-context retrieval tasks across modalities, improves the state of the art in long-document QA, long-video QA, and long-context ASR, and matches or surpasses Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5 Pro's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 2.1 (200k tokens) and GPT-4 Turbo (128k tokens). Finally, we highlight surprising new capabilities of large language models at the frontier: when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.