In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; and (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state of the art in long-document QA, long-video QA, and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals to complete their tasks, achieving 26% to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier: when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.