We present Fanar 2.0, the second generation of Qatar's Arabic-centric Generative AI platform. Sovereignty is a first-class design principle: every component, from data pipelines to deployment infrastructure, was designed and operated entirely at QCRI, Hamad Bin Khalifa University. Fanar 2.0 is a story of resource-constrained excellence: the effort ran on 256 NVIDIA H100 GPUs, and Arabic comprises only ~0.5% of web data despite its 400 million native speakers. Fanar 2.0 adopts a disciplined strategy of data quality over quantity, targeted continual pre-training, and model merging to achieve substantial gains within these constraints. At its core is Fanar-27B, continually pre-trained from a Gemma-3-27B backbone on a curated corpus of 120 billion high-quality tokens spanning three data recipes. Despite using 8x fewer pre-training tokens than Fanar 1.0, it delivers substantial benchmark improvements: Arabic knowledge (+9.1 pts), language (+7.3 pts), dialects (+3.5 pts), and English capability (+7.6 pts). Beyond the core LLM, Fanar 2.0 introduces a rich stack of new capabilities. FanarGuard is a state-of-the-art 4B bilingual moderation filter for Arabic safety and cultural alignment. The Aura speech family gains a long-form ASR model for hours-long audio. The Oryx vision family adds Arabic-aware image and video understanding alongside culturally grounded image generation. An agentic tool-calling framework enables multi-step workflows. Fanar-Sadiq uses a multi-agent architecture for Islamic content. Fanar-Diwan provides classical Arabic poetry generation. FanarShaheen delivers LLM-powered bilingual translation. A redesigned multi-layer orchestrator coordinates all components through intent-aware routing and defense-in-depth safety validation. Taken together, Fanar 2.0 demonstrates that sovereign, resource-constrained AI development can produce systems competitive with those built at far greater scale.