We release the Pantagruel models, a new family of self-supervised encoder models for French text and speech. Instead of predicting modality-tailored targets such as textual tokens or speech units, Pantagruel learns contextualized target representations in the feature space, allowing modality-specific encoders to capture linguistic and acoustic regularities more effectively. Separate models are pre-trained on large-scale French corpora: Wikipedia, OSCAR, and CroissantLLM for text, and MultilingualLibriSpeech, LeBenchmark, and INA-100k for speech. INA-100k is a newly introduced 100,000-hour corpus of French audio derived from the archives of the Institut National de l'Audiovisuel (INA), the national repository of French radio and television broadcasts, providing highly diverse audio data. We evaluate Pantagruel on a broad range of downstream tasks spanning both modalities, including those from standard French benchmarks such as FLUE and LeBenchmark. Across these tasks, Pantagruel models show competitive or superior performance compared to strong French baselines such as CamemBERT, FlauBERT, and LeBenchmark2.0, while maintaining a shared architecture that seamlessly handles either speech or text inputs. These results confirm the effectiveness of feature-space self-supervised objectives for French representation learning and highlight Pantagruel as a robust foundation for multimodal speech-text understanding.
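To make the feature-space objective concrete, here is a minimal NumPy sketch of one training step in the style described above: a teacher encoder (an exponential moving average of the student) produces contextualized targets from the full input, and the student regresses toward those targets at masked positions. This is an illustrative toy, not the Pantagruel implementation; the linear "encoder", dimensions, masking scheme, and EMA rate are all assumptions chosen for brevity (a real model would use a Transformer encoder per modality).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: input feature size, model size, sequence length.
D_IN, D_MODEL, T = 8, 16, 10

def encode(x, w):
    # Stand-in for a full Transformer encoder: a single linear map.
    return x @ w  # shape (T, D_MODEL)

# Student and teacher share the same architecture; the teacher's weights
# are an exponential moving average (EMA) of the student's.
w_student = rng.normal(scale=0.1, size=(D_IN, D_MODEL))
w_teacher = w_student.copy()

def ema_update(w_teacher, w_student, tau=0.999):
    return tau * w_teacher + (1.0 - tau) * w_student

# One training step of feature-space target prediction.
x = rng.normal(size=(T, D_IN))       # a sequence of speech frames or token embeddings
mask = np.zeros(T, dtype=bool)
mask[::3] = True                     # positions the student must predict

targets = encode(x, w_teacher)       # teacher sees the unmasked input
x_masked = x.copy()
x_masked[mask] = 0.0                 # crude masking, for illustration only
preds = encode(x_masked, w_student)  # student sees the masked input

# Regression loss in feature space, computed only at masked positions;
# gradients (omitted here) would update w_student, then the teacher follows via EMA.
loss = np.mean((preds[mask] - targets[mask]) ** 2)
w_teacher = ema_update(w_teacher, w_student)
```

The key property this sketch shows is that the targets are continuous contextualized features rather than discrete tokens or speech units, so the same training recipe applies unchanged to either modality.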