Speakers of under-represented languages face both a language barrier, as most online knowledge is in a few dominant languages, and a modality barrier, since information is largely text-based while many languages are primarily oral. We address this for French-Wolof by training the first bilingual speech-text Matryoshka embedding model, enabling efficient retrieval of French text from Wolof speech queries without relying on a costly ASR-translation pipeline. We introduce large-scale data curation pipelines and new benchmarks, compare modeling strategies, and show that modality fusion within a frozen text Matryoshka model performs best. Although trained only for retrieval, the model generalizes well to other tasks, such as speech intent detection, indicating that it learns general semantic representations. Finally, we analyze cost-accuracy trade-offs across Matryoshka dimensions and ranks, showing that information is concentrated in only a few components, which suggests further room for efficiency improvements.
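The cost-accuracy trade-off mentioned above rests on a core property of Matryoshka embeddings: any prefix of the embedding vector is itself a usable (coarser) embedding, so retrieval can run over truncated vectors at lower cost. The sketch below illustrates that mechanism with random placeholder vectors standing in for the model's embeddings; the model, corpus, and dimensions are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for embeddings from a hypothetical bilingual speech-text
# Matryoshka model: rows of `corpus` play the role of French text embeddings,
# `query` the role of a Wolof speech query embedding. Real Matryoshka models
# concentrate information in the leading components; random vectors do not.
full_dim = 256
corpus = rng.standard_normal((1000, full_dim))
query = rng.standard_normal(full_dim)

def retrieve(query_vec, corpus_mat, dim):
    """Rank corpus items by cosine similarity using only the first
    `dim` components (the Matryoshka prefix) of each embedding."""
    q = query_vec[:dim]
    q = q / np.linalg.norm(q)
    c = corpus_mat[:, :dim]
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    return np.argsort(-(c @ q))  # indices, best match first

# Cost-accuracy trade-off: how much of the full-dimension top-10
# survives when retrieval uses a truncated prefix.
full_top10 = set(retrieve(query, corpus, full_dim)[:10].tolist())
for dim in (32, 64, 128):
    trunc_top10 = set(retrieve(query, corpus, dim)[:10].tolist())
    overlap = len(full_top10 & trunc_top10)
    print(f"dim={dim:3d}  top-10 overlap with full ranking: {overlap}/10")
```

Truncating to `dim` components cuts both storage and dot-product cost roughly by `full_dim / dim`, which is why concentrating information in a few leading components translates directly into efficiency gains.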