We introduce SONAR, a new multilingual and multimodal fixed-size sentence embedding space. Our single text encoder, covering 200 languages, substantially outperforms existing sentence embeddings such as LASER3 and LaBSE on the xsim and xsim++ multilingual similarity search tasks. Speech segments can be embedded in the same SONAR embedding space using language-specific speech encoders trained in a teacher-student setting on speech transcription data. Our speech encoders outperform existing speech encoders on similarity search tasks. We also provide a text decoder for 200 languages, which allows us to perform text-to-text and speech-to-text machine translation, including for zero-shot language and modality combinations. Our text-to-text results are competitive with the state-of-the-art NLLB 1B model, despite the fixed-size bottleneck representation. Our zero-shot speech-to-text translation results compare favorably with strong supervised baselines such as Whisper.
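To make the similarity search evaluation concrete, the following is a minimal sketch of how an error rate over aligned sentence pairs could be computed in a shared embedding space. It assumes plain cosine similarity and toy 3-dimensional vectors; the scoring used by xsim and xsim++ may differ, and the function names `cosine` and `search_error_rate` are hypothetical, not part of any SONAR release.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def search_error_rate(source_vecs, target_vecs):
    """Fraction of source vectors whose nearest target is not the aligned one.

    Assumption: source_vecs[i] and target_vecs[i] embed the same sentence
    (in two languages, or as speech and text).
    """
    errors = 0
    for i, src in enumerate(source_vecs):
        # Index of the target vector most similar to this source vector.
        best = max(range(len(target_vecs)),
                   key=lambda j: cosine(src, target_vecs[j]))
        if best != i:
            errors += 1
    return errors / len(source_vecs)


# Toy embeddings standing in for encoder outputs (illustrative values only).
source = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
target = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.2, 0.1, 0.9]]

error_rate = search_error_rate(source, target)
print(f"error rate: {error_rate:.2f}")
```

A lower error rate means that aligned sentences sit closer together in the shared space, which is the property the similarity search comparisons in the abstract are measuring.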