Multilingual large language models achieve impressive cross-lingual performance despite largely monolingual pretraining. Bilingual data in pretraining corpora is widely believed to enable these abilities, but the details of its contribution remain unclear. We investigate this question by pretraining models from scratch under controlled conditions, comparing a standard web corpus with a monolingual-only version from which all multilingual documents have been removed. Although bilingual data constitutes only 2% of the corpus, removing it causes translation performance to drop by 56% in BLEU, while performance on cross-lingual QA and general reasoning tasks remains stable, with training curves largely overlapping the baseline. To understand this asymmetry, we categorize bilingual data into parallel (14%), code-switching (72%), and miscellaneous (14%) documents based on the semantic relevance of content across languages. We then conduct granular ablations by reintroducing parallel or code-switching data into the monolingual-only corpus. Our experiments show that parallel data almost fully restores translation performance (91% of the unfiltered baseline), whereas code-switching data contributes minimally; other cross-lingual tasks remain largely unaffected by either type. These findings indicate that translation critically depends on the systematic token-level alignments provided by parallel data, whereas cross-lingual understanding and reasoning appear achievable even without bilingual data.