As large language models (LLMs) rapidly advance, performance on high-resource languages (e.g., English, Chinese) is nearing saturation, yet remains substantially lower for low-resource languages (e.g., Urdu, Thai) due to limited training data, machine-translation noise, and unstable cross-lingual alignment. We introduce LiRA (Linguistic Robust Anchoring for Large Language Models), a training framework that robustly improves cross-lingual representations under low-resource conditions while jointly strengthening retrieval and reasoning. LiRA comprises two modules: (i) Arca (Anchored Representation Composition Architecture), which anchors low-resource languages to an English semantic space via anchor-based alignment and multi-agent collaborative encoding, preserving geometric stability in a shared embedding space; and (ii) LaSR (Language-coupled Semantic Reasoner), which adds a language-aware lightweight reasoning head with consistency regularization on top of Arca's multilingual representations, unifying the training objective to enhance cross-lingual understanding, retrieval, and reasoning robustness. We further construct and release a multilingual product retrieval dataset covering five Southeast Asian and two South Asian languages. Experiments across low-resource benchmarks (cross-lingual retrieval, semantic similarity, and reasoning) show consistent gains and robustness under few-shot and noise-amplified settings; ablations validate the contribution of both Arca and LaSR. Code will be released on GitHub and the dataset on Hugging Face.
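As a rough illustration of the two objectives the abstract names, the sketch below pairs an anchor-based alignment term (pulling low-resource embeddings toward their English anchors in the shared space) with a consistency-regularization term (a symmetric KL between reasoning-head distributions across languages). This is a minimal toy sketch, not LiRA's actual implementation; the function names, the cosine/KL choices, and the weighting `lam` are all assumptions for illustration.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def alignment_loss(low_res_embs, anchor_embs):
    # Anchor-based alignment: mean (1 - cosine) between each low-resource
    # embedding and its paired English anchor embedding.
    return float(np.mean([1.0 - cosine(x, a)
                          for x, a in zip(low_res_embs, anchor_embs)]))

def kl(p, q, eps=1e-9):
    # KL divergence with smoothing to avoid log(0).
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def consistency_loss(p_src, p_tgt):
    # Consistency regularization: symmetric KL between the reasoning head's
    # output distributions for the same input in two languages.
    return 0.5 * (kl(p_src, p_tgt) + kl(p_tgt, p_src))

def joint_loss(low_res_embs, anchor_embs, p_src, p_tgt, lam=0.5):
    # Unified objective: alignment plus weighted consistency (lam is a
    # hypothetical trade-off hyperparameter, not from the paper).
    return alignment_loss(low_res_embs, anchor_embs) + lam * consistency_loss(p_src, p_tgt)
```

When the low-resource embeddings coincide with their anchors and the two output distributions match, both terms vanish, so the joint loss is zero; training would push real representations toward that regime.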