Effective retrieval-augmented generation in bilingual Greek--English applications requires embedding models that capture both domain-specific semantic relationships and cross-lingual semantic alignment. Existing multilingual embedding models distribute their representational capacity across numerous languages, limiting their optimization for Greek and failing to encode the morphological complexity and domain-specific terminological structures inherent in Greek text. In this work, we propose ORPHEAS, a specialized Greek--English embedding model for bilingual retrieval-augmented generation. ORPHEAS is trained on a high-quality dataset generated by a knowledge-graph-based fine-tuning methodology applied to a diverse multi-domain corpus, enabling language-agnostic semantic representations. Numerical experiments across monolingual and cross-lingual retrieval benchmarks show that ORPHEAS outperforms state-of-the-art multilingual embedding models, demonstrating that domain-specialized fine-tuning for a morphologically complex language does not compromise cross-lingual retrieval capability.