Symphonym: Universal Phonetic Embeddings for Cross-Script Toponym Matching via Teacher-Student Distillation

Linking place names across languages and writing systems is a fundamental challenge in digital humanities and geographic information retrieval. Existing approaches rely on language-specific phonetic algorithms or transliteration rules that fail when names cross script boundaries: no string metric can determine that "Moscow" rendered in Cyrillic and "Moscow" rendered in Arabic refer to the same city. I present Symphonym, a neural embedding system that maps toponyms from 20 writing systems into a unified 128-dimensional phonetic space. A Teacher network trained on articulatory phonetic features (via Epitran and PanPhon) produces target embeddings, while a Student network learns to approximate these from raw characters. At inference, only the lightweight Student (1.7M parameters) is required, enabling deployment without runtime phonetic conversion. Training follows a three-phase curriculum on 57 million toponyms from GeoNames, Wikidata, and the Getty Thesaurus of Geographic Names. Phase 1 trains the Teacher on 467K phonetically grounded triplets. Phase 2 aligns the Student to Teacher outputs across 23M samples, reaching 96.6% cosine similarity between Student and Teacher embeddings. Phase 3 fine-tunes on 3.3M hard-negative triplets (negatives that share a prefix and script with the anchor but refer to different places) to sharpen discrimination. On the MEHDIE Hebrew-Arabic benchmark, Symphonym achieves 89.2% Recall@1, outperforming Levenshtein (81.5%) and Jaro-Winkler (78.5%). The system is optimised for cross-script matching; same-script variants can be handled by complementary string methods. Symphonym will enable fuzzy phonetic reconciliation and search across the World Historical Gazetteer's 67 million toponyms. Code and models are publicly available.
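
To make the training objectives and the evaluation metric concrete, the following is a minimal NumPy sketch, not the released implementation. It assumes a cosine-based distillation loss for Phase 2, a margin-based triplet loss on cosine distance for Phases 1 and 3, and Recall@1 computed by nearest-neighbour search over cosine similarity. The function names, the margin of 0.2, and the toy data are illustrative assumptions; the abstract does not specify the exact loss formulations.

```python
import numpy as np

EMBED_DIM = 128  # dimensionality of the phonetic space stated in the abstract


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def distillation_loss(student, teacher):
    """Assumed Phase 2 objective: mean (1 - cosine) between Student and Teacher."""
    s, t = l2_normalize(student), l2_normalize(teacher)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Assumed Phase 1/3 objective: hinge loss on cosine distance.

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin` (a hypothetical value).
    """
    a, p, n = (l2_normalize(v) for v in (anchor, positive, negative))
    d_pos = 1.0 - np.sum(a * p, axis=-1)
    d_neg = 1.0 - np.sum(a * n, axis=-1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))


def recall_at_1(queries, candidates, gold):
    """Fraction of queries whose most similar candidate is the gold match."""
    sims = l2_normalize(queries) @ l2_normalize(candidates).T
    return float(np.mean(np.argmax(sims, axis=1) == np.asarray(gold)))


# Toy usage with random stand-ins for embeddings (not real model output).
rng = np.random.default_rng(0)
teacher_out = rng.normal(size=(4, EMBED_DIM))
student_out = teacher_out + 0.05 * rng.normal(size=(4, EMBED_DIM))
print("distillation loss:", distillation_loss(student_out, teacher_out))
print("recall@1:", recall_at_1(student_out, teacher_out, gold=[0, 1, 2, 3]))
```

In a real pipeline the rows of `queries` and `candidates` would be Student embeddings of toponyms in two different scripts, and `gold` would hold the index of each query's true counterpart.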
