Symphonym: Universal Phonetic Embeddings for Cross-Script Name Matching

Linking names across historical sources, languages, and writing systems remains a fundamental challenge in digital humanities and geographic information retrieval. Existing approaches require language-specific phonetic algorithms or fail to capture phonetic relationships across different scripts. This paper presents Symphonym, a neural embedding system that maps names from any script into a unified 128-dimensional phonetic space, enabling direct similarity comparison without runtime phonetic conversion. Symphonym uses a Teacher-Student architecture where a Teacher network trained on articulatory phonetic features produces target embeddings, while a Student network learns to approximate these embeddings directly from characters. The Teacher combines Epitran (extended with 100 new language-script mappings), Phonikud for Hebrew, and CharsiuG2P for Chinese, Japanese, and Korean. Training used 32.7 million triplet samples of toponyms spanning 20 writing systems from GeoNames, Wikidata, and the Getty Thesaurus of Geographic Names. On the MEHDIE Hebrew-Arabic historical toponym benchmark, Symphonym achieves Recall@10 of 97.6% and MRR of 90.3%, outperforming Levenshtein and Jaro-Winkler baselines (Recall@1: 86.7% vs 81.5% and 78.5%). Evaluation on 12,947 real cross-script training pairs shows 82.6% achieve greater than 0.75 cosine similarity, with best performance on Arabic-Cyrillic (94–100%) and Cyrillic-Latin (94.3%) combinations. The fixed-length embeddings enable efficient retrieval in digital humanities workflows, with a case study on medieval personal names demonstrating effective transfer from modern place names to historical orthographic variation.
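The retrieval metrics quoted above (cosine similarity, Recall@k, MRR) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names are hypothetical, and the two-dimensional vectors are toy placeholders standing in for the 128-dimensional embeddings, not Symphonym outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def gold_ranks(queries, candidates, gold):
    """1-based rank of each query's correct candidate under cosine similarity."""
    ranks = []
    for query, gold_index in zip(queries, gold):
        sims = [cosine(query, c) for c in candidates]
        # Rank = 1 + number of candidates scoring strictly higher than the gold one.
        ranks.append(1 + sum(s > sims[gold_index] for s in sims))
    return ranks

def recall_at_k(ranks, k):
    """Fraction of queries whose correct candidate appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Toy placeholder embeddings (assumption: real embeddings would be 128-dimensional).
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.2, 1.0]]
gold = [0, 2]  # index of the correct candidate for each query

ranks = gold_ranks(queries, candidates, gold)
print(ranks, recall_at_k(ranks, 1), mean_reciprocal_rank(ranks))
```

Because every name is reduced to a fixed-length vector, the same ranking step could be handed to an approximate nearest-neighbour index for large gazetteers; the exhaustive loop here is only for clarity.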
