Symphonym: Universal Phonetic Embeddings for Cross-Script Name Matching

Linking names across historical sources, languages, and writing systems remains a fundamental challenge in digital humanities and geographic information retrieval. Existing approaches either require language-specific phonetic algorithms or fail to capture phonetic relationships across scripts. This paper presents Symphonym, a neural embedding system that maps names from any script into a unified 128-dimensional phonetic space, enabling direct similarity comparison without runtime phonetic conversion. Symphonym uses a Teacher-Student architecture: a Teacher network trained on articulatory phonetic features produces target embeddings, and a Student network learns to approximate these embeddings directly from characters. The Teacher combines Epitran (extended with 100 new language-script mappings), Phonikud for Hebrew, and CharsiuG2P for Chinese, Japanese, and Korean. Training used 32.7 million triplet samples of toponyms spanning 20 writing systems, drawn from GeoNames, Wikidata, and the Getty Thesaurus of Geographic Names. On the MEHDIE Hebrew-Arabic historical toponym benchmark, Symphonym achieves a Recall@10 of 97.6% and an MRR of 90.3%, outperforming the Levenshtein and Jaro-Winkler baselines (Recall@1: 86.7% vs. 81.5% and 78.5%, respectively). An evaluation on 12,947 real cross-script training pairs shows that 82.6% achieve a cosine similarity greater than 0.75, with the best performance on Arabic-Cyrillic (94–100%) and Cyrillic-Latin (94.3%) combinations. The fixed-length embeddings enable efficient retrieval in digital humanities workflows, and a case study on medieval personal names demonstrates effective transfer from modern place names to historical orthographic variation.
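To make the reported metrics concrete, the following is a minimal sketch of how cosine-similarity retrieval over fixed-length embeddings can be scored with Recall@k and MRR. It is not Symphonym's evaluation code: the function names, the candidate identifiers, and the toy 3-dimensional vectors (standing in for the 128-dimensional embeddings) are all hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_of_match(query_vec, candidates, true_id):
    """1-based rank of true_id when candidates are sorted by descending similarity."""
    ranked = sorted(
        candidates,
        key=lambda cid: cosine(query_vec, candidates[cid]),
        reverse=True,
    )
    return ranked.index(true_id) + 1


def recall_at_k(ranks, k):
    """Fraction of queries whose correct match appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical toy data: three candidate names and three queries,
# each query paired with the id of its correct match.
candidates = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.0, 1.0, 0.0],
    "c": [0.0, 0.0, 1.0],
}
queries = [
    ([0.9, 0.1, 0.0], "a"),  # correct match ranked first
    ([0.5, 0.6, 0.0], "a"),  # "b" is slightly closer, so "a" ranks second
    ([0.1, 0.2, 0.9], "c"),  # correct match ranked first
]

ranks = [rank_of_match(vec, candidates, true_id) for vec, true_id in queries]
print("ranks:", ranks)
print("Recall@1:", recall_at_k(ranks, 1))
print("MRR:", mean_reciprocal_rank(ranks))
```

Because every name is reduced to a vector of the same length, the ranking step is a plain similarity sort. At scale, an approximate nearest-neighbour index would typically replace the exhaustive sort used in this sketch.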
