Zero-shot cross-lingual phoneme recognition is often hindered by the fragility of direct acoustic-to-symbol mapping, which is susceptible to language-specific variations. Inspired by joint-embedding predictive architecture (JEPA) work in vision, we propose ArtNet, a framework that uses a structured prediction task over articulatory features to improve acoustic robustness. Specifically, ArtNet combines an articulatory predictor, designed to extract universal articulatory representations from self-supervised learning (SSL) features, with a variational information bottleneck (VIB) that suppresses language-specific variations. Experiments on seven unseen languages demonstrate that ArtNet, particularly when combined with the proposed vector-space inventory alignment (VSIA) strategy, significantly outperforms competitive baselines, achieving relative reductions of 20.56\% in phoneme error rate (PER) and 7.01\% in phoneme feature error rate (PFER).
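To make the role of the variational information bottleneck concrete, the following is a minimal sketch of a generic VIB regularizer, not ArtNet's actual implementation. It assumes a diagonal Gaussian posterior and a standard-normal prior, which is the common textbook formulation; the function names (`vib_sample`, `vib_kl`, `vib_loss`) and the weight `beta` are hypothetical and do not come from the abstract.

```python
import math
import random


def vib_sample(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, 1).

    Assumption: the encoder outputs a mean and a log-variance per dimension.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def vib_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def vib_loss(task_loss, mu, log_var, beta=1e-3):
    """Task loss plus a beta-weighted KL penalty (beta is a hypothetical default)."""
    return task_loss + beta * vib_kl(mu, log_var)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.5, -0.2], [0.0, -1.0]
    z = vib_sample(mu, log_var, rng)
    print("sample:", z)
    print("KL term:", vib_kl(mu, log_var))
    print("total loss:", vib_loss(1.0, mu, log_var, beta=0.01))
```

In this formulation the KL term penalizes information carried by the latent code beyond what the task loss requires, which is the general mechanism by which a bottleneck can discourage nuisance factors such as language-specific variation.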