Phone recognition (PR) serves as the atomic interface for language-agnostic modeling in cross-lingual speech processing and phonetic analysis. Despite long-standing efforts to develop PR systems, current evaluations measure only surface-level transcription accuracy. We introduce PRiSM, the first open-source benchmark designed to expose blind spots in phonetic perception through intrinsic and extrinsic evaluation of PR systems. PRiSM standardizes transcription-based evaluation and assesses downstream utility in clinical, educational, and multilingual settings via transcription and representation probes. We find that diverse language exposure during training is key to PR performance, that encoder-CTC models are the most stable, and that specialized PR models still outperform Large Audio Language Models. PRiSM releases code, recipes, and datasets to move the field toward multilingual speech models with robust phonetic ability: https://github.com/changelinglab/prism.