Language is a vehicle for thought, intricately tied to sounds, symbols, and meaning. However, most large language model (LLM) research focuses on meaning (semantics) and symbols (spelling) while largely overlooking sounds. Existing benchmarks on LLMs' phonological abilities are either solvable through rote memorization or intertwined with other abilities, making them inadequate to measure LLMs' genuine ability in phonological understanding. Here, we present Phun-Bench, a purpose-built Chinese benchmark with diverse tasks and settings across three dimensions (Homophony, Rhyme, and Phonetic Similarity), designed to systematically evaluate LLMs' phonological understanding. Our results show that while LLMs excel at recalling correct pronunciations, they generally struggle to leverage phonological knowledge in the flexible and intuitive way that human speakers do. Moreover, through detailed analyses, we propose a hypothesis regarding the underlying mechanism of LLMs' phonological understanding and "perception", highlighting an underexplored frontier for future research.