Selectively processing noisy utterances while disregarding speech-specific elements poses no considerable challenge for humans, who exhibit remarkable cognitive abilities to separate semantically significant content from speech-specific noise (i.e., filled pauses, disfluencies, and restarts). These abilities may be driven by mechanisms based on acquired grammatical rules that compose abstract syntactic-semantic structures within utterances. Segments without syntactic or semantic significance are consistently excluded from these structures. The structures, in tandem with lexis, likely underpin language comprehension and thus facilitate effective communication. In our study, grounded in linguistically motivated experiments, we investigate whether large language models (LLMs) can effectively perform analogous speech comprehension tasks. In particular, we examine the ability of LLMs to extract well-structured utterances from transcriptions of noisy dialogues. We conduct two evaluation experiments in a Polish-language scenario, using a dataset presumably unfamiliar to LLMs to mitigate the risk of data contamination. Our results show that not all extracted utterances are correctly structured, indicating that either LLMs do not fully acquire syntactic-semantic rules or they acquire them but cannot apply them effectively. We conclude that the ability of LLMs to comprehend noisy utterances remains relatively superficial compared to human proficiency in processing them.