This paper sheds light on the limitations of Large Language Models (LLMs) by rigorously evaluating their ability to process masked text. We introduce two novel tasks: MskQA, which measures reasoning on masked question-answering datasets such as RealtimeQA, and MskCal, which assesses numerical reasoning on masked arithmetic problems. Testing GPT-4o and GPT-4o-mini reveals that while LLMs exhibit some resilience to masked text, their performance is highly contingent on the masking rate and the availability of semantic cues. Specifically, "solid masking," in which semantic clues are entirely absent, leads to a significant performance drop compared to "partial lifting," in which some semantic information is retained, indicating that LLMs rely on surface-level patterns. Notably, GPT-4o consistently outperforms GPT-4o-mini, particularly on MskCal, demonstrating a greater ability to handle numerical reasoning over masked text. These findings underscore the crucial role of semantic cues in the reasoning process of LLMs. Our study illuminates the interplay between background knowledge and reasoning ability in masked text processing, paving the way for a deeper understanding of LLM capabilities and limitations, and highlighting the need for more robust evaluation methods that accurately assess true comprehension.
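The distinction between "solid masking" and "partial lifting" can be made concrete with a minimal sketch. The paper's actual masking procedure is not specified here; the function name `mask_tokens`, the `[MASK]` placeholder, and the first-character "partial lifting" rule below are illustrative assumptions, not the authors' implementation:

```python
import random

def mask_tokens(text: str, rate: float, partial: bool = False) -> str:
    """Mask a fraction of whitespace-separated tokens.

    Solid masking   (partial=False): token -> "[MASK]"  (no semantic clue left)
    Partial lifting (partial=True):  token -> first character + underscores
                                     (some surface-level clue retained)
    Hypothetical illustration only; not the paper's exact scheme.
    """
    random.seed(0)  # deterministic choice of positions, for reproducibility
    tokens = text.split()
    n_masked = int(len(tokens) * rate)  # masking rate controls how many tokens
    for i in random.sample(range(len(tokens)), n_masked):
        if partial:
            tokens[i] = tokens[i][0] + "_" * (len(tokens[i]) - 1)
        else:
            tokens[i] = "[MASK]"
    return " ".join(tokens)
```

Under this sketch, a higher `rate` removes more tokens, and `partial=True` leaves recoverable hints (word length and initial letter), mirroring the abstract's finding that retained semantic information makes the task markedly easier for the models.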