Large language models (LLMs) process text as sequences of subword tokens, which can obscure the character-level and morphological structure that underlies word formation. This limitation is most acute for languages with non-concatenative morphology, where standard tokenizers systematically misalign token boundaries with morpheme boundaries. We introduce PACUTE, a diagnostic benchmark of 4,600 tasks designed to evaluate morphological understanding in Filipino, a language characterized by productive infixation, reduplication, and lexical distinctions marked by diacritics that are typically omitted in written text. PACUTE includes a hierarchical diagnostic framework of six compositional levels that localizes where morphological understanding breaks down. Evaluating open-weight LLMs and frontier commercial models, we find that open-weight models perform near chance on morpheme decomposition regardless of scale. Frontier models perform much better, often recovering individual affixes under contains-match scoring, but remain far below their character-level ceilings on the compositional tasks of morpheme transformation and syllabification. These results identify productive morphological composition, rather than character access alone, as the persistent bottleneck for Filipino word-structure understanding.