Large language models (LLMs) perform strongly on many NLP tasks, but their ability to produce explicit linguistic structure remains unclear. We evaluate instruction-tuned LLMs on two structured prediction tasks for Standard Arabic: morphosyntactic tagging and labeled dependency parsing. Arabic provides a challenging testbed due to its rich morphology and orthographic ambiguity, which create strong morphology-syntax interactions. We compare zero-shot prompting with retrieval-based in-context learning (ICL) using examples from Arabic treebanks. Results show that prompt design and demonstration selection strongly affect performance: proprietary models approach supervised baselines for feature-level tagging and become competitive with specialized dependency parsers. In raw-text settings, tokenization remains challenging, though retrieval-based ICL improves both parsing and tokenization. Our analysis highlights which aspects of Arabic morphology and syntax LLMs capture reliably and which remain difficult.