Large language models (LLMs) excel at semantic understanding, yet their ability to reconstruct internal structure from scrambled inputs remains underexplored. Sentence-level restoration is ill-posed for automated evaluation because multiple valid word orders often exist. We introduce OrderProbe, a deterministic benchmark for structural reconstruction using fixed four-character expressions in Chinese, Japanese, and Korean, which have a unique canonical order and thus support exact-match scoring. We further propose a diagnostic framework that evaluates models beyond recovery accuracy, including semantic fidelity, logical validity, consistency, robustness sensitivity, and information density. Experiments on twelve widely used LLMs show that structural reconstruction remains difficult even for frontier systems: zero-shot recovery frequently falls below 35%. We also observe a consistent dissociation between semantic recall and structural planning, suggesting that structural robustness is not an automatic byproduct of semantic competence.
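The scoring protocol the abstract describes (scramble a fixed four-character expression, ask for reconstruction, grade by exact match against the unique canonical order) can be sketched as follows. This is a minimal illustration, not the paper's released harness; the idiom, the random-guess baseline, and all function names are illustrative assumptions.

```python
import random

def scramble(expr: str, rng: random.Random) -> str:
    """Return a permutation of the characters that differs from the canonical order."""
    chars = list(expr)
    while True:
        rng.shuffle(chars)
        if "".join(chars) != expr:
            return "".join(chars)

def exact_match(prediction: str, canonical: str) -> bool:
    """Deterministic scoring: 1 iff the output is exactly the canonical order."""
    return prediction.strip() == canonical

rng = random.Random(0)
canonical = "画蛇添足"  # illustrative idiom, not necessarily in the benchmark's item list
shuffled = scramble(canonical, rng)

# Toy baseline: a "model" that emits a uniformly random character order.
# With 4 distinct characters there are 4! = 24 permutations, so random
# guessing recovers the canonical order about 1/24 of the time.
guesses = ["".join(rng.sample(canonical, 4)) for _ in range(1000)]
accuracy = sum(exact_match(g, canonical) for g in guesses) / len(guesses)
```

Because each expression has exactly one valid order, this exact-match criterion avoids the ambiguity that makes sentence-level restoration ill-posed for automated evaluation.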