Large language models perform increasingly well on standardized logical reasoning benchmarks, but it is unclear whether this ability remains robust beyond English. We introduce ChLogic, an English--Chinese aligned benchmark that tests whether models preserve logical reasoning performance when the same latent logical structure is expressed in English and in diverse Chinese surface realizations. Built from formal logical templates, the benchmark contains three sets: (i) the General aligned set, derived from 60 General Propositions across nine template families; (ii) the Difficult aligned set, derived from 40 Difficult Problems; and (iii) the Chinese-only set, covering 15 types of language-specific phenomena. Each aligned item pairs one English reference expression with five Chinese realizations. Experiments on Qwen3, Ministral, and GLM models reveal a persistent English--Chinese performance gap. Back-translation from standard Chinese into English often improves performance on the General aligned set, but has mixed effects on the Difficult aligned set, where Qwen3-32B and GLM-5.1 perform worse after translation. These results indicate that Chinese surface realization, translation artifacts, and model-specific behavior jointly affect multilingual logical reasoning. ChLogic thus serves as a stress test for the robustness of multilingual logical reasoning.
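As an illustration of the aligned-item structure described above, the following minimal Python sketch pairs one English reference with five Chinese realizations and computes an English-minus-Chinese accuracy gap. The class name `AlignedItem`, its fields, the `gold` label, and the `accuracy_gap` function are hypothetical names chosen for this sketch; the benchmark's actual schema and scoring protocol may differ.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AlignedItem:
    """One aligned item: an English reference plus five Chinese realizations.

    Field names are hypothetical; the benchmark's real schema may differ.
    """
    english: str
    chinese: List[str]
    gold: str  # assumed gold label, e.g. "valid" or "invalid"

    def __post_init__(self) -> None:
        # The abstract states five Chinese realizations per aligned item.
        if len(self.chinese) != 5:
            raise ValueError("expected exactly five Chinese realizations")


def accuracy_gap(items: List[AlignedItem], predict: Callable[[str], str]) -> float:
    """Return English accuracy minus mean accuracy over all Chinese realizations."""
    if not items:
        raise ValueError("no items to score")
    en_correct = sum(predict(it.english) == it.gold for it in items)
    zh_correct = sum(predict(z) == it.gold for it in items for z in it.chinese)
    en_acc = en_correct / len(items)
    zh_acc = zh_correct / (5 * len(items))
    return en_acc - zh_acc
```

Here `predict` stands in for any function that maps a prompt to a predicted label, for example a wrapper around a model call. A positive return value would correspond to the kind of English--Chinese gap the abstract reports.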