Logical reasoning with large language models (LLMs) has received growing attention. One mainstream approach translates natural language into formal logic and then applies symbolic solvers for deduction. While effective on many tasks, these LLM-based translators often fail to generate consistent symbolic representations when the same concept appears in different linguistic forms. Such inconsistencies break logical coherence and lead to solver errors. However, most existing benchmarks lack this type of linguistic variation, even though it frequently occurs in real-world text, leaving the problem underexplored. To address this gap, we present SoLT, a benchmark that systematically rewrites reasoning datasets into diverse yet logically equivalent forms at multiple levels. Beyond evaluation, SoLT also provides a general method for enriching any dataset with linguistic diversity while preserving both meaning and logic. To further improve the stability of LLM-based reasoning, we propose MenTaL, which explicitly guides models to build a concept-symbol mapping table during translation. By linking equivalent expressions to shared symbols, MenTaL maintains consistency and mitigates symbol drift. Experiments on SoLT show that LLMs indeed suffer from inconsistent symbol mapping under linguistic variation, leading to significant drops in reasoning accuracy, whereas applying MenTaL yields clear and stable performance improvements across diverse inputs. Overall, our findings reveal that overlooking linguistic diversity hides key weaknesses in LLM-based translators, and our work offers a step toward more reliable logical reasoning in varied real-world scenarios. Our code is available at https://github.com/wufeiwuwoshihua/LinguDiver.
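To make the idea of a concept-symbol mapping table concrete, the sketch below shows one minimal way such a table could keep paraphrased mentions of a concept tied to a single predicate symbol during natural-language-to-logic translation. It is an illustrative assumption, not MenTaL's actual implementation: the class name, the canonical concept keys, and the example sentences are all hypothetical, and in practice the LLM (or a paraphrase detector) would decide when two surface forms name the same concept before consulting the table.

```python
# Minimal illustrative sketch of a concept-symbol mapping table for
# NL-to-logic translation. Names and structure are assumptions for
# exposition, not the paper's implementation.

class ConceptSymbolTable:
    """Maps every mention of a concept to one shared predicate symbol."""

    def __init__(self):
        self._table = {}   # canonical concept -> symbol, e.g. "tired" -> "P1"
        self._counter = 0

    def symbol_for(self, concept: str) -> str:
        """Return the shared symbol for a concept, minting one on first sight."""
        key = concept.strip().lower()
        if key not in self._table:
            self._counter += 1
            self._table[key] = f"P{self._counter}"
        return self._table[key]


# Usage: the translator first decides that "Alice is exhausted" and
# "Alice feels worn out" both refer to the canonical concept "tired",
# then looks up a single shared symbol for it, so both clauses compile
# to the same predicate (e.g. P1(Alice)) and the solver sees a
# consistent representation instead of two drifting symbols.
table = ConceptSymbolTable()
s1 = table.symbol_for("tired")   # first mention: "Alice is exhausted"
s2 = table.symbol_for("tired")   # paraphrase:    "Alice feels worn out"
assert s1 == s2
```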