Commonsense reasoning often involves evaluating multiple plausible interpretations rather than selecting a single atomic answer, yet most benchmarks rely on single-label evaluation, obscuring whether statements are jointly plausible, mutually exclusive, or jointly implausible. We introduce LOGICAL-COMMONSENSEQA, a benchmark that re-frames commonsense reasoning as logical composition over pairs of atomic statements using plausibility-level operators (AND, OR, NEITHER/NOR). Evaluating instruction-tuned, reasoning-specialized, and fine-tuned models under zero-shot, few-shot, and chain-of-thought prompting, we find that while models perform reasonably on conjunctive and moderately on disjunctive reasoning, performance degrades sharply on negation-based questions. LOGICAL-COMMONSENSEQA exposes fundamental reasoning limitations and provides a controlled framework for advancing compositional commonsense reasoning.
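To make the composition idea concrete, the following is a minimal sketch of how the three plausibility-level operators could be read as Boolean functions over a pair of atomic statements. The abstract does not give formal definitions, so the semantics here are assumptions: AND holds when both statements are plausible, OR when at least one is, and NEITHER/NOR when neither is. The names `Operator` and `compose` are hypothetical and are not taken from the benchmark.

```python
from enum import Enum


class Operator(Enum):
    """Plausibility-level operators named in the abstract."""
    AND = "AND"
    OR = "OR"
    NEITHER_NOR = "NEITHER/NOR"


def compose(op: Operator, a_plausible: bool, b_plausible: bool) -> bool:
    """Return whether the composite claim holds for two atomic statements.

    Assumed semantics (not confirmed by the source):
      AND         -> both statements are plausible
      OR          -> at least one statement is plausible
      NEITHER/NOR -> neither statement is plausible
    """
    if op is Operator.AND:
        return a_plausible and b_plausible
    if op is Operator.OR:
        return a_plausible or b_plausible
    if op is Operator.NEITHER_NOR:
        return not (a_plausible or b_plausible)
    raise ValueError(f"unknown operator: {op!r}")


if __name__ == "__main__":
    # Enumerate all plausibility assignments for a pair of statements.
    for op in Operator:
        for a in (True, False):
            for b in (True, False):
                print(f"{op.value:<12} a={a!s:<5} b={b!s:<5} -> {compose(op, a, b)}")
```

Under this reading, a NEITHER/NOR item is correct only when both atomic statements are implausible, which is the negation-based case the abstract reports as hardest for models. If the benchmark treats OR as exclusive, the second branch would need to change accordingly.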