Commonsense reasoning often involves evaluating multiple plausible interpretations rather than selecting a single atomic answer, yet most benchmarks rely on single-label evaluation, obscuring whether statements are jointly plausible, mutually exclusive, or jointly implausible. We introduce LOGICAL-COMMONSENSEQA, a benchmark that reframes commonsense reasoning as logical composition over pairs of atomic statements using plausibility-level operators (AND, OR, NEITHER/NOR). Evaluating instruction-tuned, reasoning-specialized, and fine-tuned models under zero-shot, few-shot, and chain-of-thought prompting, we find that models perform reasonably on conjunctive reasoning and moderately on disjunctive reasoning, but performance degrades sharply on negation-based questions. LOGICAL-COMMONSENSEQA exposes fundamental reasoning limitations and provides a controlled framework for advancing compositional commonsense reasoning.
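The composition scheme described above can be sketched minimally: each atomic statement receives a plausibility judgment, and the pair-level label follows from the operator semantics. This is an illustrative sketch only; the function and label names are assumptions, not the benchmark's actual implementation.

```python
def compose(plausible_a: bool, plausible_b: bool) -> str:
    """Map the plausibility of two atomic statements to a pair-level label.

    Hypothetical semantics mirroring the operators named in the abstract:
    AND     -> both statements are jointly plausible
    OR      -> exactly one statement is plausible (mutually exclusive)
    NEITHER -> both statements are jointly implausible
    """
    if plausible_a and plausible_b:
        return "AND"
    if plausible_a or plausible_b:
        return "OR"
    return "NEITHER"

# Example: "Ice floats on water" (plausible) paired with
# "Ice sinks in water" (implausible) composes to OR.
print(compose(True, False))
```

Under this reading, single-label evaluation collapses the three pair-level outcomes into one answer slot, which is exactly the information loss the benchmark is designed to expose.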