We present BaziQA-Benchmark, a standardized benchmark for evaluating symbolic and temporally compositional reasoning in large language models. The benchmark comprises 200 professionally curated multiple-choice problems drawn from the Global Fortune-teller Competition (2021--2025), where each instance requires structured inference over a fixed symbolic chart and interacting temporal conditions. Unlike anecdotal or prompt-driven evaluations, BaziQA-Benchmark enables objective scoring and controlled comparison across years, domains, and model families. We evaluate contemporary language models under a multi-turn setting and analyze performance variation across temporal difficulty, reasoning domains, and inference protocols. To further probe reasoning behavior, we introduce a lightweight Structured Reasoning Protocol that constrains inference order without adding domain knowledge. Results show that models consistently outperform chance but remain far from saturation, exhibiting pronounced sensitivity to temporal composition and reasoning order, as well as systematic failures on precise temporal localization and multi-condition symbolic judgments.