The Winograd Schema Challenge (WSC) is a prominent benchmark for evaluating machine understanding. While Large Language Models (LLMs) excel at answering WSC questions, their ability to generate such questions remains less explored. In this work, we propose Tree-of-Experts (ToE), a novel prompting method that improves the generation of WSC instances (50% valid cases vs. 10% for recent methods). Using this approach, we introduce WSC+, a new dataset comprising 3,026 LLM-generated sentences. Notably, we extend the WSC framework with new 'ambiguous' and 'offensive' categories, providing deeper insight into model overconfidence and bias. Our analysis reveals nuances in generation-evaluation consistency, suggesting that LLMs do not always perform better when evaluating questions they generated themselves than when evaluating questions crafted by other models. On WSC+, GPT-4, the top-performing LLM, achieves an accuracy of 68.7%, significantly below the human benchmark of 95.1%.