We explore the ability of large language models to solve and generate puzzles from the NPR Sunday Puzzle game show using PUZZLEQA, a dataset comprising 15 years of on-air puzzles. We evaluate four large language models using PUZZLEQA, in both multiple-choice and free-response formats, and explore two prompt engineering techniques to improve free-response performance: chain-of-thought reasoning and prompt summarization. We find that state-of-the-art large language models can solve many PUZZLEQA puzzles: the best model, GPT-3.5, achieves 50.2% loose accuracy. However, in our few-shot puzzle generation experiment, we find no evidence that models can generate puzzles: GPT-3.5 generates puzzles with answers that do not conform to the generated rules. Puzzle generation remains a challenging task for future work.