The SemEval 2024 BRAINTEASER task challenges language models to perform lateral thinking -- a form of creative, non-linear reasoning that remains underexplored in NLP. The task comprises two subtasks, Sentence Puzzle and Word Puzzle, requiring models to defy conventional commonsense associations. We present a system that fine-tunes DeBERTaV3 using HuggingFace's AutoModelForMultipleChoice architecture. We augment the provided training data with two additional sources: (1) a humor-style question-answering dataset generated via GPT-4 prompting, and (2) the RiddleSense dataset. This data augmentation strategy is motivated by the observation that humor and riddles share the lateral reasoning structure required by the task. Our best system achieves 92.5\% overall accuracy on the Sentence Puzzle subtask and 80.2\% on the Word Puzzle subtask, ranking 6th out of 31 teams and 10th out of 23 teams, respectively. We further show that the choice of task formulation matters: framing the problem as multiple-choice rather than sequence classification yields a 10-point accuracy improvement with the same base model. Our analysis reveals that data augmentation with humor and riddle data is particularly effective for sentence-level lateral reasoning, while word-level puzzles remain a harder challenge.
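The multiple-choice formulation described above can be sketched as follows. This is a minimal illustration only: `score_pair` is a hypothetical stand-in for the fine-tuned DeBERTaV3 multiple-choice head (which in the actual system produces one logit per question–choice pair), not the system's implementation.

```python
# Sketch of the multiple-choice formulation: the question is paired with
# every candidate answer, each pair receives a score, and the highest-
# scoring choice is predicted. In the real system the scorer is a
# fine-tuned DeBERTaV3 model via AutoModelForMultipleChoice; here we use
# a toy token-overlap scorer purely to show the pairing-and-argmax shape.

def score_pair(question: str, choice: str) -> float:
    """Toy stand-in scorer: fraction of choice tokens shared with the question."""
    q = set(question.lower().split())
    c = set(choice.lower().split())
    return len(q & c) / max(len(c), 1)

def predict(question: str, choices: list[str]) -> int:
    """Score every (question, choice) pair and return the argmax index."""
    scores = [score_pair(question, c) for c in choices]
    return max(range(len(scores)), key=scores.__getitem__)
```

The key contrast with a sequence-classification framing is that the model scores each choice jointly with the question rather than mapping the question alone to a label, which is what the reported 10-point accuracy gap measures.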