Recent advances in language models have substantially improved Natural Language Understanding (NLU). Although widely used benchmarks suggest that Large Language Models (LLMs) can effectively disambiguate word senses, their practical applicability in real-world narrative contexts remains underexplored. SemEval-2026 Task 5 addresses this gap by introducing a task that requires predicting the human-perceived plausibility of a word sense within a short story. In this work, we propose an LLM-based framework that scores the plausibility of homonymous word senses in narrative texts using a structured reasoning mechanism. We examine how fine-tuning low-parameter LLMs with diverse reasoning strategies, and dynamic few-shot prompting of large-parameter models, affect sense identification and plausibility estimation. Our results show that commercial large-parameter LLMs with dynamic few-shot prompting closely replicate human plausibility judgments. Furthermore, model ensembling slightly improves performance, simulating the agreement patterns of five human annotators better than single-model predictions do.