Word Sense Disambiguation (WSD) remains a key challenge in Natural Language Processing (NLP), especially when dealing with rare or domain-specific senses that are often misinterpreted. While modern high-parameter Large Language Models (LLMs) such as GPT-4-Turbo have shown state-of-the-art WSD performance, their computational and energy demands limit scalability. This study investigates whether low-parameter LLMs (<4B parameters) can achieve comparable results through fine-tuning strategies that emphasize reasoning-driven sense identification. Using the FEWS dataset augmented with semi-automated, rationale-rich annotations, we fine-tune eight small-scale open-source LLMs (e.g., Gemma and Qwen). Our results reveal that Chain-of-Thought (CoT)-based reasoning combined with neighbour-word analysis achieves performance comparable to GPT-4-Turbo in zero-shot settings. Importantly, the Gemma-3-4B and Qwen-3-4B models consistently outperform all medium-parameter baselines and state-of-the-art models on FEWS, with robust generalization to unseen senses. Furthermore, evaluation on the unseen "Fool Me If You Can" dataset confirms strong cross-domain adaptability without task-specific fine-tuning. This work demonstrates that, with carefully crafted reasoning-centric fine-tuning, low-parameter LLMs can deliver accurate WSD while substantially reducing computational and energy demands.