Zero-shot commonsense Question-Answering (QA) requires models to reason about general situations beyond specific benchmarks. State-of-the-art approaches fine-tune language models on QA pairs constructed from CommonSense Knowledge Bases (CSKBs) to equip the models with more commonsense knowledge in a QA context. However, current QA synthesis protocols may introduce noise from the CSKBs and generate ungrammatical questions and false-negative options, which impede the model's ability to generalize. To address these issues, we propose QADYNAMICS, a training dynamics-driven framework for QA diagnostics and refinement. Our approach analyzes the training dynamics of each QA pair at both the question level and the option level, and discards machine-detectable artifacts by removing uninformative QA pairs and mislabeled or false-negative options. Extensive experiments demonstrate the effectiveness of our approach, which outperforms all baselines, including LLMs such as ChatGPT, while using only 33% of the synthetic data. Moreover, expert evaluations confirm that our framework significantly improves the quality of QA synthesis. Our code and model checkpoints are available at https://github.com/HKUST-KnowComp/QaDynamics.
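To make the idea of question-level and option-level diagnostics concrete, the following is a minimal illustrative sketch, not the released implementation. It assumes that the training dynamics of a QA pair are summarized by the mean and standard deviation of each option's predicted probability across epochs; the abstract does not specify the exact statistics. The function names (`option_dynamics`, `diagnose`) and all thresholds are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def option_dynamics(probs_per_epoch):
    """Summarize each option's probability trajectory across epochs.

    probs_per_epoch: list (one entry per epoch) of lists holding the
    model's probability for every option of a single QA pair.
    Returns one (mean, std) tuple per option.
    """
    n_options = len(probs_per_epoch[0])
    stats = []
    for i in range(n_options):
        series = [epoch[i] for epoch in probs_per_epoch]
        stats.append((mean(series), pstdev(series)))
    return stats


def diagnose(probs_per_epoch, gold,
             easy_conf=0.9, easy_std=0.05,      # hypothetical thresholds
             low_gold_conf=0.2, distractor_conf=0.4):
    """Flag a QA pair and its options using assumed, illustrative rules."""
    stats = option_dynamics(probs_per_epoch)
    gold_mean, gold_std = stats[gold]
    return {
        # Question level: always answered confidently, so little to learn.
        "uninformative": gold_mean >= easy_conf and gold_std <= easy_std,
        # Option level: the labeled answer is almost never preferred.
        "possibly_mislabeled": gold_mean <= low_gold_conf,
        # Option level: a distractor the model keeps preferring may be
        # a false negative, i.e. a plausible answer marked as wrong.
        "suspect_distractors": [
            i for i, (m, _) in enumerate(stats)
            if i != gold and m >= distractor_conf
        ],
    }


if __name__ == "__main__":
    easy = [[0.92, 0.05, 0.03], [0.95, 0.03, 0.02], [0.97, 0.02, 0.01]]
    noisy = [[0.40, 0.45, 0.15], [0.35, 0.55, 0.10], [0.30, 0.60, 0.10]]
    print(diagnose(easy, gold=0))
    print(diagnose(noisy, gold=0))
```

Under these assumed rules, a pair flagged as uninformative would be dropped entirely, while flagged options would be removed or replaced and the rest of the pair kept. In practice the thresholds would need to be tuned against the actual distribution of the dynamics.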