Humans can employ techniques to quickly acquire knowledge from specific materials in advance, such as creating self-assessment questions, enabling us to accomplish related tasks more efficiently. In contrast, large language models (LLMs) usually rely on retrieval-augmented generation to exploit knowledge materials on the fly, or require external signals such as human preference data and annotations from stronger LLMs to conduct knowledge adaptation. To unleash the self-learning potential of LLMs, we propose KBAda, an approach designed for efficient adaptation to downstream tasks involving knowledge bases. Our method uses iterative training with self-annotated data such as Q&A pairs and revision suggestions, enabling the model to grasp the knowledge content efficiently. Experimental results on multiple datasets demonstrate the effectiveness of our approach, significantly boosting model performance on downstream tasks that require specific knowledge, at low cost. Notably, our approach achieves over 90% of the performance improvement obtainable with GPT-4-turbo annotation while relying entirely on self-supervision. We release our experimental data, models, and process analyses to the community for further exploration (https://github.com/thunlp/KBAda).