Large language models (LLMs) have demonstrated remarkable effectiveness in text reranking, as shown by works such as RankGPT, by leveraging their human-like reasoning about relevance. However, supervised fine-tuning for ranking often diminishes a model's general-purpose capabilities, including the very reasoning abilities that make LLMs valuable for ranking. We introduce an approach that integrates Chain-of-Thought prompting with an SFT-DPO (Supervised Fine-Tuning followed by Direct Preference Optimization) pipeline to preserve these capabilities while improving ranking performance. Experiments on the TREC 2019 and 2020 Deep Learning datasets show that our approach outperforms the state-of-the-art RankZephyr while maintaining strong performance on the Massive Multitask Language Understanding (MMLU) benchmark, demonstrating that careful fine-tuning strategies can effectively preserve general-purpose capabilities. Our code and data will be publicly released upon acceptance of the paper.
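To make the two-stage SFT-DPO pipeline concrete, the following is a minimal sketch using the HuggingFace TRL library (an assumption: the abstract does not specify a training stack, and TRL >= 0.12 is assumed for the `processing_class` argument). The base model, dataset contents, and hyperparameters are all hypothetical placeholders, not the paper's actual configuration.

```python
# Hypothetical sketch of an SFT-then-DPO pipeline for a CoT reranker.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

model_name = "HuggingFaceH4/zephyr-7b-beta"  # hypothetical base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Stage 1: supervised fine-tuning on chain-of-thought ranking traces.
# Each example flattens a query, candidate passages, a reasoning chain,
# and the final ranked list into one text field.
sft_data = Dataset.from_dict({
    "text": [
        "Query: ...\nPassages: ...\nReasoning: ...\nRanking: [2] > [1] > [3]",
    ]
})
sft_trainer = SFTTrainer(
    model=model,
    train_dataset=sft_data,
    args=SFTConfig(output_dir="sft-ranker", max_steps=1),
)
sft_trainer.train()

# Stage 2: direct preference optimization on preference pairs, where
# "chosen" is a better reasoning chain + ranking and "rejected" is a
# worse one for the same prompt.
dpo_data = Dataset.from_dict({
    "prompt": ["Query: ...\nPassages: ..."],
    "chosen": ["Reasoning: ...\nRanking: [2] > [1] > [3]"],
    "rejected": ["Reasoning: ...\nRanking: [3] > [1] > [2]"],
})
dpo_trainer = DPOTrainer(
    model=sft_trainer.model,  # continue from the SFT checkpoint
    args=DPOConfig(output_dir="dpo-ranker", beta=0.1, max_steps=1),
    train_dataset=dpo_data,
    processing_class=tokenizer,
)
dpo_trainer.train()
```

The key design point illustrated here is that DPO starts from the SFT checkpoint rather than the base model, so the preference stage refines, rather than replaces, the supervised ranking behavior.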