Recent studies have shown that having human experts maintain a consistent response style and enhancing data quality in training sets can significantly improve the performance of fine-tuned Large Language Models (LLMs) while reducing the number of training examples needed. However, the precise definition of style and the relationship between style, data quality, and LLM performance remain unclear. This research decomposes response style into presentation and composition styles and finds that, among training data of similar quality, data with higher style consistency lead to better LLM performance. Inspired by this, we introduce Style Consistency-Aware Response Ranking (SCAR), which automatically prioritizes instruction-response pairs in the training set based on their response stylistic consistency. By selecting the most style-consistent examples, from the top 25% down to as little as 0.7% of the full dataset, the fine-tuned LLMs can match or even surpass the performance of models trained on the entire dataset on coding and open-ended question-answering benchmarks. Code and data are available at https://github.com/zhuang-li/SCAR .
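The selection strategy described above can be sketched in a few lines: rank instruction-response pairs by a style-consistency score and keep only the top fraction. This is a minimal illustration, not the paper's actual implementation; `score_fn` stands in for SCAR's learned ranker, and `toy_score` below is a placeholder scorer invented for the example.

```python
# Hypothetical sketch of SCAR-style data selection: rank instruction-response
# pairs by a style-consistency score, then keep the top fraction for fine-tuning.

def select_top_fraction(pairs, score_fn, fraction=0.25):
    """Keep the top `fraction` of (instruction, response) pairs by score."""
    ranked = sorted(pairs, key=score_fn, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]

# Placeholder scorer for illustration only: SCAR would instead use a learned
# model of response stylistic consistency.
def toy_score(pair):
    instruction, response = pair
    return len(response)

data = [
    ("q1", "short"),
    ("q2", "a medium answer"),
    ("q3", "a much longer, more detailed answer"),
    ("q4", "x"),
]
subset = select_top_fraction(data, toy_score, fraction=0.5)
print(len(subset))  # 2
```

Selecting, say, the top 25% would mean `fraction=0.25`; the abstract reports that fractions as small as 0.7% can still match full-dataset performance.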