Domain-specific large language models (LLMs), typically developed by fine-tuning a pre-trained general-purpose LLM on specialized datasets, represent a significant advancement in applied AI. A common strategy in LLM fine-tuning is curriculum learning, which pre-orders training samples based on metrics such as difficulty to improve learning efficiency compared with random sampling. However, most existing methods for LLM fine-tuning rely on a static curriculum, designed prior to training, which cannot adapt to the model's evolving needs during fine-tuning. To address this, we propose EDCO, a novel framework based on two key concepts: inference entropy and dynamic curriculum orchestration. Inspired by recent findings that maintaining high answer entropy benefits long-term reasoning gains, EDCO prioritizes samples with high inference entropy in a continuously adapted curriculum. EDCO integrates three core components: an efficient entropy estimator that uses prefix tokens to approximate full-sequence entropy, an entropy-based curriculum generator that selects data points with the highest inference entropy, and an LLM trainer that optimizes the model on the selected curriculum. In comprehensive experiments across the communication, medicine, and law domains, EDCO outperforms traditional curriculum strategies when fine-tuning Qwen3-4B and Llama3.2-3B models under both supervised and reinforcement learning settings. Furthermore, the proposed efficient entropy estimation reduces computational time by 83.5% while maintaining high accuracy.
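As a rough illustration only (not the authors' implementation), the Python sketch below shows one plausible way to realize the two mechanisms named above: estimating a sample's inference entropy from a fixed number of prefix-token distributions rather than the full sequence, and selecting the highest-entropy samples for the next curriculum step. The function names, the prefix length, and the mean-per-token-entropy aggregation are all assumptions for the sake of the example.

```python
# Minimal sketch, assuming a HuggingFace-style causal LM whose forward pass
# returns `.logits` of shape (batch, seq_len, vocab). Names such as
# estimate_prefix_entropy, prefix_len, and select_curriculum are illustrative
# placeholders, not identifiers from the paper.
import torch
import torch.nn.functional as F


@torch.no_grad()
def estimate_prefix_entropy(model, input_ids, prefix_len=32):
    """Approximate full-sequence inference entropy from the first
    `prefix_len` token distributions (mean per-token entropy)."""
    logits = model(input_ids).logits[:, :prefix_len, :]   # (B, prefix_len, V)
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    token_entropy = -(probs * log_probs).sum(dim=-1)       # (B, prefix_len)
    return token_entropy.mean(dim=-1)                      # (B,)


def select_curriculum(entropies, sample_indices, k):
    """Pick the k samples with the highest estimated inference entropy
    to form the next slice of the dynamically orchestrated curriculum."""
    topk = torch.topk(entropies, k)
    return [sample_indices[i] for i in topk.indices.tolist()]
```

In this sketch the entropy estimate would be recomputed periodically during training, so the curriculum adapts as the model's uncertainty over the remaining data shifts; how often and over how large a candidate pool this happens is left unspecified here.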