Adapting large language models (LLMs) to low-resource domains remains challenging due to the scarcity of domain-specific data. While in-domain data is limited, there exists a vast amount of general-domain data that shares similar question-answer formats and reasoning patterns with domain tasks. This observation raises an important question: can useful general-domain data be mined to improve low-resource domain adaptation? Our initial findings show that general-domain chain-of-thought data contains useful auxiliary signals for domain adaptation, even without careful selection. This observation motivates a new paradigm for domain adaptation beyond exclusive reliance on domain-specific data. To systematically identify the most beneficial general-domain samples, we propose NTK-Selector, motivated by the Neural Tangent Kernel's ability to capture alignment in training dynamics. Since directly applying NTK to pretrained LLMs is impractical, we introduce a Jacobian-free NTK approximation and empirically demonstrate stable NTK-like behavior during fine-tuning. Extensive experiments across medical, financial, legal, and psychological domains demonstrate that NTK-Selector consistently outperforms domain-only fine-tuning and existing data selection baselines. In particular, NTK-Selector achieves gains of +8.7 and +5.1 points on Llama3-8B-Instruct and Qwen3-8B, respectively, compared to only +0.8 and +0.9 points from domain-only fine-tuning.
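To make the selection idea concrete, the following is a minimal sketch of kernel-based data selection, not the NTK-Selector procedure itself. It assumes a toy two-layer tanh network in place of an LLM and computes the empirical NTK from explicit parameter gradients, which is exactly the step the Jacobian-free approximation is meant to avoid at LLM scale. The scoring rule (mean kernel value against the domain set) and all names (`param_gradient`, `ntk`, `select_top_k`) are illustrative assumptions.

```python
import math
import random


def param_gradient(x, W, v):
    """Gradient of f(x) = sum_j v[j] * tanh(W[j] . x) w.r.t. all parameters, flattened."""
    grad = []
    for w_j, v_j in zip(W, v):
        h = math.tanh(sum(w * xi for w, xi in zip(w_j, x)))
        grad.append(h)                                      # df/dv_j
        grad.extend(v_j * (1.0 - h * h) * xi for xi in x)   # df/dW_j
    return grad


def ntk(x1, x2, W, v):
    """Empirical NTK entry: inner product of the two parameter gradients."""
    g1 = param_gradient(x1, W, v)
    g2 = param_gradient(x2, W, v)
    return sum(a * b for a, b in zip(g1, g2))


def select_top_k(candidates, domain, W, v, k):
    """Score each candidate by its mean NTK value against the domain set
    (an assumed scoring rule) and return the top-k indices plus all scores."""
    scores = [
        sum(ntk(c, d, W, v) for d in domain) / len(domain)
        for c in candidates
    ]
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy parameters: 4 hidden units, 2 input features.
    W = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(4)]
    v = [rng.uniform(-1, 1) for _ in range(4)]

    domain = [[1.0, 0.0], [0.9, 0.1]]                      # scarce in-domain samples
    general = [[1.0, 0.05], [-1.0, 0.0], [0.0, 1.0]]       # general-domain pool

    chosen, scores = select_top_k(general, domain, W, v, k=1)
    print("selected indices:", chosen)
    print("scores:", scores)
```

A large positive kernel value means that a gradient step on the candidate moves the model output on the domain sample in the same direction, which is the sense in which the kernel captures alignment in training dynamics. Candidates with negative scores would push domain predictions the opposite way and are ranked last.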