The scarcity and high cost of high-quality domain-specific question-answering (QA) datasets limit supervised fine-tuning of large language models (LLMs). We introduce $\textbf{D-SCoRE}$, a training-free framework that leverages LLMs and prompt engineering to automatically generate diverse QA datasets enriched with Chain-of-Thought (CoT) reasoning from arbitrary textual sources. By integrating $\textbf{D}$ocument-centric processing, $\textbf{S}$egmentation, $\textbf{Co}$T $\textbf{R}$easoning, and structured $\textbf{E}$xport, together with multi-dimensional controls such as semantic role transformation, question-type balancing, and counterfactual augmentation, D-SCoRE produces tailored QA pairs with enhanced diversity and relevance. LLMs fine-tuned on D-SCoRE-generated datasets outperform those trained on human-annotated QA data across most evaluated domains. Its efficiency and scalability enable rapid, high-performance domain-adaptive fine-tuning on consumer-grade hardware, generating over 1,100 high-quality QA pairs per GPU-hour end-to-end.
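To make the pipeline stages concrete, here is a minimal sketch of the D-SCoRE flow as described above. It is not the paper's implementation: the function names, prompt wording, segment size, question-type list, and the `llm` callable are all illustrative assumptions.

```python
# Minimal sketch of the D-SCoRE stages (Segmentation -> CoT Reasoning ->
# structured Export) under illustrative assumptions. All names here
# (generate_qa, llm, SEGMENT_CHARS, the prompt text) are hypothetical.
import json
from typing import Callable, Iterable

SEGMENT_CHARS = 2000  # assumed segment size; the paper's choice may differ

def segment(document: str, size: int = SEGMENT_CHARS) -> Iterable[str]:
    """Segmentation: split the source document into fixed-size chunks."""
    for i in range(0, len(document), size):
        yield document[i:i + size]

PROMPT = (
    "From the passage below, write one {qtype} question, a step-by-step "
    "chain-of-thought, and the final answer, as JSON with keys "
    "'question', 'cot', 'answer'.\n\nPassage:\n{passage}"
)

def generate_qa(document: str, llm: Callable[[str], str],
                qtypes=("factual", "why", "counterfactual")) -> list[dict]:
    """CoT Reasoning with question-type balancing: one prompt per type
    per segment, so each type is equally represented in the output."""
    pairs = []
    for passage in segment(document):
        for qtype in qtypes:
            raw = llm(PROMPT.format(qtype=qtype, passage=passage))
            try:
                pairs.append(json.loads(raw))
            except json.JSONDecodeError:
                continue  # discard malformed generations
    return pairs

def export_jsonl(pairs: list[dict], path: str) -> None:
    """Structured Export: write QA pairs with CoT to a JSONL training file."""
    with open(path, "w", encoding="utf-8") as f:
        for p in pairs:
            f.write(json.dumps(p, ensure_ascii=False) + "\n")
```

Given any `llm` callable that maps a prompt string to a completion string, a call such as `export_jsonl(generate_qa(doc, llm), "train.jsonl")` would produce a fine-tuning-ready dataset in this sketch.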