Modern language models (LMs) increasingly depend on two critical resources: compute and data. Data selection techniques can effectively reduce the amount of training data required for fine-tuning LMs, but their effectiveness is closely tied to computational resources and typically demands a high compute budget. Motivated by the resource limitations of practical fine-tuning scenarios, we systematically examine the relationship between data selection and uncertainty estimation over the selected data. Although large language models (LLMs) exhibit exceptional capabilities in language understanding and generation, offering new ways to alleviate data scarcity, evaluating data usability remains challenging, which makes efficient data selection indispensable. To address these issues, we propose the Entropy-Based Unsupervised Data Selection (EUDS) framework, which establishes a computationally efficient data-filtering mechanism. Theoretical analysis and empirical experiments on sentiment analysis (SA), topic classification (Topic-CLS), and question answering (Q&A) tasks confirm the effectiveness of our approach: EUDS significantly reduces computational cost and improves training-time efficiency while requiring less data, providing an innovative solution for efficiently fine-tuning LMs in compute-constrained scenarios.
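The abstract does not specify EUDS's exact selection criterion; as a hypothetical illustration only (not the paper's algorithm), entropy-based unsupervised selection can be sketched as ranking unlabeled examples by the Shannon entropy of a model's predictive distribution and keeping the most uncertain ones. The function names, the toy sentiment pool, and the "select highest entropy" rule below are all assumptions for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_by_entropy(examples, probs, k):
    """Illustrative selector: rank examples by predictive entropy
    and keep the k most uncertain (highest-entropy) ones."""
    scored = sorted(zip(examples, probs),
                    key=lambda ep: entropy(ep[1]), reverse=True)
    return [ex for ex, _ in scored[:k]]

# Toy sentiment-analysis pool with hypothetical model class probabilities
pool = ["great movie", "it was fine", "terrible plot", "okay I guess"]
preds = [[0.95, 0.05], [0.55, 0.45], [0.02, 0.98], [0.50, 0.50]]

selected = select_by_entropy(pool, preds, k=2)
# The two near-uniform (most uncertain) predictions are selected
```

Because the score depends only on model predictions, no labels are needed, which is what makes such a filter unsupervised and cheap relative to label-driven selection.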