Large pre-trained language models (PLMs) have attracted significant attention for their versatility and their potential to solve a wide range of natural language processing (NLP) tasks. However, running these PLMs can be prohibitively expensive. Moreover, some PLMs, such as GPT-3, are not open-sourced because of commercial considerations and the risk of misuse, so their parameters and gradients are unavailable. To address this, black-box tuning has been proposed, which uses derivative-free optimization (DFO) instead of gradient descent to train task-specific continuous prompts. However, these gradient-free methods still lag significantly behind gradient-based methods. In this paper, we introduce gradient descent into the black-box tuning scenario through knowledge distillation. Furthermore, we propose a novel method, GDFO, which integrates gradient descent and derivative-free optimization to optimize task-specific continuous prompts in a harmonized manner. Experimental results show that GDFO achieves significant performance gains over previous state-of-the-art methods.
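To make the black-box setting concrete, the following is a minimal sketch of derivative-free optimization of a continuous prompt vector when only loss values, not gradients, can be queried. It is not the GDFO method and not the optimizer used in prior black-box tuning work: it uses plain random-perturbation hill climbing as a stand-in, and `black_box_loss`, its `target` vector, and all hyperparameters are hypothetical placeholders for a loss computed from a PLM's API outputs.

```python
import random


def black_box_loss(prompt):
    # Hypothetical stand-in for a loss obtained by querying a PLM API.
    # Only the scalar value is observable; no gradients are available.
    target = [0.5, -1.0, 2.0, 0.0]  # assumed optimum, for illustration only
    return sum((p - t) ** 2 for p, t in zip(prompt, target))


def derivative_free_search(loss_fn, dim, steps=2000, sigma=0.1, seed=0):
    # Simple hill climbing: perturb the current best prompt with Gaussian
    # noise and keep the candidate only if it lowers the observed loss.
    rng = random.Random(seed)
    best = [0.0] * dim
    best_loss = loss_fn(best)
    for _ in range(steps):
        candidate = [x + rng.gauss(0.0, sigma) for x in best]
        candidate_loss = loss_fn(candidate)
        if candidate_loss < best_loss:
            best, best_loss = candidate, candidate_loss
    return best, best_loss


initial_loss = black_box_loss([0.0] * 4)
prompt, final_loss = derivative_free_search(black_box_loss, dim=4)
print(f"initial loss: {initial_loss:.4f}, final loss: {final_loss:.4f}")
```

Because a candidate is accepted only when it strictly improves the loss, the final loss can never exceed the initial one. Each step costs one query to the black box, which is one reason such methods tend to be less sample-efficient than gradient-based training.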