Algorithm selection aims to identify the most suitable algorithm for a given problem before execution, and it has become a critical step in AutoML. Current mainstream algorithm selection techniques rely heavily on feature representations of problems and use the performance of each algorithm as the supervision signal. However, algorithm features have received little attention. This gap stems primarily from the inherent complexity of algorithms, which makes it difficult to find a feature extraction method that works across a diverse range of algorithms. Neglecting algorithm features hurts the accuracy of algorithm selection and indirectly increases the amount of problem data needed for training. This paper takes a step towards closing this gap by proposing an approach that integrates algorithm representation into the algorithm selection process. Specifically, the proposed model uses separate modules to extract representations of problems and algorithms, where the algorithm representation leverages the code comprehension capabilities of pre-trained LLMs. After embedding vectors are extracted for both algorithms and problems, the most suitable algorithm is determined by computing matching degrees between them. Our experiments validate the effectiveness of the proposed model and compare the performance of different embedded pre-trained LLMs, which suggests that the proposed algorithm selection framework could serve as a baseline task for evaluating the code representation capabilities of LLMs.
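The final matching step can be illustrated with a minimal sketch. The abstract does not specify how matching degrees are computed, so cosine similarity is used here purely as an assumption; the function names `cosine_similarity` and `select_algorithm`, the algorithm names, and the toy embedding values are all hypothetical. In the proposed model the embeddings would come from the problem and algorithm representation modules (the latter built on a pre-trained LLM), not from hand-written vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (assumed matching degree)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # A zero vector has no direction; treat it as matching nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def select_algorithm(problem_embedding, algorithm_embeddings):
    """Score every candidate algorithm against the problem and return the best match."""
    scores = {
        name: cosine_similarity(problem_embedding, embedding)
        for name, embedding in algorithm_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings (hypothetical values, for illustration only).
problem = [0.9, 0.1, 0.0]
algorithms = {
    "solver_a": [1.0, 0.0, 0.0],
    "solver_b": [0.0, 1.0, 0.0],
    "solver_c": [0.0, 0.0, 1.0],
}

best, scores = select_algorithm(problem, algorithms)
print(best)
```

A fixed similarity function keeps the sketch short; a learned matching module trained on algorithm performance data could replace `cosine_similarity` without changing the selection logic.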