Large Language Model-Enhanced Algorithm Selection: Towards Comprehensive Algorithm Representation

Algorithm selection aims to identify the most suitable algorithm for a given problem before execution, and it has become a critical step in AutoML. Current mainstream algorithm selection techniques rely heavily on feature representations of the problems and use the performance of each algorithm as the supervision signal. Algorithm features, however, have received little attention. This gap stems largely from the inherent complexity of algorithms, which makes it difficult to find a feature extraction method that works across a diverse range of algorithms. Neglecting algorithm features is likely to reduce selection accuracy and indirectly increases the amount of problem data needed for training. This paper addresses the gap by proposing an approach that integrates algorithm representation into the algorithm selection process. Specifically, the proposed model uses separate modules to extract representations of problems and of algorithms, where the algorithm representation leverages the code comprehension capabilities of pre-trained LLMs. Once embedding vectors have been extracted for both algorithms and problems, the most suitable algorithm is determined by computing the matching degree between them. Our experiments validate the effectiveness of the proposed model and compare the performance of different pre-trained LLMs used for the embedding, which suggests that the proposed algorithm selection framework could serve as a baseline task for evaluating the code representation capabilities of LLMs.
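
The final matching step can be illustrated with a minimal sketch. The abstract does not specify how the matching degree is computed, so cosine similarity is used here purely as an assumption; the names `cosine_similarity` and `select_algorithm`, the algorithm labels, and the toy three-dimensional vectors are all hypothetical stand-ins for the embeddings that the problem module and the LLM-based algorithm module would produce.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_algorithm(problem_embedding, algorithm_embeddings):
    """Score every candidate algorithm against the problem and return the best.

    The matching degree is assumed to be cosine similarity; the actual
    model may use a different (possibly learned) matching function.
    """
    scores = {
        name: cosine_similarity(problem_embedding, embedding)
        for name, embedding in algorithm_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical embeddings: in practice the algorithm vectors would come from
# a pre-trained LLM applied to algorithm code, and the problem vector from a
# separate problem-feature module.
algorithm_embeddings = {
    "algo_a": [1.0, 0.0, 0.0],
    "algo_b": [0.0, 1.0, 0.0],
    "algo_c": [0.6, 0.8, 0.0],
}
problem_embedding = [0.1, 0.9, 0.0]

best, scores = select_algorithm(problem_embedding, algorithm_embeddings)
print(best)
```

With these toy vectors the problem embedding lies closest to `algo_b`, so that candidate is selected. A learned matching function would take the place of `cosine_similarity` without changing the surrounding selection logic.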
