When can you trust feature selection? -- I: A condition-based analysis of LASSO and generalised hardness of approximation

The arrival of AI techniques in computations, with the potential for hallucinations and non-robustness, has made trustworthiness of algorithms a focal point. However, the trustworthiness of many classical approaches is not well understood. This is the case for feature selection, a classical problem in the sciences, statistics, and machine learning. Here, the LASSO optimisation problem is standard. Despite its widespread use, it has not been established when the output of algorithms that attempt to compute support sets of minimisers of LASSO, in order to do feature selection, can be trusted. In this paper we establish that no (randomised) algorithm that works on all inputs can determine the correct support sets (with probability $> 1/2$) of minimisers of LASSO when reading approximate input, regardless of precision and computing power. However, we define a LASSO condition number and design an efficient algorithm that computes these support sets, provided the input data is well-posed (has finite condition number), in time polynomial in the dimensions and the logarithm of the condition number. For ill-posed inputs the algorithm runs forever, so it never produces a wrong answer. Furthermore, the algorithm computes an upper bound for the condition number when this is finite. Finally, for any algorithm defined on an open set containing a point with infinite condition number, there is an input for which the algorithm will either run forever or produce a wrong answer. Our impossibility results stem from generalised hardness of approximation, formulated within the Solvability Complexity Index (SCI) hierarchy framework, which generalises the classical phenomenon of hardness of approximation.
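To make the instability of support sets concrete, the following is a minimal toy sketch, not taken from the paper. It assumes the unconstrained LASSO objective $\|Ax - y\|_2^2 + \lambda \|x\|_1$ (the abstract does not state the exact formulation) with a single-row matrix $A$, and the helper names `lasso_1row`, `satisfies_kkt`, and `support` are hypothetical. Perturbing $A = (1, 1)$ by $10^{-9}$ in either entry flips the support of the minimiser, while the unperturbed problem admits minimisers with three different supports.

```python
def lasso_1row(a, y, lam):
    """Candidate minimiser of (a.x - y)^2 + lam*||x||_1 for a single-row
    matrix a with positive entries and a unique largest entry.
    Assumes y > lam / (2 * max(a)), so the nonzero coordinate is positive."""
    j = max(range(len(a)), key=lambda i: a[i])   # coordinate with largest entry
    x = [0.0] * len(a)
    x[j] = (y - lam / (2 * a[j])) / a[j]         # stationary point of the 1-D problem
    return x


def satisfies_kkt(a, y, lam, x, tol=1e-12):
    """Check the (sufficient) optimality conditions of the convex problem:
    2*a_i*r = lam*sign(x_i) where x_i != 0, and |2*a_i*r| <= lam where x_i == 0,
    with residual r = y - a.x."""
    r = y - sum(ai * xi for ai, xi in zip(a, x))
    for ai, xi in zip(a, x):
        g = 2 * ai * r
        if xi > 0 and abs(g - lam) > tol:
            return False
        if xi < 0 and abs(g + lam) > tol:
            return False
        if xi == 0 and abs(g) > lam + tol:
            return False
    return True


def support(x):
    """Indices of the nonzero entries of x."""
    return {i for i, xi in enumerate(x) if xi != 0}


eps = 1e-9
x_left = lasso_1row([1 + eps, 1.0], 1.0, 1.0)    # tiny perturbation of the first entry
x_right = lasso_1row([1.0, 1 + eps], 1.0, 1.0)   # tiny perturbation of the second entry
print(support(x_left), support(x_right))
```

In this toy setting, an algorithm that only sees $A$ to finite precision cannot distinguish the two perturbed inputs from $A = (1, 1)$, yet their supports differ. This is only an illustration of why ill-posed inputs are problematic; it does not reproduce the paper's condition number or algorithm.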
