Decision trees are highly interpretable models for solving classification problems in machine learning (ML). The standard ML algorithms for training decision trees are fast but generate trees that are suboptimal in terms of accuracy. Discrete optimization models in the literature address the optimality problem but work well only on relatively small datasets. \cite{firat2020column} proposed a column-generation-based heuristic approach for learning decision trees that improves scalability and can handle large datasets. In this paper, we describe improvements to this column generation approach. First, we modify the subproblem model to significantly reduce the number of subproblems in multiclass classification instances. Next, we show that the data-dependent constraints in the master problem are implied, and we use them as cutting planes. Furthermore, we describe a separation model that identifies data points whose corresponding constraints are violated by the linear programming relaxation solution. We conclude with computational results showing that these modifications improve scalability.