Subgroup-discovery methods allow users to obtain simple descriptions of interesting regions in a dataset. Using constraints in subgroup discovery can enhance interpretability even further. In this article, we focus on two types of constraints: First, we limit the number of features used in subgroup descriptions, making the descriptions sparse. Second, we propose the novel optimization problem of finding alternative subgroup descriptions, which cover a similar set of data objects as a given subgroup but use different features. We describe how to integrate both constraint types into heuristic subgroup-discovery methods. Further, we propose a novel Satisfiability Modulo Theories (SMT) formulation of subgroup discovery as a white-box optimization problem, which allows solver-based search for subgroups and is open to a variety of constraint types. Additionally, we prove that both constraint types lead to an NP-hard optimization problem. Finally, we employ 27 binary-classification datasets to compare algorithmic and solver-based search for unconstrained and constrained subgroup discovery. We observe that heuristic search methods often yield high-quality subgroups within a short runtime, even in scenarios with constraints.
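To make the two constraint types concrete, the following is a minimal sketch rather than the method of the article. It assumes several things the abstract does not state: subgroup descriptions are taken to be per-feature intervals, subgroup quality is measured with weighted relative accuracy (WRAcc), similarity between subgroups is measured with the Jaccard index, and the search is a brute-force enumeration that is only feasible on toy data. All function names (`wracc`, `jaccard`, `membership`, `exhaustive_search`) are hypothetical. The sparsity constraint appears as the limit `k` on the number of bounded features, and the alternative description is found by searching only over features the original description does not use.

```python
from itertools import combinations, product


def wracc(member, y):
    """Weighted relative accuracy of a subgroup (assumed quality measure)."""
    n, n_sub = len(y), sum(member)
    if n_sub == 0:
        return 0.0
    pos_sub = sum(label for m, label in zip(member, y) if m)
    return (n_sub / n) * (pos_sub / n_sub - sum(y) / n)


def jaccard(member_a, member_b):
    """Jaccard similarity of two membership vectors (assumed similarity measure)."""
    inter = sum(a and b for a, b in zip(member_a, member_b))
    union = sum(a or b for a, b in zip(member_a, member_b))
    return inter / union if union else 1.0


def membership(X, bounds):
    """True for each row that satisfies every interval in `bounds`."""
    return [all(lo <= row[j] <= hi for j, (lo, hi) in bounds.items()) for row in X]


def exhaustive_search(X, score, allowed_features, k):
    """Try every interval description that bounds at most k allowed features."""
    best_bounds = {}  # the empty description covers all rows
    best_score = score(membership(X, best_bounds))
    for size in range(1, k + 1):
        for feats in combinations(allowed_features, size):
            options = []
            for j in feats:
                vals = sorted({row[j] for row in X})
                # All intervals [lo, hi] whose endpoints are observed values.
                options.append([(lo, hi) for i, lo in enumerate(vals) for hi in vals[i:]])
            for intervals in product(*options):
                bounds = dict(zip(feats, intervals))
                s = score(membership(X, bounds))
                if s > best_score:  # strict comparison keeps the first best description
                    best_bounds, best_score = bounds, s
    return best_bounds, best_score


# Toy data: two features, binary target.
X = [[1, 10], [2, 20], [3, 21], [4, 30]]
y = [0, 1, 1, 0]

# Sparse subgroup: at most k = 1 feature may be bounded.
original, quality = exhaustive_search(X, lambda m: wracc(m, y), [0, 1], k=1)
# original == {0: (2, 3)}, quality == 0.25

# Alternative description: same rows, but only features not used before.
original_member = membership(X, original)
unused = [j for j in range(len(X[0])) if j not in original]
alternative, similarity = exhaustive_search(
    X, lambda m: jaccard(m, original_member), unused, k=1
)
# alternative == {1: (20, 21)}, similarity == 1.0
```

The enumeration grows exponentially with `k` and with the number of distinct feature values, which is consistent with the NP-hardness result stated above and illustrates why heuristic or solver-based search is of interest for realistic datasets.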