Identifying differences between groups is one of the most important knowledge discovery problems. The procedure, also known as contrast set mining, is applied in a wide range of areas such as medicine, industry, and economics. In this paper we present RuleKit-CS, an algorithm for contrast set mining based on separate and conquer, a well-established heuristic for decision rule induction. Multiple passes accompanied by an attribute penalization scheme yield contrast sets that describe the same examples with different attributes, which distinguishes the presented approach from standard separate and conquer. The algorithm was also generalized to regression and survival data, allowing identification of contrast sets whose label attribute or survival prognosis is consistent with the label or prognosis of the predefined contrast groups. This feature, not provided by existing approaches, further extends the usability of RuleKit-CS. Experiments on over 130 data sets from various areas and a detailed analysis of selected cases confirmed RuleKit-CS to be a useful tool for discovering differences between defined groups. The algorithm was implemented as a part of the RuleKit suite, available on GitHub under the GNU AGPL 3 licence (https://github.com/adaa-polsl/RuleKit).

Keywords: contrast sets, separate and conquer, regression, survival
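To make the idea of multiple separate-and-conquer passes with attribute penalization more concrete, the following is a minimal toy sketch, not the RuleKit-CS implementation and not the RuleKit API. All function names are hypothetical. It assumes nominal attributes, uses plain precision as the rule quality measure, and subtracts an additive penalty for every earlier use of an attribute; the actual quality measures and penalty scheme in RuleKit-CS may differ.

```python
from collections import Counter


def covers(rule, row):
    """A rule is a dict {attribute: value}; it covers rows matching all conditions."""
    return all(row[a] == v for a, v in rule.items())


def precision(rule, rows, labels, group, uncovered):
    """Assumed quality measure: precision on still-uncovered examples of `group`."""
    p = sum(1 for i in uncovered if covers(rule, rows[i]))
    n = sum(1 for r, y in zip(rows, labels) if y != group and covers(rule, r))
    return p / (p + n) if p else 0.0


def grow_rule(rows, labels, group, uncovered, usage, penalty):
    """Greedily add the condition that most improves the penalized quality."""
    def score(rule):
        # Assumed penalty form: linear in how often each attribute was used before.
        used = sum(usage[a] for a in rule)
        return precision(rule, rows, labels, group, uncovered) - penalty * used

    rule, best = {}, score({})
    while True:
        candidates = sorted({(a, v)
                             for i in uncovered if covers(rule, rows[i])
                             for a, v in rows[i].items() if a not in rule})
        choice = None
        for a, v in candidates:
            s = score({**rule, a: v})
            if s > best:
                best, choice = s, (a, v)
        if choice is None:
            return rule
        rule[choice[0]] = choice[1]


def mine_contrast_sets(rows, labels, group, passes=2, penalty=0.3):
    usage = Counter()  # attribute usage persists across passes
    found = []
    for _ in range(passes):
        # Each pass starts again from all examples of the contrast group.
        uncovered = {i for i, y in enumerate(labels) if y == group}
        while uncovered:
            rule = grow_rule(rows, labels, group, uncovered, usage, penalty)
            if not rule:
                break
            # "Separate": drop the examples the new rule covers, then "conquer" the rest.
            uncovered -= {i for i in uncovered if covers(rule, rows[i])}
            usage.update(rule.keys())
            if rule not in found:
                found.append(rule)
    return found


# Toy data: both attributes separate group "A" from group "B" equally well.
rows = [
    {"color": "red", "shape": "round"},
    {"color": "red", "shape": "round"},
    {"color": "blue", "shape": "square"},
    {"color": "blue", "shape": "square"},
]
labels = ["A", "A", "B", "B"]
contrast_sets = mine_contrast_sets(rows, labels, "A")
print(contrast_sets)
```

On this toy data the first pass describes group "A" with the `color` attribute; in the second pass `color` is penalized, so the same examples are described with `shape` instead. A single pass of standard separate and conquer would stop after the first description.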