Understanding Concept Identification as Consistent Data Clustering Across Multiple Feature Spaces

Identifying meaningful concepts in large data sets can provide valuable insights into engineering design problems. Concept identification aims to identify non-overlapping groups of design instances that are similar in the joint space of all features and remain similar when only subsets of features are considered. These subsets usually comprise features that characterize a design with respect to one specific context, for example, constructive design parameters, performance values, or operation modes. It is therefore desirable to evaluate the quality of design concepts by considering several of these feature subsets in isolation. In particular, meaningful concepts should not only form dense, well-separated groups of data instances, but these groups should also remain non-overlapping when each pre-defined feature subset is considered separately. In this work, we propose to view concept identification as a special form of clustering with a broad range of potential applications beyond engineering design. To illustrate the differences between concept identification and classical clustering algorithms, we apply a recently proposed concept identification algorithm to two synthetic data sets and compare the identified solutions. In addition, we introduce mutual information as a metric to evaluate whether solutions return consistent clusters across relevant feature subsets. To support this novel understanding of concept identification, we consider a simulated data set from a decision-making problem in the energy management domain and show that the identified clusters are more interpretable with respect to relevant feature subsets than clusters found by common clustering algorithms and are thus more suitable to support a decision maker.
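
As an illustration of how mutual information could quantify cluster consistency across feature subsets, the following minimal sketch compares two labelings of the same design instances, one obtained in each feature subset. The abstract does not state how the measure is computed, so everything here is an assumption made for illustration: the function names, the square-root normalization, and the toy label vectors are hypothetical and are not taken from the proposed algorithm.

```python
from collections import Counter
from math import log, sqrt


def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two labelings of the same instances."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same instances")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        # p(a, b) * log( p(a, b) / (p(a) * p(b)) ), written with raw counts
        mi += (n_ab / n) * log(n_ab * n / (count_a[a] * count_b[b]))
    return mi


def entropy(labels):
    """Shannon entropy (in nats) of a single labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mi(labels_a, labels_b):
    """MI scaled to [0, 1] by the geometric mean of the entropies (an assumed choice)."""
    h_a, h_b = entropy(labels_a), entropy(labels_b)
    if h_a == 0.0 or h_b == 0.0:
        # A labeling with a single cluster carries no information to share.
        return 0.0
    return mutual_information(labels_a, labels_b) / sqrt(h_a * h_b)


# Hypothetical cluster labels for six designs in two feature subsets.
labels_design_params = [0, 0, 0, 1, 1, 1]
labels_performance = [1, 1, 1, 0, 0, 0]  # same grouping, different label names

consistency = normalized_mi(labels_design_params, labels_performance)
print(f"normalized MI: {consistency:.3f}")
```

A value close to 1 indicates that the groups found in one feature subset are reproduced in the other, independent of how the clusters are named, whereas a value close to 0 indicates that the groupings are unrelated. In a setting with more than two feature subsets, such a score could be computed for every pair of subsets.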
