In large datasets, it is hard to discover and analyze structure. It is thus common to introduce tags or keywords for the items, and applications then filter such datasets based on these tags. Still, even medium-sized datasets with only a few tags result in complex systems that are hard for humans to navigate. In this work, we adopt the method of ordinal factor analysis to address this problem. An ordinal factor arranges a subset of the tags in a linear order based on their underlying structure. A complete ordinal factorization, which consists of such ordinal factors, precisely represents the original dataset. Based on such an ordinal factorization, we provide a way to discover and explain relationships between different items and attributes in the dataset. However, computing even a single ordinal factor of high cardinality is computationally hard. We thus propose a greedy algorithm that extracts ordinal factors using existing fast algorithms developed in formal concept analysis. We then leverage this algorithm to propose a comprehensive way to discover relationships in the dataset. Furthermore, we introduce a distance measure, based on the representation emerging from the ordinal factorization, to discover similar items. To evaluate the method, we conduct a case study on different datasets.