Managing the semantic quality of category assignments in large textual datasets, such as Wikipedia, presents significant challenges in terms of complexity and cost. In this paper, we propose leveraging transformer models to distill semantic information from the texts in the Wikipedia dataset and their associated categories into a latent space. We then explore different approaches based on these encodings to assess and enhance the semantic identity of the categories. Our graphical approach is powered by convex hulls, while the hierarchical approach relies on Hierarchical Navigable Small World (HNSW) graphs. To compensate for the information loss caused by dimensionality reduction, we formulate the following mathematical solution: an exponential decay function driven by the Euclidean distances between the high-dimensional encodings of the textual categories. This function acts as a filter built around a contextual category and retrieves items with a certain Reconsideration Probability (RP). Retrieving high-RP items gives database administrators a tool for improving data groupings: it provides recommendations and identifies outliers within a contextual framework.
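The exponential decay filter described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the `decay_rate` parameter, and its default value are assumptions, since the abstract only specifies that RP decays exponentially with the Euclidean distance between high-dimensional encodings.

```python
import numpy as np

def reconsideration_probability(context_encoding, item_encoding, decay_rate=1.0):
    """Exponential decay filter centered on a contextual category.

    Returns a value in (0, 1]: 1.0 when the item encoding coincides
    with the contextual category's encoding, decaying toward 0 as the
    Euclidean distance grows. `decay_rate` is a hypothetical tuning
    parameter controlling how quickly the probability falls off.
    """
    distance = np.linalg.norm(np.asarray(context_encoding) - np.asarray(item_encoding))
    return float(np.exp(-decay_rate * distance))
```

Items whose RP exceeds a chosen threshold would be surfaced to the administrator as candidates for regrouping or flagged as contextual outliers.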