Knowledge Manifold: A Riemannian Geometric Framework for Semantic Mapping and Geodesic Analysis of Scientific Literature

We present the knowledge manifold: a Riemannian geometric space in which the documents of a corpus are positioned according to semantic relationships derived from character n-gram TF-IDF representations. The framework proceeds in five tightly coupled stages. First, each document is converted to a character-level n-gram TF-IDF vector (4- to 7-grams, up to 250,000 features, L2-normalized) and embedded in a two-dimensional knowledge map via constrained stress minimization with repulsion, variance, and centering regularizers. Second, knowledge at an arbitrary query point is estimated by Smoothed Particle Hydrodynamics (SPH) interpolation with a cubic-spline kernel, yielding an interpolated TF-IDF feature vector that can be characterized linguistically. Third, directional knowledge gradients at 0, 45, and 90 degrees are computed from the SPH interpolation map, and pairwise directional similarity is quantified by inner product and cosine similarity. Fourth, a Gaussian Process Regression (GPR) model with a Constant × RBF + White kernel, fitted on a 10-dimensional SVD projection, provides a Bayesian posterior mean, an uncertainty estimate, and per-document contribution rates at the query point. Fifth, geodesics in the knowledge space are obtained by minimizing a discrete Riemannian path energy derived from the SPH-induced metric tensor, using L-BFGS-B started from seven deterministic initial-path candidates. We apply the framework to a corpus of 20 papers on fiber-reinforced composite materials and aerospace structural mechanics, and show that the semantic map recovers meaningful research clusters, that geodesic paths reveal natural conceptual bridges between distant topics, and that SPH/GPR interpolation enables the generation of virtual knowledge: hypothetical paper abstracts describing unstudied but geometrically predicted research directions.
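As an illustration of the second stage, the following is a minimal sketch of SPH interpolation at a query point on the two-dimensional map. It assumes the standard Monaghan cubic-spline kernel with 2D normalization constant 10/(7πh²) and a Shepard-normalized weighted average. The abstract specifies neither the normalization nor the smoothing length, so `h`, the function names, and the toy positions and feature vectors are hypothetical choices made only for this example.

```python
import math


def cubic_spline_kernel(r, h):
    """Monaghan cubic-spline kernel in 2D; compact support of radius 2h (assumed form)."""
    q = r / h
    sigma = 10.0 / (7.0 * math.pi * h * h)  # 2D normalization constant
    if q < 1.0:
        return sigma * (1.0 - 1.5 * q * q + 0.75 * q ** 3)
    if q < 2.0:
        return sigma * 0.25 * (2.0 - q) ** 3
    return 0.0


def sph_interpolate(query, positions, features, h):
    """Shepard-normalized SPH estimate of the feature vector at `query`.

    positions: list of (x, y) document coordinates on the map
    features:  list of equal-length feature vectors (e.g. TF-IDF rows)
    Returns None when no document lies inside the kernel support.
    """
    weights = [cubic_spline_kernel(math.dist(query, p), h) for p in positions]
    total = sum(weights)
    if total == 0.0:
        return None
    dim = len(features[0])
    return [
        sum(w * f[k] for w, f in zip(weights, features)) / total
        for k in range(dim)
    ]


# Toy data (hypothetical): three documents with two-dimensional feature vectors.
positions = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

# Query point equidistant from all three documents.
estimate = sph_interpolate((0.5, 0.5), positions, features, h=1.0)
print(estimate)
```

Because the weights are divided by their sum, the kernel's normalization constant cancels in this variant. It would matter in an unnormalized SPH sum that also carries per-particle mass and density terms, which this sketch omits.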
