Data analysis machinery often requires a numeric representation of its input. To that end, a common practice is to embed components of structured data into a high-dimensional vector space. We study the embedding of the tuples of a relational database, where existing techniques are often based on optimization tasks over a collection of random walks from the database. This paper focuses on the recent FoRWaRD algorithm, which is designed for dynamic databases and samples walks by following foreign keys between tuples. Importantly, different walks have different schemas, or "walk schemes", which are derived by listing the relations and attributes along the walk, and different walk schemes describe relationships of different natures in the database. We show that by focusing on a few informative walk schemes, we can obtain tuple embeddings significantly faster while retaining their quality. We define the problem of scheme selection for tuple embedding, devise several approaches and strategies for scheme selection, and conduct a thorough empirical study of their performance over a collection of downstream tasks. Our results confirm that, with effective strategies for scheme selection, we can obtain high-quality embeddings considerably (e.g., three times) faster, preserve the extensibility to newly inserted tuples, and even increase the precision of some tasks.
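To make the notion of a walk scheme concrete, the following is a minimal sketch that enumerates walk schemes over a toy schema by following foreign keys in either direction. It is an illustration only, not the FoRWaRD implementation: the `Movie`/`Studio`/`Cast`/`Actor` schema, the `ForeignKey` record, and the helper names `steps_from`, `walk_schemes`, and `format_scheme` are all hypothetical, and traversing foreign keys in both directions is an assumption of this sketch.

```python
from collections import namedtuple

# A foreign key: src_rel.src_attr references dst_rel.dst_attr.
ForeignKey = namedtuple("ForeignKey", "src_rel src_attr dst_rel dst_attr")

# Hypothetical toy schema, chosen for illustration (not taken from the paper).
FOREIGN_KEYS = [
    ForeignKey("Movie", "studio_id", "Studio", "id"),
    ForeignKey("Cast", "movie_id", "Movie", "id"),
    ForeignKey("Cast", "actor_id", "Actor", "id"),
]


def steps_from(relation):
    """Yield (from_rel, from_attr, to_rel, to_attr) steps leaving `relation`.

    Assumption: a foreign key may be followed in both directions.
    """
    for fk in FOREIGN_KEYS:
        if fk.src_rel == relation:
            yield (fk.src_rel, fk.src_attr, fk.dst_rel, fk.dst_attr)
        if fk.dst_rel == relation:
            yield (fk.dst_rel, fk.dst_attr, fk.src_rel, fk.src_attr)


def walk_schemes(start, max_len):
    """Enumerate all walk schemes of length 1..max_len starting at `start`.

    A scheme is a tuple of steps; it lists the relations and attributes
    along the walk without referring to any concrete tuple.
    """
    schemes = []
    frontier = [((), start)]  # (scheme so far, relation reached)
    for _ in range(max_len):
        next_frontier = []
        for prefix, rel in frontier:
            for step in steps_from(rel):
                scheme = prefix + (step,)
                schemes.append(scheme)
                next_frontier.append((scheme, step[2]))
        frontier = next_frontier
    return schemes


def format_scheme(scheme):
    """Render a scheme as a readable chain of relation[attribute] steps."""
    return " -> ".join(f"{a}[{b}]-{c}[{d}]" for a, b, c, d in scheme)


if __name__ == "__main__":
    for s in walk_schemes("Movie", 2):
        print(format_scheme(s))
```

Even on this small schema the number of schemes grows with the walk length, which is why restricting attention to a few informative schemes can reduce the cost of sampling and training. How to pick those schemes is the subject of the paper and is not modeled in this sketch.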