Locality-Preserving Minimal Perfect Hashing of k-mers - 专知论文

会员服务 ·

0

哈希 · 映射 · DOT · 局部性 · 哈希学习 ·

2023 年 4 月 12 日

Locality-Preserving Minimal Perfect Hashing of k-mers

翻译：保持邻接性的k-mer最小完美哈希

Giulio Ermanno Pibiri,Yoshihiro Shibuya,Antoine Limasset

from arxiv, Accepted to ISMB 2023

Minimal perfect hashing is the problem of mapping a static set of $n$ distinct keys into the address space $\{1,\ldots,n\}$ bijectively. It is well-known that $n\log_2(e)$ bits are necessary to specify a minimal perfect hash function (MPHF) $f$, when no additional knowledge of the input keys is to be used. However, it is often the case in practice that the input keys have intrinsic relationships that we can exploit to lower the bit complexity of $f$. For example, consider a string and the set of all its distinct $k$-mers as input keys: since two consecutive $k$-mers share an overlap of $k-1$ symbols, it seems possible to beat the classic $\log_2(e)$ bits/key barrier in this case. Moreover, we would like $f$ to map consecutive $k$-mers to consecutive addresses, as to also preserve as much as possible their relationship in the codomain. This is a useful feature in practice as it guarantees a certain degree of locality of reference for $f$, resulting in a better evaluation time when querying consecutive $k$-mers. Motivated by these premises, we initiate the study of a new type of locality-preserving MPHF designed for $k$-mers extracted consecutively from a collection of strings. We design a construction whose space usage decreases for growing $k$ and discuss experiments with a practical implementation of the method: in practice, the functions built with our method can be several times smaller and even faster to query than the most efficient MPHFs in the literature.

翻译：最小完美哈希是将一组静态的$n$个不同键双射映射到地址空间$\{1,\ldots,n\}$的问题。众所周知，当不利用输入键的额外信息时，指定一个最小完美哈希函数（MPHF）$f$需要$n\log_2(e)$比特。然而在实际中，输入键往往具有内在关联性，我们可以利用这些关联来降低$f$的比特复杂度。例如，考虑一个字符串及其所有不同$k$-mer作为输入键：由于两个连续$k$-mer共享$k-1$个符号的重叠，在这种情况下似乎可能突破经典的$\log_2(e)$比特/键的界限。此外，我们希望$f$能将连续的$k$-mer映射到连续的地址，从而在值域中尽可能保留它们的关联性。这在实践中是一个有用的特性，因为它保证了$f$具有某种程度的局部引用性，从而在查询连续$k$-mer时获得更优的评估时间。基于这些前提，我们启动了一类新型保持邻接性MPHF的研究，该函数专为从字符串集合中连续提取的$k$-mer设计。我们设计了一种空间使用量随$k$增大而减少的构建方法，并讨论了该方法的实际实现实验：在实践中，使用我们的方法构建的函数在空间上可比文献中最有效的MPHF小数倍，甚至查询速度更快。

0

相关内容

【硬核书】规划算法 (Planning Algorithm)，1023页pdf，Steven M. Illinois大学

【硬核书】规划算法 (Planning Algorithm)，1023页pdf，Steven M. Illinois大学

专知会员服务

167+阅读 · 2022年4月10日

【CVPR 2022】基于本地正则化和稀疏化差分隐私的联邦学习，Differentially Private Federated Learning with Local Regularization and Sparsification

【CVPR 2022】基于本地正则化和稀疏化差分隐私的联邦学习，Differentially Private Federated Learning with Local Regularization and Sparsification

专知会员服务

17+阅读 · 2022年3月19日

【经典书】线性代数，436页pdf

专知会员服务

78+阅读 · 2021年3月16日

【AAAI2021】属性引导对抗训练的自然扰动鲁棒性

专知会员服务

26+阅读 · 2021年1月21日

【AAAI2021】克服图神经网络灾难性遗忘，Overcoming Catastrophic Forgetting in GNN

【AAAI2021】克服图神经网络灾难性遗忘，Overcoming Catastrophic Forgetting in GNN

专知会员服务

18+阅读 · 2020年12月15日

【北京大学】Locally Differentially Private (Contextual) Bandits Learning

【北京大学】Locally Differentially Private (Contextual) Bandits Learning

专知会员服务

13+阅读 · 2020年6月8日

【SIGMOD2020-CMU】在内存中搜索树的顺序保持键压缩，Order-Preserving Key Compression for In-Memory Search Trees

【SIGMOD2020-CMU】在内存中搜索树的顺序保持键压缩，Order-Preserving Key Compression for In-Memory Search Trees

专知会员服务

15+阅读 · 2020年3月7日

【新书】数字图像(影像)处理手第二版，2176pdf，Mathematical Methods in Imaging

【新书】数字图像(影像)处理手第二版，2176pdf，Mathematical Methods in Imaging

专知会员服务

93+阅读 · 2020年2月12日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

65+阅读 · 2019年10月9日

VCIP 2022 Call for Demos

VCIP 2022 Call for Demos

CCF多媒体专委会

1+阅读 · 2022年6月6日

【泡泡一分钟】通过学习轮式里程计和IMU误差的定位

【泡泡一分钟】通过学习轮式里程计和IMU误差的定位

泡泡机器人SLAM

133+阅读 · 2019年9月12日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

PyTorch中在反向传播前为什么要手动将梯度清零？

PyTorch中在反向传播前为什么要手动将梯度清零？

极市平台

39+阅读 · 2019年1月23日

逆强化学习-学习人先验的动机

逆强化学习-学习人先验的动机

CreateAMind

16+阅读 · 2019年1月18日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

【论文推荐】最新六篇主题模型相关论文—领域特定知识库、神经变分推断、动态和静态主题模型

【论文推荐】最新六篇主题模型相关论文—领域特定知识库、神经变分推断、动态和静态主题模型

专知

19+阅读 · 2018年6月26日

【论文推荐】最新十篇度量学习相关论文—可量化表示、非线性度量学习、在线深度量学习、大间隔最近邻、判别深度度量、域自适应

【论文推荐】最新十篇度量学习相关论文—可量化表示、非线性度量学习、在线深度量学习、大间隔最近邻、判别深度度量、域自适应

专知

12+阅读 · 2018年5月18日

单位球面中极小超曲面的第一特征值的Yau的猜想

国家自然科学基金

0+阅读 · 2015年12月31日

Heisenberg 群上的 k-平面变换

国家自然科学基金

0+阅读 · 2015年12月31日

离散时间马氏链的泛函不等式及遍历性

国家自然科学基金

0+阅读 · 2014年12月31日

水声传感器网络的高效时间同步与定位

国家自然科学基金

0+阅读 · 2013年12月31日

Calderon问题和边界刚性问题

国家自然科学基金

0+阅读 · 2013年12月31日

空间通信片上网络数字系统形式化验证与可靠性分析

国家自然科学基金

0+阅读 · 2012年12月31日

k-稀疏恢复限制等距常数上界及压缩感知算法研究

国家自然科学基金

0+阅读 · 2012年12月31日

关于Cayley图的若干研究

国家自然科学基金

0+阅读 · 2012年12月31日

Heisenberg群上拟线性次椭圆方程解的多重性与正则性

国家自然科学基金

0+阅读 · 2011年12月31日

模-相对Hochschild同调与上同调

国家自然科学基金

0+阅读 · 2011年12月31日

Robust mean change point testing in high-dimensional data with heavy tails

Arxiv

0+阅读 · 2023年5月30日

The Isomorphism Problem of Power Graphs and a Question of Cameron

Arxiv

0+阅读 · 2023年5月30日

Information Theoretical Importance Sampling Clustering

Arxiv

0+阅读 · 2023年5月30日

On the Tightness of the Moment Accountant for DP-SGD

Arxiv

0+阅读 · 2023年5月30日

Optimal approximation of infinite-dimensional holomorphic functions

Arxiv

0+阅读 · 2023年5月29日

Fair Clustering via Hierarchical Fair-Dirichlet Process

Arxiv

0+阅读 · 2023年5月27日

On the Noise Sensitivity of the Randomized SVD

Arxiv

0+阅读 · 2023年5月27日

Optimizing NOTEARS Objectives via Topological Swaps

Arxiv

0+阅读 · 2023年5月26日

Sparse juntas on the biased hypercube

Arxiv

0+阅读 · 2023年5月26日

A dichotomy for succinct representations of homomorphisms

Arxiv

0+阅读 · 2023年5月26日

VIP会员

文章信息

相关主题

最新内容

ICML 2026 Spotlight | SmoothSMoE：解析稀疏 MoE 路由不连续

ICML 2026 Spotlight | SmoothSMoE：解析稀疏 MoE 路由不连续

专知会员服务

3+阅读 · 6月18日

综述 | 周期表视角下的大模型推理：范式、方法与失败模式

综述 | 周期表视角下的大模型推理：范式、方法与失败模式

专知会员服务

4+阅读 · 6月18日

《廉价自杀式无人机战争的军事战略影响：乌克兰和伊朗案例研究》

《廉价自杀式无人机战争的军事战略影响：乌克兰和伊朗案例研究》

专知会员服务

9+阅读 · 6月18日

《面向反无人机作战的联邦式可解释射频–光电/红外情报融合：边缘人工智能优化、电子战韧性及分布式监视验证》

《面向反无人机作战的联邦式可解释射频–光电/红外情报融合：边缘人工智能优化、电子战韧性及分布式监视验证》

专知会员服务

8+阅读 · 6月18日

ICML 2026 | FR3D：解耦自车运动的未来动态三维重建世界模型

ICML 2026 | FR3D：解耦自车运动的未来动态三维重建世界模型

专知会员服务

5+阅读 · 6月17日

【伯克利博士论文】迈向可扩展与自我演进的大语言模型智能体

【伯克利博士论文】迈向可扩展与自我演进的大语言模型智能体

专知会员服务

7+阅读 · 6月17日

学习数据的几何：形状空间分析数学综述

学习数据的几何：形状空间分析数学综述

专知会员服务

6+阅读 · 6月17日

《现代防空系统综述：架构、传感器、拦截器及新兴威胁环境对基础设施受限防御环境的影响》2026最新长综述

《现代防空系统综述：架构、传感器、拦截器及新兴威胁环境对基础设施受限防御环境的影响》2026最新长综述

专知会员服务

10+阅读 · 6月17日

定向能反无人机系统最新发展动态

定向能反无人机系统最新发展动态

专知会员服务

7+阅读 · 6月17日

从燃煤战舰到算法战争：水面指挥的永恒要求

从燃煤战舰到算法战争：水面指挥的永恒要求

专知会员服务

4+阅读 · 6月17日

《短程弹道再入飞行器拦截时间中的一项异常现象》

《短程弹道再入飞行器拦截时间中的一项异常现象》

专知会员服务

6+阅读 · 6月17日

《基于回归方法与任务上下文的对抗环境动态战术网络报文优先级排序》

《基于回归方法与任务上下文的对抗环境动态战术网络报文优先级排序》

专知会员服务

7+阅读 · 6月17日

美智库《战术级指挥控制的迫切要求：构建弹性机动式指挥控制网络》报告

美智库《战术级指挥控制的迫切要求：构建弹性机动式指挥控制网络》报告

专知会员服务

6+阅读 · 6月17日

《韩国国防政策与军备出口：韩国安全与国防政策如何塑造其国防工业与军备出口格局》最新100页报告

《韩国国防政策与军备出口：韩国安全与国防政策如何塑造其国防工业与军备出口格局》最新100页报告

专知会员服务

5+阅读 · 6月17日

ICML 2026 | VOTP：用视频基础模型与最优传输，让离线偏好强化学习只需少量反馈

ICML 2026 | VOTP：用视频基础模型与最优传输，让离线偏好强化学习只需少量反馈

专知会员服务

6+阅读 · 6月16日

相关VIP内容

【硬核书】规划算法 (Planning Algorithm)，1023页pdf，Steven M. Illinois大学

【硬核书】规划算法 (Planning Algorithm)，1023页pdf，Steven M. Illinois大学

专知会员服务

167+阅读 · 2022年4月10日

【CVPR 2022】基于本地正则化和稀疏化差分隐私的联邦学习，Differentially Private Federated Learning with Local Regularization and Sparsification

【CVPR 2022】基于本地正则化和稀疏化差分隐私的联邦学习，Differentially Private Federated Learning with Local Regularization and Sparsification

专知会员服务

17+阅读 · 2022年3月19日

【经典书】线性代数，436页pdf

专知会员服务

78+阅读 · 2021年3月16日

【AAAI2021】属性引导对抗训练的自然扰动鲁棒性

专知会员服务

26+阅读 · 2021年1月21日

【AAAI2021】克服图神经网络灾难性遗忘，Overcoming Catastrophic Forgetting in GNN

【AAAI2021】克服图神经网络灾难性遗忘，Overcoming Catastrophic Forgetting in GNN

专知会员服务

18+阅读 · 2020年12月15日

【北京大学】Locally Differentially Private (Contextual) Bandits Learning

【北京大学】Locally Differentially Private (Contextual) Bandits Learning

专知会员服务

13+阅读 · 2020年6月8日

【SIGMOD2020-CMU】在内存中搜索树的顺序保持键压缩，Order-Preserving Key Compression for In-Memory Search Trees

【SIGMOD2020-CMU】在内存中搜索树的顺序保持键压缩，Order-Preserving Key Compression for In-Memory Search Trees

专知会员服务

15+阅读 · 2020年3月7日

【新书】数字图像(影像)处理手第二版，2176pdf，Mathematical Methods in Imaging

【新书】数字图像(影像)处理手第二版，2176pdf，Mathematical Methods in Imaging

专知会员服务

93+阅读 · 2020年2月12日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

65+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

综述 | 周期表视角下的大模型推理：范式、方法与失败模式

《面向反无人机作战的联邦式可解释射频–光电/红外情报融合：边缘人工智能优化、电子战韧性及分布式监视验证》

ICML 2026 Spotlight | SmoothSMoE：解析稀疏 MoE 路由不连续

《廉价自杀式无人机战争的军事战略影响：乌克兰和伊朗案例研究》

相关资讯

VCIP 2022 Call for Demos

VCIP 2022 Call for Demos

CCF多媒体专委会

1+阅读 · 2022年6月6日

【泡泡一分钟】通过学习轮式里程计和IMU误差的定位

【泡泡一分钟】通过学习轮式里程计和IMU误差的定位

泡泡机器人SLAM

133+阅读 · 2019年9月12日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

PyTorch中在反向传播前为什么要手动将梯度清零？

PyTorch中在反向传播前为什么要手动将梯度清零？

极市平台

39+阅读 · 2019年1月23日

逆强化学习-学习人先验的动机

逆强化学习-学习人先验的动机

CreateAMind

16+阅读 · 2019年1月18日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

【论文推荐】最新六篇主题模型相关论文—领域特定知识库、神经变分推断、动态和静态主题模型

【论文推荐】最新六篇主题模型相关论文—领域特定知识库、神经变分推断、动态和静态主题模型

专知

19+阅读 · 2018年6月26日

【论文推荐】最新十篇度量学习相关论文—可量化表示、非线性度量学习、在线深度量学习、大间隔最近邻、判别深度度量、域自适应

【论文推荐】最新十篇度量学习相关论文—可量化表示、非线性度量学习、在线深度量学习、大间隔最近邻、判别深度度量、域自适应

专知

12+阅读 · 2018年5月18日

相关论文

Robust mean change point testing in high-dimensional data with heavy tails

Arxiv

0+阅读 · 2023年5月30日

The Isomorphism Problem of Power Graphs and a Question of Cameron

Arxiv

0+阅读 · 2023年5月30日

Information Theoretical Importance Sampling Clustering

Arxiv

0+阅读 · 2023年5月30日

On the Tightness of the Moment Accountant for DP-SGD

Arxiv

0+阅读 · 2023年5月30日

Optimal approximation of infinite-dimensional holomorphic functions

Arxiv

0+阅读 · 2023年5月29日

Fair Clustering via Hierarchical Fair-Dirichlet Process

Arxiv

0+阅读 · 2023年5月27日

On the Noise Sensitivity of the Randomized SVD

Arxiv

0+阅读 · 2023年5月27日

Optimizing NOTEARS Objectives via Topological Swaps

Arxiv

0+阅读 · 2023年5月26日

Sparse juntas on the biased hypercube

Arxiv

0+阅读 · 2023年5月26日

A dichotomy for succinct representations of homomorphisms

Arxiv

0+阅读 · 2023年5月26日

相关基金

单位球面中极小超曲面的第一特征值的Yau的猜想

国家自然科学基金

0+阅读 · 2015年12月31日

Heisenberg 群上的 k-平面变换

国家自然科学基金

0+阅读 · 2015年12月31日

离散时间马氏链的泛函不等式及遍历性

国家自然科学基金

0+阅读 · 2014年12月31日

水声传感器网络的高效时间同步与定位

国家自然科学基金

0+阅读 · 2013年12月31日

Calderon问题和边界刚性问题

国家自然科学基金

0+阅读 · 2013年12月31日

空间通信片上网络数字系统形式化验证与可靠性分析

国家自然科学基金

0+阅读 · 2012年12月31日

k-稀疏恢复限制等距常数上界及压缩感知算法研究

国家自然科学基金

0+阅读 · 2012年12月31日

关于Cayley图的若干研究

国家自然科学基金

0+阅读 · 2012年12月31日

Heisenberg群上拟线性次椭圆方程解的多重性与正则性

国家自然科学基金

0+阅读 · 2011年12月31日

模-相对Hochschild同调与上同调

国家自然科学基金

0+阅读 · 2011年12月31日

微信扫码咨询专知VIP会员