Temporal sequences have become pervasive in real-world applications. Consequently, the volume of data generated in the form of continuous-time event sequences (CTESs) has increased exponentially in the past few years. A significant fraction of ongoing research on CTES datasets therefore involves designing models for downstream tasks such as next-event prediction, long-term forecasting, and sequence classification. Recent developments in predictive modeling with marked temporal point processes (MTPPs) have enabled accurate characterization of several real-world applications involving CTESs. However, owing to the complex nature of CTES datasets, the task of large-scale retrieval of temporal sequences has been overlooked in the literature. By CTES retrieval we mean that, given an input query sequence, a retrieval system must return a ranked list of relevant sequences from a large corpus. To tackle this task, we propose NeuroSeqRet, a first-of-its-kind framework designed specifically for end-to-end CTES retrieval. NeuroSeqRet introduces multiple enhancements over standard retrieval frameworks. It first applies a trainable unwarping function to the query sequence, which makes the query comparable with corpus sequences, especially when a relevant query-corpus pair has individually different attributes. Next, it feeds the unwarped query sequence and the corpus sequence into MTPP-guided neural relevance models. We develop four variants of the relevance model for different kinds of applications, based on the trade-off between accuracy and efficiency. We also propose an optimization framework that learns binary sequence embeddings from the relevance scores, suitable for locality-sensitive hashing. Our experiments show a significant accuracy boost for NeuroSeqRet as well as the efficacy of our hashing mechanism.
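To make the hashing step more concrete, the following is a minimal sketch of how binary sequence embeddings could be indexed and searched with locality-sensitive hashing. It is an illustration under stated assumptions, not NeuroSeqRet's implementation: the sign-based `binarize` function stands in for the learned binary embeddings, bit sampling is a standard LSH family for Hamming distance that may differ from the scheme used in the paper, and all names (`binarize`, `hamming`, `BitSamplingLSH`) are hypothetical.

```python
import random
from collections import defaultdict


def binarize(vec):
    """Map a real-valued embedding to a tuple of bits (assumed stand-in for learned codes)."""
    return tuple(1 if x >= 0 else 0 for x in vec)


def hamming(a, b):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


class BitSamplingLSH:
    """Bit-sampling LSH: each hash table keys on a random subset of bit positions."""

    def __init__(self, n_bits, n_tables=4, bits_per_table=4, seed=0):
        rng = random.Random(seed)
        # One random subset of bit positions per table.
        self.masks = [
            tuple(sorted(rng.sample(range(n_bits), bits_per_table)))
            for _ in range(n_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(n_tables)]
        self.codes = {}

    @staticmethod
    def _key(code, mask):
        return tuple(code[i] for i in mask)

    def add(self, item_id, code):
        """Store a corpus code in every table under its sampled-bit key."""
        self.codes[item_id] = code
        for mask, table in zip(self.masks, self.tables):
            table[self._key(code, mask)].append(item_id)

    def query(self, code, top_k=3):
        """Collect items sharing a bucket with the query, then rank by Hamming distance."""
        candidates = set()
        for mask, table in zip(self.masks, self.tables):
            candidates.update(table.get(self._key(code, mask), []))
        ranked = sorted(candidates, key=lambda i: (hamming(code, self.codes[i]), i))
        return ranked[:top_k]


# Usage: index two corpus codes, then look up a query whose code matches the first.
index = BitSamplingLSH(n_bits=8)
index.add("a", binarize([0.5, -1, 2, -3, 1, 1, -1, -1]))
index.add("b", binarize([-0.5, 1, -2, 3, -1, -1, 1, 1]))
result = index.query(binarize([0.9, -0.1, 0.3, -2, 4, 0.2, -5, -1]))
```

In a full pipeline, the short candidate list returned by `query` would typically be re-scored with the more expensive relevance model, so that the hash lookup only narrows the corpus and does not decide the final ranking.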