We present HyperSeg, a hyperdimensional computing (HDC) approach to unsupervised dialogue topic segmentation. HDC is a class of vector symbolic architectures that leverages the probabilistic orthogonality of randomly drawn vectors at extremely high dimensions (typically over 10,000). HDC generates rich token representations through its low-cost initialization of many unrelated vectors. This is especially beneficial in topic segmentation, which often operates as a resource-constrained pre-processing step for downstream transcript understanding tasks. HyperSeg outperforms the current state-of-the-art in 4 out of 5 segmentation benchmarks -- even when baselines are given partial access to the ground truth -- and is 10 times faster on average. We show that HyperSeg also improves downstream summarization accuracy. With HyperSeg, we demonstrate the viability of HDC in a major language task. We open-source HyperSeg to provide a strong baseline for unsupervised topic segmentation.
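The probabilistic orthogonality that HDC relies on can be illustrated with a short sketch. This is not HyperSeg's implementation: the bipolar {-1, +1} encoding, the majority-vote bundling, and all function names below are assumptions chosen for illustration, reflecting common HDC conventions rather than anything stated in the abstract. The sketch shows that independently drawn 10,000-dimensional vectors have cosine similarity near zero, while a bundle of several vectors stays measurably similar to each of its components.

```python
import random

DIM = 10_000  # dimensionality; the abstract cites "typically over 10,000"


def random_hypervector(rng, dim=DIM):
    """Draw a bipolar hypervector with i.i.d. entries in {-1, +1} (assumed encoding)."""
    return [rng.choice((-1, 1)) for _ in range(dim)]


def cosine(u, v):
    """Cosine similarity for bipolar vectors, whose norms both equal sqrt(dim)."""
    return sum(a * b for a, b in zip(u, v)) / len(u)


def bundle(vectors):
    """Superpose vectors by elementwise majority vote (assumed bundling rule).

    Pass an odd number of vectors so that no coordinate can tie.
    """
    return [1 if sum(column) > 0 else -1 for column in zip(*vectors)]


def demo(seed=0):
    rng = random.Random(seed)
    a, b, c, unrelated = (random_hypervector(rng) for _ in range(4))
    combined = bundle([a, b, c])
    print("two random vectors:      ", cosine(a, b))
    print("bundle vs. its component:", cosine(combined, a))
    print("bundle vs. unrelated:    ", cosine(combined, unrelated))


if __name__ == "__main__":
    demo()
```

Because each random vector costs only one pass over its coordinates to create, initializing one such vector per token is cheap, which is the property the abstract points to when it describes the low-cost initialization of many unrelated vectors.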