Motivated by the need to extract and summarize relevant information from short-sentence sources, such as satisfaction questionnaires, hotel reviews, and X/Twitter posts, we study the problem of clustering words hierarchically. In particular, we focus on clustering under horizontal and vertical structural constraints. Horizontal constraints are typically cannot-link and must-link constraints among words, while vertical constraints are precedence constraints among cluster levels. We overcome state-of-the-art bottlenecks by formulating the problem in two steps. First, we pose a soft-constrained regularized least-squares problem that guides the result of a sequential graph coarsening algorithm towards the horizontal feasible set. Second, we extract flat clusters from the resulting hierarchical tree by computing optimal cut heights based on the available constraints. We show that the resulting approach compares favorably with existing algorithms and is computationally light.
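To make the second step more concrete, the following is a minimal sketch of choosing a cut height from must-link and cannot-link constraints. It is not the method of the paper: it assumes plain average-linkage agglomerative clustering from SciPy as a stand-in for the sequential graph coarsening, a single global cut height, and a simple count of satisfied constraints as the objective. The toy data and the helper names `constraint_score` and `best_cut` are hypothetical and introduced only for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def constraint_score(labels, must_link, cannot_link):
    """Count how many pairwise constraints a flat labelling satisfies."""
    ok_must = sum(labels[i] == labels[j] for i, j in must_link)
    ok_cannot = sum(labels[i] != labels[j] for i, j in cannot_link)
    return int(ok_must + ok_cannot)


def best_cut(Z, must_link, cannot_link):
    """Scan candidate cut heights; keep the first one satisfying the most constraints."""
    heights = np.unique(Z[:, 2])  # sorted, distinct merge heights
    # One cut below the first merge, one between each pair of consecutive
    # merge heights, and one above the last merge.
    midpoints = (heights[:-1] + heights[1:]) / 2
    candidates = np.concatenate(([heights[0] / 2], midpoints, [heights[-1] + 1.0]))
    best = None
    for t in candidates:
        labels = fcluster(Z, t=t, criterion="distance")
        score = constraint_score(labels, must_link, cannot_link)
        if best is None or score > best[0]:
            best = (score, float(t), labels)
    return best


# Hypothetical one-dimensional "word embeddings" forming three tight pairs.
X = np.array([[0.0], [1.0], [10.0], [11.0], [50.0], [51.0]])
Z = linkage(X, method="average")  # assumption: stand-in for the hierarchical tree

must_link = [(0, 1), (2, 3)]    # pairs that should share a cluster
cannot_link = [(0, 2), (1, 4)]  # pairs that should be separated

score, height, labels = best_cut(Z, must_link, cannot_link)
print(score, height, labels)
```

On this toy input, the scan should settle on the cut that keeps the three pairs as separate clusters, since that is the only height at which all four constraints hold. Vertical (precedence) constraints and per-branch cut heights are not modeled in this sketch.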