Text summarization is a fundamental task in natural language processing (NLP); with the information explosion, long-document processing has grown increasingly demanding, making effective summarization essential. Existing research focuses mainly on model improvements and sentence-level pruning but often overlooks global structure, disrupting coherence and weakening downstream performance. Some studies employ large language models (LLMs), which achieve higher accuracy but incur substantial resource and time costs. To address these issues, we introduce GloSA-sum, the first summarization approach to achieve global structure awareness via topological data analysis (TDA). GloSA-sum summarizes text efficiently while preserving semantic cores and logical dependencies. Specifically, we construct a semantic-weighted graph from sentence embeddings, on which persistent homology identifies core semantics and logical structures; these are preserved in a ``protection pool'' that serves as the backbone of the summary. We further design a topology-guided iterative strategy in which lightweight proxy metrics approximate sentence importance, avoiding repeated high-cost computations and thus preserving structural integrity while improving efficiency. To strengthen long-text processing, we propose a hierarchical strategy that integrates segment-level and global summarization. Experiments on multiple datasets demonstrate that GloSA-sum reduces redundancy while preserving semantic and logical integrity, strikes a balance between accuracy and efficiency, and benefits downstream LLM tasks by shortening contexts while retaining essential reasoning chains.
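To make the graph-plus-persistence idea concrete, the following is a minimal sketch, not the authors' implementation: it builds a cosine-distance graph over sentence embeddings, runs a 0-dimensional persistent-homology filtration (components merging via union-find as the distance threshold grows), cuts at the largest persistence gap, and keeps one medoid sentence per surviving cluster as a hypothetical ``protection pool''. The function names (`protection_pool`, `zero_dim_deaths`) and the gap-based cut are illustrative assumptions.

```python
import numpy as np

def cosine_distances(emb):
    """Pairwise cosine distances between sentence embeddings (rows)."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    d = 1.0 - emb @ emb.T
    np.fill_diagonal(d, 0.0)
    return d

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[rb] = ra
        return True

def zero_dim_deaths(dist):
    """0-dim persistence: every sentence is born at 0; a component 'dies'
    at the edge length where it merges into another. Returns the n-1
    finite death values in increasing order."""
    n = dist.shape[0]
    uf = UnionFind(n)
    edges = sorted((dist[i, j], i, j)
                   for i in range(n) for j in range(i + 1, n))
    return [d for d, i, j in edges if uf.union(i, j)]

def protection_pool(embeddings):
    """Illustrative proxy: cut the filtration at the largest gap between
    consecutive death values, cluster below that threshold, and keep each
    cluster's medoid sentence index."""
    dist = cosine_distances(embeddings)
    deaths = zero_dim_deaths(dist)
    if len(deaths) < 2:          # too few sentences to find a gap
        return list(range(dist.shape[0]))
    gaps = [deaths[k + 1] - deaths[k] for k in range(len(deaths) - 1)]
    k = int(np.argmax(gaps))
    threshold = 0.5 * (deaths[k] + deaths[k + 1])
    n = dist.shape[0]
    uf = UnionFind(n)
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i, j] < threshold:
                uf.union(i, j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(uf.find(i), []).append(i)
    pool = []
    for members in clusters.values():
        sub = dist[np.ix_(members, members)]
        pool.append(members[int(np.argmin(sub.sum(axis=1)))])  # medoid
    return sorted(pool)
```

With two well-separated semantic clusters of sentences, the pool retains one representative per cluster; the paper's full method additionally tracks logical dependencies and higher-order structure, which this sketch omits.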