Large-scale lyric corpora pose distinctive challenges for data-driven analysis, including the absence of reliable annotations, multilingual content, and heavy stylistic repetition. Most existing approaches rely on supervised classification, genre labels, or coarse document-level representations, which limits their ability to uncover latent semantic structure. We present a graph-based framework for the unsupervised discovery and evaluation of semantic communities in K-pop lyrics using line-level semantic representations. By constructing a similarity graph over lyric texts and applying community detection, we uncover stable micro-theme communities without genre, artist, or language supervision. We further identify boundary-spanning songs via graph-theoretic bridge metrics and analyze their structural properties. Across multiple robustness settings, boundary-spanning lyrics exhibit higher lexical diversity and lower repetition than core community members, challenging the assumption that hook intensity or repetition drives cross-theme connectivity. Our framework is language-agnostic and applicable to unlabeled cultural text corpora.
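To make the quantities named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not say which bridge metric or which diversity and repetition measures are used, so everything here is an assumption: the participation coefficient stands in for a "graph-theoretic bridge metric", the type-token ratio for lexical diversity, and the share of repeated lines for repetition. All function names and the toy graph are hypothetical, and whitespace tokenization is a simplification that would not hold for every language in a multilingual corpus.

```python
from collections import Counter


def participation_coefficient(adj, community):
    """Assumed bridge metric: 1 - sum_c (k_ic / k_i)^2 for each node i.

    adj: dict mapping node -> set of neighbouring nodes.
    community: dict mapping node -> community label.
    A score of 0 means all neighbours share one community; higher
    scores mean the node's edges are spread across communities.
    """
    scores = {}
    for node, neighbours in adj.items():
        if not neighbours:
            scores[node] = 0.0
            continue
        k = len(neighbours)
        counts = Counter(community[n] for n in neighbours)
        scores[node] = 1.0 - sum((c / k) ** 2 for c in counts.values())
    return scores


def type_token_ratio(lines):
    """Assumed lexical-diversity measure: distinct tokens / all tokens."""
    tokens = [tok for line in lines for tok in line.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def line_repetition_rate(lines):
    """Assumed repetition measure: share of lines repeating an earlier line."""
    if not lines:
        return 0.0
    return 1.0 - len(set(lines)) / len(lines)


# Hypothetical toy graph: two triangles of songs joined by the edge c-d.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}
community = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
scores = participation_coefficient(adj, community)
```

In this toy graph only the two songs on the connecting edge receive a non-zero score, which is the sense in which such a metric separates boundary-spanning songs from core community members. Comparing `type_token_ratio` and `line_repetition_rate` between the two groups would then mirror the comparison the abstract reports.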