Syllable-level units offer compact and linguistically meaningful representations for spoken language modeling and unsupervised word discovery, but research on syllabification remains fragmented across disparate implementations, datasets, and evaluation protocols. We introduce findsylls, a modular, language-agnostic toolkit that unifies classical syllable detectors and end-to-end syllabifiers under a common interface for syllable segmentation, embedding extraction, and multi-granular evaluation. The toolkit implements and standardizes widely used methods (e.g., Sylber, VG-HuBERT) and allows their components to be recombined, enabling controlled comparisons of representations, algorithms, and token rates. We demonstrate findsylls on English and Spanish corpora and on new hand-annotated data from Kono, an underdocumented Central Mande language, illustrating how a single framework can support reproducible syllable-level experiments across both high-resource and under-resourced settings.