Taiwanese opera (Kua-á-hì), a major local theatrical tradition, was extensively adapted for television, notably by pioneers such as Iûnn Lē-hua. The resulting recordings are potentially valuable for in-depth studies of Taiwanese opera, but they are often of low quality and require substantial manual effort during data preparation. To streamline this process, we developed an interactive system for real-time OCR correction and a two-step approach that integrates OCR-driven segmentation with Speech and Music Activity Detection (SMAD) to identify vocal segments in archival episodes efficiently and with high precision. The resulting dataset, consisting of vocal segments and their corresponding lyrics, can potentially support various MIR tasks such as lyrics identification and tune retrieval. Code is available at https://github.com/z-huang/ocr-subtitle-editor .
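The abstract does not spell out how the OCR-driven segmentation and SMAD steps are combined, so the following is only a minimal sketch of one plausible reading: subtitle time spans from OCR are merged into candidate segments, and a candidate is kept when enough of it overlaps SMAD music activity. The function names, the `max_gap` and `min_music_ratio` parameters, and the assumption that sung passages show up as music activity are all hypothetical and are not taken from the released code.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def merge_intervals(intervals: List[Interval], max_gap: float = 0.5) -> List[Interval]:
    """Merge intervals that overlap or are separated by at most max_gap seconds."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_ratio(segment: Interval, activity: List[Interval]) -> float:
    """Fraction of `segment` covered by non-overlapping `activity` intervals."""
    start, end = segment
    if end <= start:
        return 0.0
    covered = sum(
        max(0.0, min(end, a_end) - max(start, a_start))
        for a_start, a_end in activity
    )
    return covered / (end - start)


def select_vocal_segments(
    subtitle_intervals: List[Interval],  # step 1: spans where OCR found subtitles
    music_intervals: List[Interval],     # step 2: spans SMAD labelled as music (assumed)
    min_music_ratio: float = 0.8,        # hypothetical threshold
    max_gap: float = 0.5,                # hypothetical merge tolerance in seconds
) -> List[Interval]:
    """Keep subtitle-derived segments that mostly coincide with music activity."""
    candidates = merge_intervals(subtitle_intervals, max_gap)
    # Merge with zero gap so overlapping SMAD spans are not counted twice.
    music = merge_intervals(music_intervals, 0.0)
    return [seg for seg in candidates if overlap_ratio(seg, music) >= min_music_ratio]
```

Under these assumptions, spoken dialogue with subtitles but no music activity would be filtered out in the second step, while the lyrics attached to each surviving segment would come from the corrected OCR text.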