TalkBank is an online database that facilitates the sharing of linguistics research data. However, the existing TalkBank API has limited data filtering and batch processing capabilities. To overcome these limitations, this paper introduces a pipeline framework that employs a hierarchical search approach, enabling efficient selection of data under complex criteria. The approach first performs a quick preliminary screening of the corpora a researcher may need, and then performs an in-depth search for target data based on specific criteria. The identified files are then indexed, making them easier to access in future analyses. Furthermore, the paper demonstrates how data from different studies curated with the framework can be integrated by standardizing and cleaning metadata, allowing researchers to extract insights from a large, integrated dataset. Although designed for TalkBank, the framework can also be adapted to process data from other open-science platforms.
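The hierarchical search described above can be illustrated with a minimal sketch. It is not the paper's implementation and does not call the TalkBank API: the `Corpus` and `TranscriptFile` classes, their fields, the corpus names, and the file paths are all hypothetical, and the data is held in memory. The sketch only shows the idea of a cheap corpus-level screening pass followed by a detailed file-level search whose results are stored in an index.

```python
from dataclasses import dataclass, field


@dataclass
class TranscriptFile:
    # Hypothetical file-level metadata; real transcripts carry far more.
    path: str
    participant_age_months: int
    language: str


@dataclass
class Corpus:
    # Hypothetical corpus-level metadata used for the coarse screening.
    name: str
    collection: str
    files: list = field(default_factory=list)


def screen_corpora(corpora, collection):
    """Stage 1: cheap pass that looks only at corpus-level metadata."""
    return [c for c in corpora if c.collection == collection]


def search_files(corpora, predicate):
    """Stage 2: detailed pass over the files of the surviving corpora."""
    return [(c.name, f) for c in corpora for f in c.files if predicate(f)]


def build_index(matches):
    """Map each corpus name to the paths of its matching files."""
    index = {}
    for corpus_name, f in matches:
        index.setdefault(corpus_name, []).append(f.path)
    return index


# Toy data, invented for illustration only.
corpora = [
    Corpus("CorpusA", "child", [
        TranscriptFile("CorpusA/01.cha", 24, "eng"),
        TranscriptFile("CorpusA/02.cha", 48, "eng"),
    ]),
    Corpus("CorpusB", "aphasia", [
        TranscriptFile("CorpusB/01.cha", 700, "eng"),
    ]),
    Corpus("CorpusC", "child", [
        TranscriptFile("CorpusC/01.cha", 30, "spa"),
    ]),
]

candidates = screen_corpora(corpora, collection="child")
matches = search_files(
    candidates,
    lambda f: f.language == "eng" and f.participant_age_months <= 36,
)
index = build_index(matches)
print(index)  # {'CorpusA': ['CorpusA/01.cha']}
```

Running the coarse pass first means the more expensive file-level predicate is evaluated only on corpora that can possibly contain relevant data, and the resulting index can be saved and reused without repeating either stage.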