With the advent of wearable recorders, scientists are increasingly turning to automated analysis of audio and video data to measure children's experience, behavior, and outcomes, and a sizable literature uses long-form audio recordings to study language acquisition. While numerous articles report on the accuracy and reliability of the most popular automated classifiers, less has been written on the downstream effects of classification errors on measurements and statistical inferences (e.g., estimates of correlations and of effect sizes in regressions). This paper's main contributions are to draw attention to the downstream effects of confusion errors and to provide an approach for measuring, and potentially recovering from, these errors. Specifically, we use a Bayesian approach to study the effects of algorithmic errors on key scientific questions, including the effect of siblings on children's language experience and the association between children's production and their input. By fitting a joint model of speech behavior and algorithm behavior to real and simulated data, we show that classification errors can significantly distort estimates for both the most commonly used system, LENA, and a slightly more accurate open-source alternative (the Voice Type Classifier from the ACLEW system). We further show that a Bayesian calibration approach for recovering unbiased estimates of effect sizes can be effective and insightful, but does not provide a foolproof solution.