High-quality curated datasets are essential for training and evaluating AI approaches, but they are often lacking in embodied interactive domains where language and physical action are intertwined. In particular, few datasets capture how people acquire motor skills in embodied tasks through verbal instruction over time. To address this gap, we introduce SimCoachCorpus: a unique dataset of race car simulator driving that enables the investigation of rich phenomena during guided and unguided motor skill acquisition. In this dataset, 29 participants were asked to drive around a race track in a driving simulator for approximately ninety minutes. Fifteen participants received one-on-one instruction from a professional performance driving coach, and 14 participants drove without coaching instruction. SimCoachCorpus includes features such as vehicle state and inputs, map (track boundaries and race-line), and cone landmarks. These features are synchronized with the coach's concurrent verbal feedback and with additional terminal feedback given at the end of each lap. We also provide high-quality annotations of high-level coaching categories for each concurrent feedback utterance, ratings of students' compliance with coaching advice, and participants' self-reported cognitive load and emotional state (gathered from surveys during the study). The final dataset includes over 20,000 concurrent feedback utterances, over 400 terminal feedback utterances, and over 40 hours of interactive driving data. Our naturalistic interactive dataset can be used to investigate motor learning dynamics, explore linguistic phenomena, and train computational models of teaching and learning. We demonstrate applications of this dataset for in-context learning, imitation learning, and topic modeling. Data is hosted at https://doi.org/10.7910/DVN/W7VTKZ and code is available at https://github.com/ToyotaResearchInstitute/sim_coach_corpus