This project was our entry to Task 6 of the DCASE 2022 Competition, which had two subtasks: (1) Automated Audio Captioning and (2) Language-Based Audio Retrieval. The first subtask is to generate a textual description of an audio sample, while the second is to find the audio samples within a fixed dataset that match a given description. Both subtasks used the Clotho dataset. Models were evaluated on BLEU1, BLEU2, BLEU3, ROUGEL, METEOR, CIDEr, SPICE, and SPIDEr scores for audio captioning, and on R1, R5, R10, and mARP10 scores for audio retrieval. We conducted several experiments that modify the baseline models for these tasks. Our final architecture for Automated Audio Captioning performs close to the baseline, while our model for Language-Based Audio Retrieval surpasses its baseline.
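As a rough illustration of how the retrieval scores can be read, the sketch below computes recall at rank k from a caption-to-audio similarity matrix, assuming R1, R5, and R10 denote recall at ranks 1, 5, and 10. The function name `recall_at_k`, the toy scores, and the assumption of exactly one relevant clip per caption are ours for illustration; this is not the official evaluation code.

```python
from typing import Sequence


def recall_at_k(
    scores: Sequence[Sequence[float]],
    relevant: Sequence[int],
    k: int,
) -> float:
    """Fraction of queries whose relevant clip is among the k best-scoring clips.

    scores[i][j] is a hypothetical similarity between caption i and clip j;
    relevant[i] is the index of the single clip that caption i describes
    (an assumption made for this sketch).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        raise ValueError("at least one query is required")
    hits = 0
    for query_scores, target in zip(scores, relevant):
        # Rank clip indices by descending similarity and keep the top k.
        ranked = sorted(
            range(len(query_scores)),
            key=lambda j: query_scores[j],
            reverse=True,
        )
        if target in ranked[:k]:
            hits += 1
    return hits / len(relevant)


# Toy example with made-up numbers: 3 captions scored against 4 clips.
toy_scores = [
    [0.9, 0.1, 0.3, 0.2],  # relevant clip 0 is ranked 1st
    [0.4, 0.2, 0.8, 0.5],  # relevant clip 1 is ranked 4th
    [0.1, 0.6, 0.5, 0.7],  # relevant clip 2 is ranked 3rd
]
toy_relevant = [0, 1, 2]

if __name__ == "__main__":
    for k in (1, 3, 4):
        print(f"R{k} = {recall_at_k(toy_scores, toy_relevant, k):.3f}")
```

On the toy data, only one of the three captions has its clip ranked first, so recall rises as k grows and more of the relevant clips fall inside the top-k window.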