We train an identity verification architecture and evaluate modifications to the part of the model that combines audio and visual representations, including in scenarios where one input modality is missing from either of the two examples being compared. Results on the Voxceleb1-E test set suggest that averaging the output embeddings lowers the error rate both in the full-modality setting and when a single modality is missing, and makes more complete use of the embedding space than systems that use shared layers. We discuss possible reasons for this behavior.
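As a rough illustration of fusion by averaging output embeddings, the sketch below averages whichever per-modality embeddings are available and scores a pair of examples by cosine similarity. It is a minimal sketch and not the evaluated system: the function names (`fuse_by_averaging`, `cosine_similarity`), the L2 normalization applied before averaging, and the cosine scoring rule are all assumptions introduced for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumed preprocessing, not from the abstract)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def fuse_by_averaging(audio_emb=None, visual_emb=None):
    """Average the embeddings of the modalities that are present.

    Passing None for a modality models a missing input; the fused
    embedding is then just the remaining (normalized) embedding.
    """
    present = [l2_normalize(e) for e in (audio_emb, visual_emb) if e is not None]
    if not present:
        raise ValueError("at least one modality is required")
    dim = len(present[0])
    if any(len(e) != dim for e in present):
        raise ValueError("embeddings must share one dimensionality")
    return [sum(e[i] for e in present) / len(present) for i in range(dim)]


def cosine_similarity(a, b):
    """Hypothetical verification score: higher means more likely the same identity."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


# Example pair: the first example has both modalities, the second lacks video.
enrolled = fuse_by_averaging(audio_emb=[1.0, 0.0], visual_emb=[0.0, 1.0])
probe = fuse_by_averaging(audio_emb=[1.0, 0.0])
score = cosine_similarity(enrolled, probe)
```

Because averaging needs no fusion layer that expects both inputs, the same scoring path applies whether zero or one modality is missing from either example, under the assumption that both modality encoders emit embeddings of the same dimensionality.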