Limited Linguistic Diversity in Embodied AI Datasets

Language plays a critical role in Vision-Language-Action (VLA) models, yet the linguistic characteristics of the datasets used to train and evaluate these systems remain poorly documented. In this work, we present a systematic audit of several widely used VLA corpora, aiming to characterize what kinds of instructions these datasets actually contain and how much linguistic variety they provide. We quantify instruction language along complementary dimensions, including lexical variety, duplication and overlap, semantic similarity, and syntactic complexity. Our analysis shows that many datasets rely on highly repetitive, template-like commands with limited structural variation, yielding a narrow distribution of instruction forms. We position these findings as descriptive documentation of the language signal available in current VLA training and evaluation data, intended to support more detailed dataset reporting, more principled dataset selection, and targeted curation or augmentation strategies that broaden language coverage.
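As a rough illustration of what the lexical-variety and duplication dimensions might look like in practice, the sketch below computes three common surface statistics over a list of instruction strings. It is not the authors' pipeline: the whitespace tokenizer, the function names (`type_token_ratio`, `duplicate_rate`, `distinct_n`), and the sample instructions are all assumptions made for illustration, and the abstract does not specify which metrics or tools the audit uses.

```python
from collections import Counter


def tokenize(instruction: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately simple, assumed tokenizer)."""
    return instruction.lower().split()


def type_token_ratio(instructions: list[str]) -> float:
    """Unique tokens divided by total tokens across all instructions."""
    tokens = [tok for text in instructions for tok in tokenize(text)]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def duplicate_rate(instructions: list[str]) -> float:
    """Fraction of instructions that repeat another one after normalization."""
    if not instructions:
        return 0.0
    normalized = [" ".join(tokenize(text)) for text in instructions]
    return 1.0 - len(set(normalized)) / len(normalized)


def distinct_n(instructions: list[str], n: int = 2) -> float:
    """Unique n-grams divided by total n-grams (n-grams do not cross instructions)."""
    ngrams = Counter()
    for text in instructions:
        toks = tokenize(text)
        ngrams.update(zip(*(toks[i:] for i in range(n))))
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0


# Hypothetical instructions, invented for this example only.
sample = [
    "pick up the red block",
    "pick up the blue block",
    "Pick up the red block",
    "open the drawer",
]

print(f"type-token ratio: {type_token_ratio(sample):.3f}")
print(f"duplicate rate:   {duplicate_rate(sample):.3f}")
print(f"distinct-2:       {distinct_n(sample, 2):.3f}")
```

Low values of the type-token ratio and distinct-n, together with a high duplicate rate, would be consistent with the template-like instruction sets the abstract describes. Semantic similarity and syntactic complexity would need an embedding model and a parser respectively, which this sketch leaves out.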

