Data contamination has become prevalent and challenging with the rise of models pretrained on large, automatically crawled corpora. For closed models, the training data is a trade secret, and even for open models, detecting contamination is not trivial. Strategies such as leaderboards with hidden answers, or test data that is guaranteed to be unseen, are expensive and become fragile over time. Assuming that all relevant actors value clean test data and will cooperate to mitigate data contamination, what can be done? We propose three strategies that can make a difference: (1) encrypt test data that is made public with a public key, and license it to disallow derivative distribution; (2) demand training exclusion controls from closed API holders, and protect your test data by refusing to evaluate without them; (3) avoid data that appears with its solution on the internet, and release the web-page context of internet-derived data along with the data. These strategies are practical and can be effective in preventing data contamination.
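As an illustration of strategy (3), the following is a minimal sketch, not a method taken from the paper, of how a dataset builder might store each internet-derived item together with its web-page context and flag items whose solution appears in that context. The record type `BenchmarkItem`, its fields, and the helpers `solution_in_context` and `split_by_exposure` are hypothetical names introduced only for this example. The substring check is a deliberately crude assumption: very short answers can match by accident, and paraphrased solutions will be missed.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkItem:
    """One internet-derived test item, released with its web-page context."""
    question: str
    answer: str
    source_url: str    # hypothetical field: page the item was taken from
    page_context: str  # hypothetical field: text surrounding the item on that page


def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace so formatting differences do not hide a match.
    return " ".join(text.lower().split())


def solution_in_context(item: BenchmarkItem) -> bool:
    """Return True if the answer string occurs in the item's page context."""
    return _normalize(item.answer) in _normalize(item.page_context)


def split_by_exposure(items):
    """Partition items into (clean, exposed) lists using solution_in_context."""
    clean = [it for it in items if not solution_in_context(it)]
    exposed = [it for it in items if solution_in_context(it)]
    return clean, exposed
```

Under these assumptions, items returned in `exposed` would be candidates for exclusion or manual review, while the `page_context` field is what would be released alongside the data so that others can repeat the check.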