Data contamination has become especially prevalent and challenging with the rise of models pretrained on very large, automatically crawled corpora. For closed models, the training data is a trade secret, and even for open models, it is not trivial to ascertain whether a particular test instance has been compromised. Strategies such as live leaderboards with hidden answers, or using test data that is guaranteed to be unseen, are expensive and become fragile with time. Assuming that all relevant actors value clean test data and will cooperate to mitigate data contamination, what can be done? We propose three strategies that can make a difference: (1) encrypt publicly released test data with a public key and license it to disallow derivative distribution; (2) demand training exclusion controls from closed API holders, and protect test data by refusing to evaluate until those demands are met; (3) for test data based on internet text, avoid data that appears together with its solution on the internet, and release the context of internet-derived data along with the data. These strategies are practical and can be effective in preventing data contamination and in enabling trustworthy evaluation of models' capabilities.