AMUSE: Adaptive Multi-Segment Encoding for Dataset Watermarking

Curating high-quality datasets, which play a key role in the emergence of new AI applications, requires considerable time, money, and computational resources. Effective ownership protection of datasets is therefore becoming critical. Recently, imperceptible watermarking techniques have been used to protect the ownership of an image dataset by embedding ownership information (i.e., a watermark) into the individual image samples. Embedding the entire watermark into all samples leads to significant redundancy in the embedded information, which degrades both the quality of the watermarked dataset and the extraction accuracy. In this paper, a multi-segment encoding-decoding method for dataset watermarking (called AMUSE) is proposed to adaptively map the original watermark into a set of shorter sub-messages and vice versa. Our message encoder is adaptive: it adjusts the length of the sub-messages according to the protection requirements for the target dataset. Existing image watermarking methods are then employed to embed the sub-messages into the original images in the dataset and to extract them from the watermarked images. Our decoder then reconstructs the original message from the extracted sub-messages. The proposed encoder and decoder are plug-and-play modules that can easily be added to any watermarking method. Extensive experiments are performed with multiple watermarking solutions, which show that applying AMUSE improves the overall message extraction accuracy by up to 28% at the same dataset quality. Furthermore, for one of the tested image watermarking methods, AMUSE raises the image dataset quality by $\approx$2 dB PSNR on average while also improving the extraction accuracy.
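The abstract does not specify how the AMUSE encoder and decoder are built, so the following is only an illustrative sketch of the general idea of multi-segment encoding, not the authors' method. It assumes the watermark is a bit string, splits it into fixed-length payloads, prefixes each payload with a segment index, and reconstructs the message by per-bit majority vote over the copies extracted from the dataset. The names `encode_segments`, `decode_segments`, and `payload_len` are hypothetical, and the image watermarking step itself is simulated by passing the sub-messages through directly.

```python
from collections import defaultdict
from typing import List


def encode_segments(message: str, payload_len: int) -> List[str]:
    """Split a bit string into indexed sub-messages (hypothetical scheme)."""
    n_seg = -(-len(message) // payload_len)          # ceiling division
    idx_bits = max(1, (n_seg - 1).bit_length())      # bits needed for the index
    padded = message.ljust(n_seg * payload_len, "0") # zero-pad the last segment
    return [
        format(i, "0{}b".format(idx_bits))
        + padded[i * payload_len:(i + 1) * payload_len]
        for i in range(n_seg)
    ]


def decode_segments(extracted: List[str], msg_len: int, payload_len: int) -> str:
    """Rebuild the message by per-bit majority vote within each segment."""
    n_seg = -(-msg_len // payload_len)
    idx_bits = max(1, (n_seg - 1).bit_length())
    groups = defaultdict(list)
    for sub in extracted:
        idx = int(sub[:idx_bits], 2)
        if idx < n_seg:                              # ignore corrupted indices
            groups[idx].append(sub[idx_bits:])
    bits = []
    for i in range(n_seg):
        copies = groups.get(i, [])
        for pos in range(payload_len):
            ones = sum(c[pos] == "1" for c in copies)
            bits.append("1" if 2 * ones > len(copies) else "0")
    return "".join(bits)[:msg_len]


# Simulated use: 12 images, sub-messages assigned round-robin.
message = "1011001110001111"
subs = encode_segments(message, payload_len=4)
extracted = [subs[i % len(subs)] for i in range(12)]
# Simulate one extraction error by flipping the last bit of the first copy.
extracted[0] = extracted[0][:-1] + ("0" if extracted[0][-1] == "1" else "1")
recovered = decode_segments(extracted, len(message), payload_len=4)
print(recovered == message)
```

In this toy setting each image carries 6 bits instead of 16, which reflects the trade-off the abstract describes: shorter sub-messages per image in exchange for reassembling the watermark across the dataset. How AMUSE actually chooses the sub-message length is not stated in the abstract.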

