Machine Learning as a Service (MLaaS) is a popular cloud-based solution for customers who want to use an ML model but lack the training data, computational resources, or ML expertise to build one. In this setting, the training datasets are typically privately held by the ML or data companies and are inaccessible to customers, yet customers still need a way to confirm that the training datasets meet their expectations and satisfy regulatory requirements such as fairness. No existing work addresses these concerns. This work is the first attempt to solve this problem, taking data origin as the entry point. We first define an origin membership measurement and, based on it, define diversity and fairness metrics that address customers' concerns. We then propose a strategy to estimate the values of these two metrics for the inaccessible training dataset, combining shadow training techniques from membership inference with an efficient featurization scheme from multiple instance learning. We evaluate the approach on text review polarity classification based on the BERT language model. Experimental results show that our solution achieves up to 0.87 accuracy for membership inspection and up to 99.3% confidence in inspecting the diversity and fairness distribution.