Provably Valid and Diverse Mutations of Real-World Media Data for DNN Testing

Deep neural networks (DNNs) often accept high-dimensional media data (e.g., photos, text, and audio) and interpret their perceptual content (e.g., a cat). Testing DNNs requires diverse inputs that can trigger mispredictions. Early works use byte-level mutations or domain-specific filters (e.g., fog), whose supported mutations may be limited and error-prone. State-of-the-art (SOTA) works employ deep generative models to generate (infinitely many) inputs. To keep the mutated inputs perceptually valid (e.g., a cat remains a "cat" after mutation), existing efforts rely on imprecise and poorly generalizable heuristics. This study revisits two key objectives in media input mutation, perception diversity (DIV) and validity (VAL), in a rigorous manner based on manifold theory, a well-developed theory that captures the perceptions of high-dimensional media data in a low-dimensional space. We show that DIV and VAL inextricably bound each other, and prove that SOTA generative model-based methods fundamentally fail to mutate real-world media data (sacrificing either DIV or VAL). In contrast, we discuss the feasibility of mutating real-world media data with provably high DIV and VAL based on manifolds. We concretize a technical solution that mutates media data of various formats (images, audio, text) in a unified manner based on manifolds. Specifically, once media data are projected onto a low-dimensional manifold, they can be mutated by walking on the manifold with certain directions and step sizes. Compared with the input data, the mutated data exhibit encouraging DIV in perceptual traits (e.g., a lying vs. a standing dog) while retaining reasonably high VAL (i.e., a dog remains a dog). We implement our techniques in DEEPWALK for testing DNNs. DEEPWALK outperforms prior methods in testing comprehensiveness and finds more error-triggering inputs of higher quality.
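
The idea of mutating data by walking on a manifold with a chosen direction and step size can be illustrated with a toy sketch. This is not DEEPWALK's implementation: the "manifold" here is assumed to be the unit circle embedded in 2-D space, and the names `encode`, `decode`, `walk`, and `off_manifold_distance` are hypothetical helpers introduced only for illustration. The sketch contrasts a walk along the manifold coordinate, which stays on the manifold, with a naive perturbation in the ambient space, which leaves it.

```python
import math

def encode(point):
    """Project a 2-D point onto the toy manifold (unit circle) as an angle."""
    x, y = point
    return math.atan2(y, x)

def decode(theta):
    """Map a manifold coordinate (angle) back to the ambient 2-D space."""
    return (math.cos(theta), math.sin(theta))

def walk(point, direction, step_size):
    """Mutate a point by moving along the manifold coordinate."""
    return decode(encode(point) + direction * step_size)

def off_manifold_distance(point):
    """Distance from the unit circle; 0 means the point is still on the manifold."""
    return abs(math.hypot(*point) - 1.0)

seed = (1.0, 0.0)

# Walk on the manifold: the step size controls how far the mutation moves,
# and the result stays on the circle.
mutated = walk(seed, direction=+1, step_size=0.5)

# Naive ambient-space perturbation, shown only for contrast.
naive = (seed[0] + 0.5, seed[1] + 0.5)

print("on-manifold walk :", mutated, off_manifold_distance(mutated))
print("ambient mutation :", naive, off_manifold_distance(naive))
```

In this toy setting, staying on the circle stands in for VAL and the distance travelled from the seed stands in for DIV; for real media data, the projection and its inverse would have to be learned rather than written in closed form.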
