Prevailing image representations, whether explicit (e.g., raster images and Gaussian primitives) or implicit (e.g., latent images), have clear limitations: explicit representations suffer from redundancy that makes manual editing laborious, while implicit ones lack a direct mapping from latent variables to semantic instances or parts, making fine-grained manipulation difficult. These limitations hinder efficient, controllable image and video editing. To address them, we propose a hierarchical proxy-based parametric image representation that disentangles semantic, geometric, and textural attributes into independent, manipulable parameter spaces. Starting from a semantic-aware decomposition of the input image, our representation builds hierarchical proxy geometries through adaptive Bézier fitting and iterative subdivision and meshing of interior regions. Multi-scale implicit texture parameters are embedded in the resulting geometry-aware, spatially distributed proxy nodes, enabling continuous high-fidelity reconstruction in the pixel domain and semantic editing at the instance or part level. In addition, we introduce a locality-adaptive feature-indexing mechanism that ensures spatial texture coherence and further supports high-quality background completion without relying on generative models. Extensive experiments on image reconstruction and editing benchmarks, including ImageNet, OIR-Bench, and HumanEdit, show that our method achieves state-of-the-art rendering fidelity with significantly fewer parameters while enabling intuitive, interactive, and physically plausible manipulation. Moreover, by coupling proxy nodes with Position-Based Dynamics, our framework supports real-time physics-driven animation through lightweight implicit rendering, achieving better temporal consistency and visual realism than generative approaches.