Latent Space Reinforcement Learning for Inverse Material Estimation in Food Fracture Simulation

Realistic visual simulation of food manipulation requires accurate material parameters, yet these are difficult to measure directly and vary across the heterogeneous regions of a single food item. We address the inverse problem of estimating material parameters from a target description of fracture behavior in a non-differentiable continuum damage mechanics simulator. Using orange peeling as a test case, we train a neural surrogate on 2,000 forward simulations and compare the Covariance Matrix Adaptation Evolution Strategy (CMA-ES, a gradient-free evolutionary optimizer) with Proximal Policy Optimization (PPO, a reinforcement learning algorithm) across the original 9-dimensional parameter space and two learned 4-dimensional latent representations. Because different oranges have different material properties, a practical inverse system must handle arbitrary targets without retraining. We therefore train a goal-conditioned PPO policy that learns a general inverse mapping: given any target description of peeling behavior, the policy produces a material parameter estimate in a single forward pass (8 surrogate evaluations, approximately 10 ms). Operating in a normalizing-flow latent space with a shared surrogate evaluator, the goal-conditioned policy achieves a recovery score of 0.642 when its estimates are validated in the simulator itself rather than the surrogate, a 23% improvement over operating in the original parameter space. A warm-start extension that initializes CMA-ES refinement from the policy's output further improves recovery to 0.828 using 540 evaluations. These findings provide a practical framework for inverse food physics and lay the groundwork for vision-driven material identification from video observations of food manipulation.
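The warm-start idea can be illustrated with a minimal sketch, under loud assumptions: `toy_surrogate`, `loss`, and `refine` are hypothetical names, the 3-parameter toy model stands in for the neural surrogate, and the optimizer is a simplified isotropic evolution strategy rather than CMA-ES (it has no covariance adaptation). The only point shown is that refinement starts from a policy-style initial guess and stays within a fixed evaluation budget.

```python
import random


def toy_surrogate(params):
    """Hypothetical stand-in for the neural surrogate: parameters -> descriptor."""
    a, b, c = params
    return [a + 0.5 * b, b * c, a - c]


def loss(params, target):
    """Squared error between the predicted and target descriptors."""
    pred = toy_surrogate(params)
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def refine(start, target, budget, sigma=0.3, popsize=12, seed=0):
    """Simplified (mu, lambda) evolution strategy with isotropic Gaussian sampling.

    Not CMA-ES: there is no covariance adaptation, only a decaying step size.
    Returns (best_params, best_loss) using at most `budget` loss evaluations.
    """
    rng = random.Random(seed)
    mean = list(start)
    best, best_loss = list(start), loss(start, target)
    evals = 1  # the starting point itself costs one evaluation
    mu = popsize // 4  # number of elite samples averaged into the new mean
    while evals + popsize <= budget:
        pop = [[m + sigma * rng.gauss(0.0, 1.0) for m in mean]
               for _ in range(popsize)]
        scored = sorted((loss(x, target), x) for x in pop)
        evals += popsize
        if scored[0][0] < best_loss:
            best_loss, best = scored[0]
        elite = [x for _, x in scored[:mu]]
        mean = [sum(col) / mu for col in zip(*elite)]
        sigma *= 0.95  # shrink the search radius each generation
    return best, best_loss


# Illustrative values only; none of these numbers come from the paper.
target = toy_surrogate([0.8, 0.4, 0.6])
policy_guess = [0.7, 0.5, 0.5]  # pretend this came from the policy
warm_params, warm_loss = refine(policy_guess, target, budget=540)
cold_params, cold_loss = refine([0.0, 0.0, 0.0], target, budget=540)
```

Because the best-so-far candidate is initialized from the starting point, refinement can never return something worse than the initial guess; comparing `warm_loss` with `cold_loss` gives a feel for how much the starting point matters under the same budget.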
