SERFN: Sample-Efficient Real-World Dexterous Policy Fine-Tuning via Action-Chunked Critics and Normalizing Flows

Fine-tuning dexterous manipulation policies in the real world remains challenging because interaction budgets are limited and action distributions are highly multimodal. Diffusion-based policies are expressive, but their action likelihoods are intractable, which rules out conservative likelihood-based updates during fine-tuning. Conventional Gaussian policies, in contrast, collapse under multimodality, particularly when actions are executed in chunks, and standard per-step critics are misaligned with chunked execution, leading to poor credit assignment. We present SERFN, a sample-efficient off-policy fine-tuning framework built on a normalizing flow (NF) policy and an action-chunked critic. The NF policy yields exact likelihoods for multimodal action chunks, allowing conservative, stable policy updates through likelihood regularization and thereby improving sample efficiency. The action-chunked critic evaluates entire action sequences, aligning value estimation with the policy's temporal structure and improving long-horizon credit assignment. To our knowledge, this is the first demonstration of a likelihood-based, multimodal generative policy combined with chunk-level value learning on real robotic hardware. We evaluate SERFN on two challenging real-world dexterous manipulation tasks: cutting tape with scissors retrieved from a case, and in-hand cube rotation with a palm-down grasp. Both require precise, dexterous control over long horizons. On these tasks, SERFN achieves stable, sample-efficient adaptation where standard methods struggle.
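The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the flow architecture, the regularizer, or the critic target, so the function names, the single elementwise affine layer, and the H-step bootstrapped target below are all assumptions chosen for illustration. A single affine layer is still unimodal. Practical flows stack many invertible layers to obtain multimodality, but the likelihood is computed with the same change-of-variables rule.

```python
import math


def affine_flow_log_prob(action_chunk, shift, log_scale):
    """Exact log-density of a flattened action chunk under an affine flow.

    Hypothetical one-layer flow: a = shift + exp(log_scale) * z, z ~ N(0, I).
    Change of variables gives log p(a) = log N(z; 0, I) - sum(log_scale).
    """
    log_prob = 0.0
    for a, mu, ls in zip(action_chunk, shift, log_scale):
        z = (a - mu) * math.exp(-ls)  # invert the flow to recover base noise
        log_prob += -0.5 * (z * z + math.log(2.0 * math.pi))  # base log-density
        log_prob -= ls  # log|det Jacobian| of the forward map
    return log_prob


def chunked_td_target(rewards, next_chunk_value, gamma):
    """Assumed TD target for a critic that scores an entire H-step chunk.

    Sums the discounted rewards collected while the chunk executes, then
    bootstraps from the value of the next chunk, discounted by gamma**H.
    """
    horizon = len(rewards)
    discounted = sum(gamma ** k * r for k, r in enumerate(rewards))
    return discounted + gamma ** horizon * next_chunk_value
```

Because `affine_flow_log_prob` returns an exact value, it could be used directly in a likelihood-regularization term, which is the property the abstract contrasts with diffusion policies. `chunked_td_target` shows one way a critic can be aligned with chunked execution: the bootstrap happens once per chunk rather than once per step.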
