situated
状况相关的
https://arxiv.org/abs/2502.19417
Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models
Hi Robot 使机器人能够遵循多阶段指令,适配实时修正和约束,完成没见过的 long-horizon 任务,并在需要时进行口头回应。
策略 | 高层级策略(推理) | 低层级策略(控制) |
---|---|---|
作用 | 处理开放式任务提示、图像观测和用户命令,整合用户反馈,输出口头响应和技能标签;向习得的低层级策略提供 atomic 命令(低层级语言命令)。 | 响应更简单的低层级语言命令,使用这些命令、图像和机器人状态生成动作块以及可选的口头响应。 |
模型 | 视觉-语言模型(VLM) | 视觉-语言-动作(VLA)模型 |
优势情形 | 提示 $\ell_t$ 太复杂、低层级策略无法解析,或在机器人数据的环境中太不熟悉,或涉及与用户的复杂交互。 | 简单和熟悉的任务 |
职责 | 接收整体任务提示 $\ell_t$ 和以图像、用户交互形式给出的相应上下文,并将其转换为低层级策略可理解的、适合机器人此时执行的任务,用 $\hat\ell_t$ 表示。 | |
系统 1 / 系统 2 类比 | 更审慎的"系统 2"涉及更高层级的推理,以解析复杂的 long-horizon 任务、解析反馈,并决定适当的行动方案。 | "自动的"系统 1 对应于能够通过触发预先习得的技能来执行直接命令的策略。 |
实现 | 慎重的"系统 2"层采用高层级 VLM 策略的形式,利用 web 规模预训练的语义和视觉知识,通过复杂的提示和用户交互进行推理。 | 物理的、反应性的"系统 1"层也采用 VLM 的形式,经过训练可直接输出机器人动作,以响应描述 atomic 行为的简单命令。两个 VLM 架构几乎相同(PaliGemma-3B),唯一区别是低层级策略使用流匹配来输出动作。 |
〔🟩 按照这个思路:在混合架构中互补(如 Flow Matching 生成粗解,Diffusion Policy 细化)。结合本文的分层策略,是不是在低层级策略那里在流匹配后增加 Diffusion Policy 或其变体进一步 refine actions 建模,会好些?〕
摘要
Generalist robots that can perform a range of different tasks in open-world settings must be able to not only reason about the steps needed to accomplish their goals, but also process complex instructions, prompts, and even feedback during task execution.
能够在开放世界环境中执行一系列不同任务的 Generalist robots 不仅必须能够 推理 完成目标所需的 步骤,还必须能够在任务执行过程中处理复杂的指令、提示甚至反馈。
Intricate instructions (e.g., "Could you make me a vegetarian sandwich?” or “I don’t like that one”) require not just the ability to physically perform the individual steps, but the ability to situate complex commands and feedback in the physical world.
复杂的指令 (例如,“你能给我做一个素食三明治吗?” 或 “我不喜欢那个”) 不仅需要能够实际执行单独的步骤,还需要能够将复杂的命令和反馈置于物理世界中。
In this work, we describe a system that uses vision-language models in a hierarchical structure, first reasoning over complex prompts and user feedback to deduce the most appropriate next step to fulfill the task, and then performing that step with low-level actions.
在这项工作中,我们描述了一个系统,该系统在分层结构中使用视觉-语言模型,首先对复杂的提示和用户反馈进行推理,以推断出完成任务的最合适的下一步,然后用低层级动作执行该步骤。
In contrast to direct instruction following methods that can fulfill simple commands (“pick up the cup”), our system can reason through complex prompts and incorporate situated feedback during task execution (“that’s not trash”).
相比于 可以完成简单命令(“拿起杯子”)的直接指令遵循方法,我们的系统可以通过复杂的提示进行推理,并在任务执行过程中整合 situated 反馈(“这不是垃圾”)。
We evaluate our system across three robotic platforms, including single-arm, dual-arm, and dual-arm mobile robots, demonstrating its ability to handle tasks such as cleaning messy tables, making sandwiches, and grocery shopping.
我们在三个机器人平台上评估了我们的系统,包括单臂、双臂 和 双臂移动机器人,演示了它处理诸如清理凌乱的桌子、制作三明治和购买杂货等任务的能力。
1. 引言
↓ 【 目标 】
A defining feature of intelligence is its flexibility: people not only excel at complex tasks but also adapt to new situations, modify behaviors in real time, and respond to diverse inputs, corrections, and feedback.
智能的一个决定性特征是它的灵活性:人们不仅擅长复杂任务,还能适应新情境,实时纠正行为,并回应各种输入、修正和反馈。
Achieving this kind of flexibility is essential for robots in open-ended, human-centric environments.
在开放式的、以人为中心的环境中,实现这种灵活性对于机器人至关重要。
For instance, consider a robot tasked with tidying up a table after a meal: instead of rigidly following a single predefined set of steps, the robot would need to interpret dynamic prompts like “only take away someone’s dishes if they are done eating,” respond to corrections like “leave it alone,” and adapt when faced with unfamiliar challenges, such as a delicate object that requires special handling.
例如,假设一个机器人的任务是在饭后收拾桌子:机器人不需要严格遵循单一的预定义步骤 集,而是需要解析动态提示,比如“只有在别人吃完饭的时候才把盘子拿走”,回应类似于 “不要碰它” 的修正,并在面对不熟悉的挑战时进行适应,比如需要特殊处理的精细物体。
This paper aims to advance robotic intelligence by enabling robots to interpret and act on diverse natural language commands, feedback, and corrections – a step towards creating agents that reason through tasks, integrate human feedback seamlessly, and operate with human-like adaptability.
本文旨在通过 使机器人能够 根据各种自然语言命令、反馈和修正 解析并采取行动来推进机器人智能 —— 这是朝着创造能够通过任务推理、无缝整合人类反馈并以类似人类的适应性操作的 agents 迈出的一步。
If we can enable a robot to process and engage with complex natural language interaction, we can unlock not only better instruction following, but also the ability for users to guide a robot through new tasks and correct the robot in real time.
如果我们能够让机器人处理和参与复杂的自然语言交互,我们不仅可以解锁更好的指令遵循,还可以让用户引导机器人完成新任务并实时纠正机器人。
↓ 【 工作难点 】
Achieving this level of flexibility and steerability in robotic systems is challenging.
在机器人系统中实现这种水平的灵活性和可操控性是具有挑战性的。
While standard language-conditioned imitation learning can follow simple, atomic instructions such as “pick up the coke can” (Brohan et al., 2022), real-world tasks are rarely so straightforward.
虽然标准的语言-条件模仿学习可以遵循简单的 atomic 指令,如 “拿起可乐罐”(Brohan et al., 2022),但现实世界的任务很少如此简单。
Imagine a more realistic prompt, such as: “Could you make me a vegetarian sandwich? I’d prefer it without tomatoes. Also, if you have ham or roast beef, could you make a separate sandwich with one of those for my friend?”
想象一个更现实的提示,比如:"你能给我做一个素食三明治吗?我希望不要放西红柿。还有,如果你有火腿或烤牛肉,你能用其中一种给我的朋友单独做一个三明治吗?"
This requires not only understanding the language, but also the ability to situate commands within the current context and compose existing skills (e.g., picking up the roast beef) to solve a new task.
这不仅需要理解语言,还需要在当前上下文中定位命令并组合现有技能(例如,拿起烤牛肉)来解决新任务的能力。
If the robot further receives corrections and feedback (“that’s not how you do it, you have to get lower, otherwise you’ll keep missing”), these must also be integrated dynamically into task execution.
如果机器人进一步接收到修正和反馈("不是这样做的,你得再低一点,否则你会一直抓不住"),这些也必须动态地整合到任务执行中。
This challenge resembles the distinction between Kahneman’s “System 1” and “System 2” cognitive processes (Kahneman, 2011).
这一挑战类似于 Kahneman 的 “系统 1” 和 “系统 2” 认知过程之间的区别(Kahneman, 2011)。
The "automatic” System 1 corresponds to a policy capable of executing straightforward commands by triggering pre-learned skills, while the more deliberative System 2 involves higher-level reasoning to parse complex long-horizon tasks, interpret feedback, and decide on an appropriate course of action.
“自动的”系统 1 对应于能够通过 触发 预先习得的技能来执行直接的命令的策略,而更审慎的系统 2 则涉及更高层级的推理,以解析复杂的 long-horizon 任务,解析反馈,并决定适当的行动方案。
Prior work in robotic instruction following has largely focused on atomic instructions (Stepputtis et al., 2020; Jang et al., 2022; Brohan et al., 2022), addressing only System 1-level behaviors.
机器人指令遵循的先前工作主要集中在 atomic 指令上 (Stepputtis 等人,2020;Jang et al., 2022;Brohan 等人,2022),只处理系统 1 层级的行为。
↓ 【 idea 要点 】
In this paper, we address the more intricate reasoning needed for complex prompts and feedback by introducing a hierarchical reasoning system for robotic control based on vision-language models (VLMs).
在本文中,我们通过引入基于视觉-语言模型(VLMs)的机器人控制分层推理系统来解决复杂提示和反馈所需的更复杂的推理。
In our system, the robot incorporates complex prompts and language feedback using a VLM, which is tasked with interpreting the current observations and user utterances, and generating suitable verbal responses and atomic commands (e.g., “grasp the cup”) to pass into the low-level policy for execution.
在我们的系统中,机器人使用 VLM 结合复杂的提示和语言反馈,该 VLM 的任务是解析 当前的观测和用户话语,并生成合适的口头响应和 atomic 命令(例如,“抓住杯子”),以传递到低层级策略中执行。
This low-level policy is itself a vision-language model finetuned for producing robotic actions, also known as a vision-language-action (VLA) model (Black et al., 2024; Brohan et al., 2023a; Kim et al., 2024; Wen et al., 2024).
这种低层级策略本身就是一种为产生机器人动作而微调的视觉-语言模型,也称为视觉-语言-动作(VLA)模型(Black et al., 2024;Brohan et al., 2023a;Kim et al., 2024;Wen et al., 2024)。
We expect that robot demonstrations annotated with atomic commands will not be sufficient for training the high-level model to follow complex, open-ended prompts, and we therefore need representative examples of complex prompt following.
我们预计,仅用 atomic 命令标注的机器人演示不足以训练高层级模型遵循复杂的开放式提示,因此我们需要复杂提示遵循的代表性示例。
To acquire this data, we propose to synthetically label datasets consisting of robot observations and actions with hypothetical prompts and human interjections that might have been plausible for that situation.
为了获得这些数据,我们提出用假设的提示和在该情形下可能合理的人类插话,来合成标注由机器人观测和动作组成的数据集。
To this end, we provide a state-of-the-art vision-language model with a robot observation and target atomic command, and ask it to come up with a prompt or human interaction that may have preceded that observation and command, i.e. generating high-level policy prompts for different outcomes.
为此,我们向一个最先进的视觉-语言模型提供机器人观测和目标 atomic 命令,并要求它构想出可能先于该观测和命令出现的提示或人类交互,即为不同结果生成高层级策略提示。
By incorporating these synthetically-generated but situated examples into high-level policy training, our approach generalizes to diverse prompts and interjections while maintaining grounding in the robot’s capabilities.
通过将这些合成生成的 situated 实例 整合到高层级策略的训练中,我们的方法可以推广到不同的提示和插话,同时保持机器人能力的 grounding。
↓ 【 贡献 】
The main contribution of our paper is a hierarchical interactive robot learning system (Hi Robot), a novel framework that uses VLMs for both high-level reasoning and low-level task execution.
本文的主要贡献是一个分层交互式机器人学习系统(Hi Robot),这是一个使用 VLMs 进行高层级推理和低层级任务执行的新框架。
We show that our framework enables a robot to process much more complex prompts than prior end-to-end instruction following systems and incorporate feedback during task execution (Figure 1).
我们展示了我们的框架使机器人能够处理比之前的端到端指令遵循系统复杂得多的提示,并在任务执行期间整合反馈(图 1)。
While some of the individual components of this system, such as the low-level VLA policy, have been studied in prior work, the combination of these components along with our synthetic data generation scheme are novel and enable novel capabilities.
虽然该系统的一些单独组件(如低层级 VLA 策略)已经在先前的工作中进行了研究,但这些组件与我们的合成数据生成方案的组合是新颖的,且实现了新颖的功能。
We evaluate Hi Robot on diverse robots, including single-arm, dual-arm, and mobile platforms.
我们在不同的机器人上评估 Hi Robot,包括单臂、双臂和移动平台。
Our evaluation requires the robots to perform a variety of tasks, including new combinations of skills seen during training, in the context of scenarios that span cleaning of messy tables, making sandwiches, and grocery shopping.
我们的评估要求机器人执行各种任务,包括训练中见过的技能的新组合,场景涵盖清理凌乱的桌子、制作三明治和购买杂货。
Our experiments show that Hi Robot surpasses multiple prior approaches, including using API-based VLMs and flat VLA policies, in both alignment with human intent and task success.
我们的实验表明,Hi Robot 在对齐人类意图和任务成功方面超越了多种先前的方法,包括使用基于 API 的 VLMs 和 flat VLA 策略。 〔 哪些方法比不过? 〕
By grounding high-level reasoning in both verbal and physical interaction, Hi Robot paves the way for more intuitive and steerable human-robot symbiosis, advancing the potential for flexible intelligence in real-world applications.
通过在口头和物理交互中建立高层级推理基础,Hi Robot 为更直观和可操控的人机合作铺平了道路,促进了现实世界应用中灵活智能的潜力。
Figure 1: Open-ended instruction following.
图 1:开放式指令遵循。
Hi Robot enables robots to follow multi-stage instructions, adapt to real-time corrections and constraints, complete unseen long-horizon tasks, and respond verbally when needed.
Hi Robot 使机器人能够遵循多阶段指令,适配实时修正和约束,完成没见过的 long-horizon 任务,并在需要时进行口头回应。
2. 相关工作
↓ 【 当前有哪些方法 (核心 idea + 优缺点 ) + 我们的方法 (核心 idea + 优点 ) 】
Our work relates to research on VLMs for robotic control, which we can categorize into two groups: directly training VLMs for robotic control and using VLMs out-of-the-box with pre-defined robot skills.
我们的工作涉及用于机器人控制的 VLMs 研究,我们可以将其分为两组:直接训练 VLMs 用于机器人控制 和 使用具有预定义机器人技能的开箱即用 VLMs。
In the former category, methods fine-tune VLMs to output robotic controls based on input images and language commands (Brohan et al., 2023a; Wen et al., 2024; Kim et al., 2024; Black et al., 2024; Liu et al., 2024c; Li et al., 2024; O’Neill et al., 2024; Zawalski et al., 2024; Zheng et al., 2025; Pertsch et al., 2025).
在前一类中,这些方法根据输入图像和语言命令微调 VLMs 以输出机器人控制(Brohan et al., 2023a;Wen et al., 2024;Kim et al., 2024;Black et al., 2024;Liu et al., 2024c;Li et al., 2024;O'Neill et al., 2024;Zawalski et al., 2024;Zheng et al., 2025;Pertsch et al., 2025)。
While such methods have demonstrated impressive generalization and instruction-following, they are trained for relatively simple commands (“put the cup on the plate”).
虽然这些方法已经展示了令人印象深刻的泛化和指令遵循,但它们被训练用于相对简单的命令(“把杯子放在盘子上”)。
In contrast, we demonstrate tasks with intricate prompts and human interactions that require situated reasoning.
相比之下,我们演示的任务具有复杂的提示和人类交互,需要情境推理(situated reasoning)。
In the latter category, a number of methods use LLMs and VLMs to reason over robot observations and commands, and break up multi-stage tasks into simpler steps that can be performed by low-level controllers.
在后一类中,许多方法使用 LLMs 和 VLMs 对机器人的观测和命令进行推理,并将多阶段任务分解为可以由低层级控制器执行的更简单的步骤。
Earlier methods of this sort used language models in combination with various learned or hand-designed skills (Huang et al., 2022; Brohan et al., 2023b; Liang et al., 2023; Shah et al., 2024; Singh et al., 2023; Wang et al., 2024), but such systems have limited ability to incorporate complex context, such as image observations, into the reasoning process.
这种类型的早期方法使用 语言模型 结合 各种习得的或手工设计的技能 (Huang et al., 2022;Brohan 等人,2023b;Liang 等,2023;Shah et al., 2024;Singh et al., 2023;Wang et al., 2024),但这种系统将复杂背景(如图像观测)纳入推理过程的能力有限。 【 方式 1 核心 idea + 缺点 】
More recently, multiple works have use VLMs to output parameters for pre-defined robotic skills (Huang et al., 2023; Liu et al., 2024a; Nasiriany et al., 2024; Chen et al., 2024; Liu et al., 2024b; Stone et al., 2023; Qiu et al., 2024; Zhi et al., 2024).
最近,许多工作使用 VLMs 来输出预定义机器人技能的参数(Huang et al., 2023;Liu et al., 2024a;Nasiriany et al., 2024;Chen et al., 2024;Liu et al., 2024b;Stone et al., 2023;Qiu et al., 2024;Zhi et al., 2024)。【 方式 2 核心 idea 】
Such methods can process more complex commands and situate them in the context of visual observations, but these approaches have shown limited physical dexterity and limited ability to incorporate real-time language interaction with humans (with some exceptions discussed below).
这些方法可以处理更复杂的命令,并将它们置于视觉观测的环境中,但这些方法显示出有限的物理灵活性和有限的与人类进行实时语言交互的能力(下面将讨论一些例外)。【 方式 2 优缺点 】
In contrast, our system utilizes VLMs for both high-level reasoning and low-level control, with a flexible language interface between the two.
相比之下,我们的系统利用 VLMs 进行高层级推理和低层级控制,两者之间有灵活的语言接口。【 我们的方式 核心 idea 】
These design choices, along with a new synthetic data generation scheme, allow our system to achieve both significant physical dexterity and detailed prompt ability that prior works lack.
这些设计选择,以及新的合成数据生成方案,使我们的系统实现了以前的工作所缺乏的显著的物理灵活性和详细的提示能力。【 我们的方式 优点 】
Many works aim to enable robotic language interaction with users, including model-based systems that parse language instructions and feedback and ground them via a symbolic representation of the scene (Swadzba et al., 2009; Matuszek et al., 2013; Namasivayam et al., 2023; Patki et al., 2019), and more recent learning-based methods that process feedback directly, typically with a hierarchical architecture (Liu et al., 2023; Xiao et al., 2024; Shi et al., 2024; Belkhale et al., 2024; Singh et al., 2024; McCallum et al.; Driess et al., 2023; Dai et al., 2024).
很多工作旨在使机器人与用户进行语言交互,包括基于模型的系统,该类系统解析语言指令和反馈,并通过场景的符号表示将其 ground(Swadzba et al., 2009;Matuszek et al., 2013;Namasivayam et al., 2023;Patki et al., 2019),以及最近通常采用分层架构直接处理反馈的基于学习的方法(Liu et al., 2023;Xiao et al., 2024;Shi et al., 2024;Belkhale et al., 2024;Singh et al., 2024;McCallum et al.;Driess et al., 2023;Dai et al., 2024)。
Our work builds on the latter class of methods, where user feedback is incorporated via a high-level policy that provides atomic commands to a learned low-level policy.
我们的工作基于后一类方法,其中用户反馈通过 高层级策略 整合,该策略向习得的低层级策略提供 atomic 命令。
Unlike OLAF (Liu et al., 2023), which uses an LLM to modify robot trajectories, our approach can incorporate situated corrections based on the robot’s observations, respond to those corrections in real time, and follow complex prompts describing dexterous manipulation tasks.
与 OLAF (Liu et al., 2023)不同,OLAF 使用 LLM 来修改机器人轨迹,我们的方法可以根据机器人的观测整合 situated 修正,实时响应这些修正,并遵循描述灵巧操作任务的复杂提示。【 与 同类工作 1 的区别 】 〔 ✅ 这里的改进点是? Hi Robot 换成了 VLM,因而可即时响应用户反馈 〕
While YAY Robot (Shi et al., 2024) can handle situated real-time corrections, it is limited to one prompt and to the corrections seen in the human-written data; our approach leverages VLMs and a new data generation scheme to enable diverse prompts and open-ended corrections.
虽然 YAY Robot (Shi et al., 2024)可以处理 situated 实时修正,但它仅限于一个提示和人类编写数据中见过的修正;我们的方法利用 VLMs 和一个新的数据生成方案来实现各种提示和开放式修正。【 同类工作 2 的不足 + 我们的改进方案】
Finally, RACER (Dai et al., 2024) can also incorporate situated corrections, but relies on a physics simulator to construct recovery behaviors; our approach only uses real robot demonstrations without intentional perturbations or corrections and is applicable to open-ended prompts.
最后,RACER (Dai et al., 2024)也可以整合 situated 修正,但依赖物理模拟器来构建恢复行为;我们的方法只使用真实的机器人演示,没有故意的扰动或修正,适用于开放式提示。【 同类工作 3 的不足 + 我们的改进方案】
—————— 补充 Start
https://arxiv.org/abs/2310.17555
OLAF 的一个关键特征是它能够基于口头反馈更新机器人的视觉运动神经策略,以避免未来重复错误。
OLAF:第一个可以使用普通非专业用户的口头修正来更新视觉运动神经网络策略的学习系统。OLAF 使用 LLM 将口头修正转换为低层级动作标签,以合成用于更新策略的数据集。
当前的 OLAF 设计有几个限制。首先,尽管我们使用 OLAF 来训练基于 transformer 的视觉运动策略,但 LLM 需要文本化的状态估计来重新标记动作。此外,我们需要手工制作与任务相关的提示,以便 LLM 理解状态信息。
Fig. 2: OLAF System.
The OLAF pipeline consists of three steps: User Interaction, Data Synthesis, and Policy Update.
OLAF pipeline 包括 3 步:用户交互、数据合成和策略更新。
In User Interaction, it collects pairs of ⟨robot trajectory, verbal correction⟩ of trajectories stopped by the user.
在用户交互中,它收集由用户中止的轨迹的 ⟨机器人轨迹, 口头修正⟩ 对。
In Data Synthesis, it uses the LLM as a critic to select the action (from a pool of action candidates) that best matches the user’s verbal correction and relabels the pre-intervention trajectory segments (in red).
在数据合成中,它使用 LLM 作为 critic 来选择最符合用户口头纠正的动作(从动作候选池中),并重新标记干预前的轨迹段(红色)。
In Policy Update, it updates the policy by performing behavior cloning on the newly synthesized data and the previously collected data.
在 Policy Update 中,它通过对新合成的数据和以前收集的数据执行行为克隆来更新策略。
—————— 补充 End
3. 预备知识和问题陈述
习得的策略通过处理观测输入来控制机器人,我们将观测表示为 $\mathbf{o}_t$,并产生一个或多个动作 $\mathbf{A}_t = [\mathbf{a}_t, \mathbf{a}_{t+1}, \cdots, \mathbf{a}_{t+H-1}]$,其中我们使用 $\mathbf{A}_t$ 表示由后续 $H$ 个动作组成的动作块(Zhao et al., 2023)。
我们的系统将来自多个摄像头的图像 $\mathbf{I}_t^1,\cdots,\mathbf{I}_t^n$、机器人的配置(即关节和夹爪位置)$\mathbf{q}_t$ 以及语言提示 $\ell_t$ 作为输入。
因此,我们有 $\mathbf{o}_t = [\mathbf{I}_t^1,\cdots,\mathbf{I}_t^n,\ell_t,\mathbf{q}_t]$,策略表示分布 $p(\mathbf{A}_t|\mathbf{o}_t)$。
之前的工作已经提出了各种方法来表示和训练这些策略(Zhao et al., 2023;Chi et al., 2023;Octo Model Team et al., 2024;Pertsch et al., 2025)。
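为便于对照上面的符号,下面给出一个最小的接口示意(非论文官方代码,类名与字段名均为笔者假设):观测 $\mathbf{o}_t$ 打包多路相机图像、语言提示 $\ell_t$ 与机器人配置 $\mathbf{q}_t$,策略则从 $p(\mathbf{A}_t|\mathbf{o}_t)$ 中采样长度为 $H$ 的动作块。

```python
# 一个最小的接口示意(非论文官方实现;类名、字段名均为笔者为说明而假设)
from dataclasses import dataclass
from typing import List, Protocol
import numpy as np

@dataclass
class Observation:
    images: List[np.ndarray]   # I_t^1, ..., I_t^n,各摄像头图像
    prompt: str                # 语言提示 ℓ_t
    state: np.ndarray          # 机器人配置 q_t(关节与夹爪位置)

class LanguageConditionedPolicy(Protocol):
    horizon: int               # 动作块长度 H

    def sample_action_chunk(self, obs: Observation) -> np.ndarray:
        """从 p(A_t | o_t) 采样,返回形状为 (H, action_dim) 的动作块 A_t。"""
        ...
```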
Since our focus will be specifically on complex, multi-stage tasks that require parsing intricate prompts and even dynamic user feedback, we need our policies to be able to interpret complex language and ground it via observations of the environment.
由于我们的重点将特别放在复杂的、多阶段的任务上,这些任务需要解析复杂的提示,甚至动态的用户反馈,因此我们需要我们的策略能够解释复杂的语言,并通过对环境的观测来 ground(落地)它。
A particularly powerful approach for handling such complex semantics is provided by vision-language-action (VLA) models (Black et al., 2024; Brohan et al., 2023a; Kim et al., 2024; Wen et al., 2024), which use vision-language model (VLM) pre-training to initialize the policy $p(\mathbf{A}_t|\mathbf{o}_t)$.
处理这种复杂语义的一种特别强大的方法是视觉-语言-动作(VLA)模型(Black et al., 2024;Brohan et al., 2023a;Kim et al., 2024;Wen et al., 2024),它们使用视觉-语言模型(VLM)预训练来初始化策略 $p(\mathbf{A}_t|\mathbf{o}_t)$。
VLM 是一种同时经过训练来处理图像输入的语言模型,它表示一个分布 $p(\ell'|\mathbf{I},\ell)$ —— 即语言后缀 $\ell'$(例如,问题的答案)响应由图像 $\mathbf{I}$ 和提示 $\ell$(例如,视觉问题)组成的图像-语言前缀的概率。
最常用的 VLMs 通过一个自回归的仅解码器 Transformer 模型表示 $p(\ell'|\mathbf{I},\ell)$,将该分布分解为自回归 token 概率 $p(\mathbf{x}_{t+1}|\mathbf{x}_1,\cdots,\mathbf{x}_t,\mathbf{I})$ 的乘积,其中 $\mathbf{x}_t$ 表示第 $t$ 个 token(不要与物理时间步混淆),并且 $\ell = [\mathbf{x}_1,\cdots,\mathbf{x}_{t_p}]$、$\ell' = [\mathbf{x}_{t_p+1},\cdots,\mathbf{x}_{t_p+t_s}]$,其中 $t_p$ 为前缀长度,$t_s$ 为后缀长度(Beyer et al., 2024)。
我们也使用这种基于 Transformer 的 VLMs,但由于我们不修改它们的架构,它们的自回归结构与我们的讨论无关,我们将使用更简洁的 $p(\ell'|\mathbf{I},\ell)$ 记法来表示标准 VLM。
(Beyer et al., 2024) 即来自 Google DeepMind 的工作 PaliGemma: A versatile 3B VLM for transfer。
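下面用一小段示意代码说明 $p(\ell'|\mathbf{I},\ell)$ 如何分解为逐 token 概率的乘积并自回归地采样后缀;其中 `vlm.next_token_logits` 是笔者假设的接口,并非 PaliGemma 的真实 API:

```python
# 示意性的自回归解码过程(接口为笔者虚构,仅说明逐 token 分解与采样)
import numpy as np

def sample_suffix(vlm, images, prefix_tokens, eos_id, max_len=64, rng=np.random):
    """给定图像 I 与前缀 ℓ(tokens),逐 token 采样后缀 ℓ'。"""
    tokens = list(prefix_tokens)
    suffix = []
    for _ in range(max_len):
        logits = vlm.next_token_logits(images, tokens)   # p(x_{t+1} | x_1..x_t, I) 的 logits
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        nxt = int(rng.choice(len(probs), p=probs))
        if nxt == eos_id:
            break
        tokens.append(nxt)
        suffix.append(nxt)
    return suffix                                        # ℓ' = [x_{t_p+1}, ..., x_{t_p+t_s}]
```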
标准的 VLA 是通过对 VLM $p(\ell'|\mathbf{I},\ell)$ 进行微调而产生的,使得动作 $\mathbf{A}_t$ 由后缀 $\ell'$ 中的 tokens 表示,通常是通过离散化来 tokenize 动作。我们基于 $\pi_0$ VLA(Black et al., 2024),它额外处理多个图像和连续状态观测 $\mathbf{q}_t$,并修改 VLM 以通过流匹配输出连续的动作块分布,但高层级原理是相似的。
While such VLA models can follow a wide variety of language prompts (Brohan et al., 2023a), by themselves they are typically limited to simple and atomic commands, and do not handle the complex prompts and feedback that we study in this paper.
虽然这样的 VLA 模型可以遵循各种各样的语言提示(Brohan 等人,2023a),但它们本身通常仅限于简单和 atomic 命令,并且不处理我们在本文中研究的复杂提示和反馈。〔 π 0 π_0 π0 VLA 〕
4. Hi Robot
We provide an overview of our method in Figure 2.
我们在图 2 中概述了我们的方法。
Our approach decomposes the policy $p(\mathbf{A}_t|\mathbf{o}_t)$ into a low-level and high-level inference process, where the low-level policy consists of a VLA that produces the action chunk $\mathbf{A}_t$ in response to a simpler, low-level language command, and the high-level policy consists of a VLM that processes the open-ended task prompt, and outputs these low-level language commands for the low-level inference process.
我们的方法将策略 $p(\mathbf{A}_t|\mathbf{o}_t)$ 分解为低层级和高层级推理过程,其中低层级策略由一个 VLA 组成,该 VLA 响应更简单的低层级语言命令生成动作块 $\mathbf{A}_t$;高层级策略由一个 VLM 组成,该 VLM 处理开放式任务提示,并为低层级推理过程输出这些低层级语言命令。
The two processes run at different rates: the low-level process produces action chunks at a high frequency, while the high-level process is invoked less often, either after a set time or upon receiving new language feedback.
这两个流程以不同的速率运行:低层级流程以高频率产生动作块,而高层级流程调用的频率较低,要么在设定的时间之后,要么在接收到新的语言反馈后。
Thus, the high-level process essentially “talks” to the low-level process, breaking down complex prompts and interactions into bite-sized commands that can be converted into actions.
因此,高层级流程本质上与低层级流程“对话”,将复杂的提示和交互 分解为可转换为动作的小命令。
Figure 2: Overview of hierarchical VLA.
图 2:分层 VLA 概述。
The policy consists of a high-level and a low-level policy.
策略由高层级策略和低层级策略组成。
The high-level policy processes open-ended instructions and images from base and wrist-mounted cameras to generate low-level language commands.
高层级策略处理 开放式指令 和 来自底座和腕部摄像头的图像,以生成低层级语言命令。
The low-level policy uses these commands, images, and robot states to produce actions and optionally verbal responses.
低层级策略使用这些命令、图像和机器人状态来生成动作和可选的口头响应。
4.1 用 VLAs 进行分层推理
在形式上,高层级策略 $p^{\rm hi}(\hat\ell_t \mid \mathbf{I}_t^1,\cdots,\mathbf{I}_t^n,\ell_t)$ 接收图像观测和一个开放式提示 $\ell_t$,并产生一个中间语言命令 $\hat\ell_t$。
低层级策略 $p^{\rm lo}(\mathbf{A}_t \mid \mathbf{I}_t^1,\cdots,\mathbf{I}_t^n,\hat\ell_t,\mathbf{q}_t)$ 接收与第 3 节中描述的标准 VLA 相同类型的观测,
except that the language command $\ell_t$ is replaced by the output from the high-level policy $\hat\ell_t$.
只不过语言命令 $\ell_t$ 被替换为高层级策略的输出 $\hat\ell_t$。
Thus, following the System 1/System 2 analogy, the job of the high-level policy is to take in the overall task prompt $\ell_t$ and accompanying context, in the form of images and user interactions, and translate it into a suitable task for the robot to do at this moment, represented by $\hat\ell_t$, that the low-level policy is likely to understand.
因此,按照系统 1 / 系统 2 的类比,高层级策略的工作是接收整体任务提示 $\ell_t$ 以及以图像和用户交互形式给出的相应上下文,并将其转换为低层级策略可理解的、适合机器人此时执行的任务,用 $\hat\ell_t$ 表示。
Of course, for simple and familiar tasks, this is not necessary — if we simply want the robot to perform a task that the low-level policy was directly trained for, we could simply set $\hat\ell_t=\ell_t$ and proceed as in prior work (Brohan et al., 2022).
当然,对于简单和熟悉的任务,这是没有必要的 —— 如果我们只是想让机器人执行低层级策略直接训练过的任务,我们可以简单地设置 $\hat\ell_t=\ell_t$,然后像之前的工作一样继续(Brohan et al., 2022)。
The benefit of this hierarchical inference process is in situations where either the prompt $\ell_t$ is too complex for the low-level policy to parse, too unfamiliar in the context of the robot data, or involves intricate interactions with the user.
这种分层推理过程的好处体现在以下情形:提示 $\ell_t$ 太复杂、低层级策略无法解析,或者在机器人数据的环境中太不熟悉,或者涉及与用户的复杂交互。
The high-level policy is represented by a VLM that uses the images and $\ell_t$ as the prefix, and produces $\hat\ell_t$ as the suffix.
高层级策略由一个 VLM 表示,该 VLM 使用图像和 $\ell_t$ 作为前缀,并生成 $\hat\ell_t$ 作为后缀。
We describe how this model is trained in Section 4.3.
我们将在第 4.3 节中描述如何训练该模型。
Since high-level inference is slower but also less sensitive to quick changes in the environment, we can comfortably run it at a lower frequency.
由于高层级推理速度较慢,对环境的快速变化也不太敏感,因此我们可以以较低的频率运行它。
A variety of strategies could be used to instantiate this, including intelligent strategies where the system detects when the command $\hat\ell_t$ has been completed before inferring the next suitable command.
可以使用各种策略来实例化这一点,包括更智能的策略:系统先检测命令 $\hat\ell_t$ 何时已经完成,再推断下一个合适的命令。
In our implementation, we found a very simple strategy to work well: we rerun high-level inference and recompute $\hat\ell_t$ either when one second has elapsed, or when a new interaction with the user takes place.
在我们的实现中,我们发现一个非常简单的策略效果很好:每当经过一秒钟,或与用户发生新的交互时,我们就重新运行高层级推理并重新计算 $\hat\ell_t$。
This provides reactive behavior when the user provides feedback or corrections, while maintaining simplicity.
当用户提供反馈或修正时,这提供了响应性行为,同时保持了简单性。
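把 4.1 节的调度策略串起来,大致是下面这样一个主循环(纯示意:`high_level`、`low_level`、`robot`、`user_feedback_queue` 等接口均为笔者虚构,1 秒的高层级周期取自正文):

```python
# 分层推理主循环的一个最小示意(接口为笔者假设,仅说明第 4.1 节的调度策略)
import time

HIGH_LEVEL_PERIOD = 1.0  # 每隔一秒,或收到新的用户交互时,重新运行高层级推理

def run_episode(high_level, low_level, robot, user_feedback_queue, prompt, duration=120.0):
    start = time.time()
    last_hi = -float("inf")
    command = None                                    # 当前的低层级命令 \hat{ℓ}_t
    while time.time() - start < duration:
        feedback = user_feedback_queue.poll()         # 新的用户话语(若无则为 None)
        if feedback is not None or time.time() - last_hi > HIGH_LEVEL_PERIOD:
            images = robot.get_images()
            command, utterance = high_level.predict(images, prompt, feedback)
            if utterance:                             # 可选的口头回应 u_t,经 TTS 播放
                robot.speak(utterance)
            last_hi = time.time()
        obs = robot.get_observation(command)          # 图像 + 机器人状态 + \hat{ℓ}_t
        actions = low_level.sample_action_chunk(obs)  # 高频地产生动作块 A_t
        robot.execute(actions)
```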
4.2. 整合用户交互
The user can intervene at any point during policy execution and provide additional information and feedback, or even change the task entirely.
用户可以在策略执行过程中的任何时候进行干预,并提供额外的信息和反馈,甚至可以完全更改任务。
In our prototype, these interventions take the form of text commands or spoken language (which is then transcribed into text).
在我们的原型中,这些干预采取文本命令或口语的形式(然后转录成 文本)。
When the system receives a user intervention, the high-level inference is triggered immediately to recompute $\hat\ell_t$.
当系统接收到用户干预时,会立即触发高层级推理来重新计算 $\hat\ell_t$。
The high-level policy has the option to include a verbal utterance $u_t$ in the command $\hat\ell_t$, which can be confirmations or clarifications from the robot.
高层级策略可以选择在命令 $\hat\ell_t$ 中包含口头话语 $u_t$,它可以是机器人的确认或澄清。
When $u_t$ is included, we use a text to speech system to play the utterance to the user, and remove it from $\hat\ell_t$ before passing it into the low-level policy.
当包含 $u_t$ 时,我们使用文本到语音系统向用户播放该话语,并在将其传递到低层级策略之前将其从 $\hat\ell_t$ 中删除。
utterances:参考链接
Words and other non-linguistic sounds, which we call fillers (breath, um, uh, cough), form utterances.
单词和其他非语言声音(我们称之为填充音,如呼吸、嗯、呃、咳嗽)构成了话语(utterances)。
When an interjection ("leave it alone”) has been fulfilled, the user can signal to the robot that it may switch back to the previous command and continue the task execution.
当一句插话(“别管它”)完成后,用户可以向机器人发出信号,它可以切换回前面的命令并继续执行任务。
Notably, the responses of the high-level policy are contextual, because it observes not only the prompt $\ell_t$, but also the current image observations.
值得注意的是,高层级策略的响应是上下文相关的,因为它不仅观测提示 $\ell_t$,还观测当前的图像。
Therefore, it can correctly ground feedback like “that’s not trash,” which is not possible with language-only systems.
因此,它可以正确地 ground 诸如“这不是垃圾”之类的反馈,这在仅语言系统中是不可能的。
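4.2 节提到高层级输出可以附带口头话语 $u_t$,播放给用户后再从命令中移除。下面是一个极简的拆分示意;论文没有给出高层级输出的具体格式,这里假设用"话语 | 命令"这种人为约定的分隔符,仅作说明:

```python
# 拆分高层级输出中的口头话语 u_t 与低层级命令 \hat{ℓ}_t(输出格式为笔者假设)
def split_high_level_output(text: str, sep: str = "|"):
    """返回 (口头话语 u_t 或 None, 低层级命令 \\hat{ℓ}_t)。"""
    if sep in text:
        utterance, command = (part.strip() for part in text.split(sep, 1))
        return (utterance or None), command
    return None, text.strip()

# 用法示意:
# u, cmd = split_high_level_output("Sure, I won't put cheese on it. | pick up one slice of bread")
# 若 u 不为空,则交给 TTS 播放;cmd 传给低层级策略执行。
```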
4.3. 数据收集和训练 Hi Robot
To train Hi Robot in a scalable manner, we employ both human-labeled and synthetically generated interaction data, as illustrated in Figure 3.
为了以可扩展的方式训练 Hi Robot,我们使用人工标记和合成生成的交互数据,如图 3 所示。
First, we collect robot demonstration data $\mathcal{D}_\text{demo}$ via teleoperation.
首先,我们通过远程操作收集机器人演示数据 $\mathcal{D}_\text{demo}$。
This yields trajectories with coarse language annotations of the overall goal (e.g., make a sandwich).
这将产生带有总体目标的粗糙语言标注的轨迹(例如,做三明治)。
We then segment these full demonstration episodes into short skills, $\hat\ell_t$, such as pick up one piece of lettuce, which generally last between one and three seconds.
然后,我们将这些完整的演示回合分割成简短的技能 $\hat\ell_t$,例如 捡起一片生菜,这些技能通常持续一到三秒。
We also heuristically extract basic movement primitives (e.g., small corrective motions) such as move the right arm to the left from the raw robot actions.
我们还从原始机器人动作中启发式地提取基本的运动基元(例如,小的修正运动),比如 将右臂向左移动。
The resulting dataset $\mathcal{D}_\text{labeled}$ contains a set of $(\hat\ell_t, \mathbf{I}_t^1,\cdots,\mathbf{I}_t^n)$ tuples that describe robot skills.
得到的数据集 $\mathcal{D}_\text{labeled}$ 包含一组 $(\hat\ell_t, \mathbf{I}_t^1,\cdots,\mathbf{I}_t^n)$ 元组,描述机器人技能。
Figure 3: Data collection and generation for training the high-level policy.
图 3:用于训练高层级策略的数据收集和生成。
We first collect teleoperated robot demonstrations and segment them into short skills (e.g., pick up KitKat).
我们首先收集远程操作的机器人演示,并将它们分成短技能(例如,拿起奇巧)。
Using this labeled data, we prompt a vision-language model (VLM) to generate synthetic user instructions (e.g., “Can you get me something sweet?”) and robot responses.
使用这些带标签数据,我们提示视觉-语言模型(VLM)生成合成的用户指令(例如,“你能给我拿点甜的吗?”)和机器人响应。
The resulting dataset is used to train the high-level policy, which maps image observations and user commands to verbal responses and skill labels.
得到的数据集用于训练高层级策略,该策略将图像观测和用户命令映射到口头响应和技能标签。
接下来,我们使用大型视觉-语言模型(VLM)$p^\text{gen}$ 生成合成的用户提示和插话 $\ell_t$,以及相应的机器人话语 $u_t$。
给定 $\mathcal{D}_\text{labeled}$,我们用视觉背景 $\mathbf{I}_t^1,\cdots,\mathbf{I}_t^n$ 和技能标签 $\hat\ell_t$(例如,捡起生菜)提示 $p^\text{gen}$ 〔 VLM 〕。
然后,$p^\text{gen}$ 想象一个可能在真实用户交互中引出 $\hat\ell_t$ 〔 技能 〕的合理交互:它生成可能的用户提示 $\ell_t$(例如:"你能给我加点生菜吗?"),以及机器人随后做出的口头回应和澄清 $u_t$。
我们在附录 A 中详细介绍了合成数据集 $\mathcal{D}_\text{syn}$ 的生成。
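按上面的描述,可以用类似下面的方式调用 $p^\text{gen}$ 生成一条合成交互(`query_vlm` 的提示词与返回格式均为笔者假设,真实的提示构建细节见附录 A):

```python
# 合成数据生成的一个最小示意(提示词与 JSON 返回格式为笔者假设)
import json

def generate_synthetic_pair(query_vlm, images, skill_label):
    """给定视觉背景与技能标签 \\hat{ℓ}_t,让 p^gen 想象可能引出该技能的用户提示 ℓ_t 与机器人话语 u_t。"""
    prompt = (
        "You see a robot workspace in the attached images. The robot is about to perform the "
        f"skill: '{skill_label}'. Imagine a plausible user request that could have led to this "
        "skill, and a short verbal response from the robot. "
        'Answer as JSON: {"user_prompt": ..., "robot_utterance": ...}.'
    )
    reply = query_vlm(images=images, text=prompt)
    parsed = json.loads(reply)
    return parsed["user_prompt"], parsed["robot_utterance"]

# 得到的 (图像, ℓ_t, u_t, \hat{ℓ}_t) 元组即加入 D_syn,用于训练高层级策略。
```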
我们在 $\mathcal{D}_\text{syn} \cup \mathcal{D}_\text{labeled}$ 上使用交叉熵损失、以 next-token 预测的方式训练高层级策略 $p^{\rm hi}(\hat\ell_t \mid \mathbf{I}_t^1,\cdots,\mathbf{I}_t^n,\ell_t)$。
为了训练低层级策略 $p^{\rm lo}(\mathbf{A}_t \mid \mathbf{I}_t^1,\cdots,\mathbf{I}_t^n,\hat\ell_t,\mathbf{q}_t)$,我们遵循 Black 等人(2024),在 $\mathcal{D}_\text{labeled} \cup \mathcal{D}_\text{demo}$ 上使用流匹配目标。
4.4. 模型架构 和 实现
In our implementation, the low-level and high-level policies use the same base VLM as a starting point, namely the PaliGemma-3B VLM (Beyer et al., 2024).
在我们的实现中,低层级和高层级策略使用相同的基础 VLM 作为起点,即 PaliGemma-3B VLM (Beyer et al., 2024)。
The low-level policy is the $\pi_0$ VLA (Black et al., 2024), which is trained by finetuning PaliGemma-3B with an additional flow matching "action expert" to produce continuous actions, while the high-level policy is fine-tuned on the image-language tuples described in Section 4.3 to predict commands.
低层级策略是 $\pi_0$ VLA(Black et al., 2024),它通过在 PaliGemma-3B 上附加一个流匹配"动作专家"并进行微调来训练,以产生连续动作;而高层级策略则在第 4.3 节描述的图像-语言元组上微调,用于预测命令。
While we employ $\pi_0$ for our experiments, our framework is inherently modular, allowing for the integration of alternative language-conditioned policies as needed.
虽然我们在实验中使用 $\pi_0$,但我们的框架本质上是模块化的,允许根据需要集成其它 language-conditioned 策略。
5. 实验
In our experimental evaluation, we study a range of problems that combine challenging physical interactions with complex user interaction, including multi-stage instructions, live user feedback in the middle of the task, and prompts that describe novel task variations.
在我们的实验评估中,我们研究了一系列问题,这些问题将具有挑战性的物理交互与复杂的用户交互相结合,包括多阶段指令、任务中间的实时用户反馈以及描述新任务变化的提示。
We compare our full method to prior approaches and to alternative designs that use other high-level policy training methods.
我们将我们的完整方法与先前的方法以及使用其它高层级策略训练方法的替代设计进行比较。
The aims of our experiments are: 【 3 个实验目的 】
我们实验的目的是:
1、Evaluate the ability of our method to follow a variety of complex textual prompts and live user feedback.
评估我们的方法遵循各种复杂的文本提示和实时用户反馈的能力。
2、Compare our full method to prior approaches that train a flat instruction-following VLA policy or that use foundation models out-of-the-box for high-level reasoning.
将我们的完整方法与之前训练扁平的指令遵循 VLA 策略的方法 或 使用开箱即用的基础模型进行高层级推理的方法进行比较。
3、Evaluate the importance of synthetic data and hierarchy for task performance and language following.
评估合成数据和层次结构对任务表现和语言遵循的重要性。
5.1 任务和基线方法
We use three complex problem domains in our experiments, as shown in Figure 4.
我们在实验中使用了 3 个复杂的问题域,如图 4 所示。
Figure 4: Task domains used in our evaluation.
图 4:评估中使用的任务域。
Across three domains, we evaluate complex instructions, intermediate feedback, and user interruptions.
在三个领域中,我们评估了复杂指令、中间反馈和用户打断。
For example, in Table Bussing, when the user says, “that’s not trash,” the robot correctly puts the bowl back down instead of putting it away.
例如,在 Table Bussing 中,当用户说 “那不是垃圾” 时,机器人会正确地把碗放回原位,而不是把它拿走
All images are from policy rollouts.
所有图像都来自策略试运行。
Table bussing involves cleaning up a table, placing dishes and utensils into a bussing bin and trash items into the trash.
Table bussing 包括清理桌子,把盘子和餐具放入洗碗箱,把垃圾放入垃圾桶。
The training data consists of full table cleaning episodes.
训练数据由完整的桌子清理回合组成。
This task is physically challenging because some items require nuanced grasping strategies (e.g., grasping a plate by the edge), the robot must pick up and singulate different objects, and in some cases might even manipulate some objects using others (e.g., picking up a plate with trash on it and tilting the plate to dump the trash into the trash bin).
这项任务在物理上具有挑战性,因为有些物品需要细致入微的抓取策略(例如,抓住盘子的边缘),机器人必须拿起并分拣(singulate)不同的物体,在某些情况下甚至可能借助一个物体来操控另一个物体(例如,拿起一个上面有垃圾的盘子,倾斜盘子将垃圾倒入垃圾箱)。
In our evaluation, the robot receives prompts that substantively alter the goal of the task, such as “can you clean up only the trash, but not dishes?”, “can you clean up only the dishes, but not trash?”, and “bus all the yellowish things”.
在我们的评估中,机器人收到的提示会大大改变任务的目标,比如“你能只清理垃圾而不清理盘子吗?”、“你能只清理盘子而不清理垃圾吗?”以及“把所有发黄的东西都清理掉”。
This requires the high-level model to reason about the task and each object (e.g., recognizing that reusable plastic cups are dishes, while paper cups are trash), then modify the robot’s "default” behavior of always putting away all items.
这需要高层级模型对任务和每个对象进行推理(例如,认识到可重复使用的塑料杯是盘子,而纸杯是垃圾),然后修改机器人总是把所有物品收起来的“默认”行为。
This includes understanding what to do and also what not to do (e.g., avoid touching dishes when asked to collect only trash).
这包括了解什么该做,什么不该做(例如,当被要求只收集垃圾时,避免触碰盘子)。
The robot might also receive contextual feedback during the task, such as “this is not trash”, “leave the rest”, or "leave it alone,” which require it to understand the interjection and respond accordingly.
在执行任务的过程中,机器人可能还会收到 contextual 反馈,比如“这不是垃圾”、“留下其余的” 或 “别管它”,这需要机器人理解插话并做出回应。
Sandwich making requires the robot to make a sandwich, using up to six ingredients as well as bread.
Sandwich making 要求机器人做一个三明治,使用多达 6 种材料和面包。
This task is physically difficult, because the robot has to manipulate deformable and delicate ingredients that have to be grasped carefully and placed precisely.
这项任务在物理上是困难的,因为机器人必须操作易变形和易碎的食材,这些食材必须小心地抓住并精确地放置。
The data contains examples of different types of sandwiches, with segment labels (e.g., “pick up one slice of bread”).
数据包含不同类型的三明治示例,并带有分段标签(例如,“拿起一片面包”)。
We use this task to evaluate complex prompts, such as “hi robot, can you make me a sandwich with cheese, roast beef, and lettuce?” or “can you make me a vegetarian sandwich? I’m allergic to pickles”, and live corrections, like “that’s all, no more”.
我们用这个任务来评估复杂的提示,比如"嗨,机器人,你能给我做一个有奶酪、烤牛肉和生菜的三明治吗?"或者"你能给我做一个素食三明治吗?我对泡菜过敏",还有实时修正,比如"就这些,不要再加了"。
Grocery shopping entails picking up a combination of requested items from a grocery shelf, placing them into a basket, and placing the basket on a nearby table.
Grocery shopping 需要从杂货店货架上拿起所需物品的组合,将它们放入篮子中,并将篮子放在附近的桌子上。
This task requires controlling a bimanual mobile manipulator (see Figure 4) and interpreting nuanced semantics that involve variable numbers of objects.
这项任务需要控制一个双臂移动机械手(参见图 4),并解析涉及可变数量对象的细微语义。
Examples of prompts include “hey robot, can you get me some chips? I’m preparing for a movie night”, “can you get me something sweet?”, “can you grab me something to drink?”, “hey robot, can you get me some Twix and Skittles?”, as well as interjections such as “I also want some Kitkat”.
提示的例子包括 “嘿,机器人,你能给我一些薯条吗?我在准备一个电影之夜”,“你能给我拿点甜的吗?”,“你能给我拿点喝的吗?”,“嘿,机器人,你能给我拿点 Twix 〔 巧克力品牌 〕和 Skittles 〔 彩虹糖品牌 〕吗?”,以及诸如“我还想要一些 Kitkat〔 巧克力品牌 〕 ” 之类的插话。
Comparisons and ablations. 比较和消融
Our comparisons evaluate our full method and a number of alternative approaches, which either employ a different type of high-level strategy, or do not utilize a hierarchical structure.
我们的比较评估了我们的完整方法和许多替代方法,这些方法要么采用不同类型的高层级策略,要么不使用分层结构。
These include:
Expert human high level:
This oracle baseline uses an expert human in place of the high-level model, who manually enters language commands for low-level behaviors that they believe are most likely to succeed at the task.
这个 oracle 基线使用一位人类专家来代替高层级模型,由其手动输入他们认为最有可能成功完成任务的低层级行为语言命令。
This allows us to understand how much performance is limited by the low-level policy, with ideal high-level commands.
这使我们能够了解:在给出理想高层级命令的情况下,性能在多大程度上受限于低层级策略。
GPT-4o high-level model:
This method uses the same high-level/low-level decomposition as Hi Robot, but queries the GPT-4o API-based model for the high level, while using the same low-level policy.
此方法使用与 Hi Robot 相同的高层级/低层级分解,但高层级通过 API 查询 GPT-4o 模型,同时使用相同的低层级策略。
GPT-4o is a significantly larger VLM than the one we use, but it is not finetuned with our real and synthetic datasets.
GPT-4o 是一个比我们使用的大得多的 VLM,但它没有根据我们的真实和合成数据集进行微调。
This comparison is similar to an advanced version of SayCan (Brohan et al., 2023b), which uses an out-of-the-box LLM as a high-level policy, while this baseline uses a VLM.
这种比较类似于 SayCan 的高级版本(Brohan 等人,2023b),它使用开箱即用的 LLM 作为高层级策略,而此基线使用 一个 VLM。
To align GPT-4o with the robot’s affordances, we carefully engineer the prompt to include task-relevant instructions that the low-level policy can follow, determined by ranking the most common skill labels in the human-annotated dataset, and ask GPT-4o to choose among them.
为了对齐 GPT-4o 与机器人的 affordances,我们仔细地设计了提示,包括低层级策略可以遵循的任务相关指令,通过对人类标注的数据集中 最常见的技能标签进行排序来确定,并要求 GPT-4o 从中进行选择。
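这个基线的提示构建大致可以按下面的思路实现(正文只说明了"对人类标注数据集中最常见的技能标签排序并让 GPT-4o 从中选择",具体措辞与 `top_k` 取值为笔者假设):

```python
# GPT-4o 高层级基线的提示构建示意(提示词措辞与 top_k 为笔者假设)
from collections import Counter

def build_baseline_prompt(skill_labels, user_prompt, top_k=30):
    """skill_labels: 人类标注数据集中出现的全部技能标签列表(可重复)。"""
    common = [label for label, _ in Counter(skill_labels).most_common(top_k)]
    options = "\n".join(f"- {label}" for label in common)
    return (
        "You are the high-level planner of a robot. Based on the attached camera images and the "
        f"user's request: '{user_prompt}', choose the single most appropriate next skill from the "
        f"following list (answer with the skill text only):\n{options}"
    )
```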
Flat VLA:
This comparison directly uses the same $\pi_0$ low-level policy as in Hi Robot, but without any high level or synthetic data, representing a state-of-the-art approach for instruction following (Black et al., 2024).
这个比较直接使用了与 Hi Robot 中相同的 $\pi_0$ 低层级策略,但没有任何高层级或合成数据,代表了最先进的指令遵循方法(Black et al., 2024)。
Flat VLA with synthetic data:
This ablation uses the $\pi_0$ low-level policy by itself, without a high-level model, but includes the synthetic data in the training data for the low-level policy, such that it can still process the complex prompts used in our evaluation.
这种消融只使用 $\pi_0$ 低层级策略本身,没有高层级模型,但是在低层级策略的训练数据中包含了合成数据,这样它仍然可以处理我们评估中使用的复杂提示。
This baseline allows us to evaluate the benefit of hierarchy independent from the effect of synthetic data.
这个基线允许我们独立于合成数据的影响来评估层次结构的好处。
Hi Robot without synthetic data:
This ablation corresponds to our method without synthetic training data, evaluating the importance of including diverse synthetically-generated prompts in training.
这种消融对应于我们没有合成训练数据的方法,评估了在训练中包括各种合成生成提示的重要性。
This ablation can be seen as an advanced VLM-based version of YAY Robot (Shi et al., 2024), a prior system that uses a high-level model to predict language commands for a low-level model.
这种消融可以看作是 YAY Robot 的基于 VLM 的高级版本(Shi et al., 2024),这是一个使用 高层级模型 来 预测 低层级模型的语言命令 的先前系统。
5.2. 指标和评估协议
We report two complementary metrics, measured by a human evaluator who is blind to the method being run.
我们报告两个互补的指标,由一位对所运行方法不知情(blind)的人类评估员测量。
Each evaluation consists of 20 trials per task per method.
每个评估包括每个任务和每种方法的 20 个试验。
Instruction Accuracy (IA).
This score measures how well the high-level policy’s predicted instruction aligns with human intent, requiring multi-modal understanding of the current environment and prompt.
这个分数衡量的是高层级策略预测的指令与人类意图的一致程度,需要对当前环境和提示进行多模态理解。
If the prediction from the high-level model is consistent with both the user’s command and the current observation, the evaluator marks it as a correct prediction; otherwise, it is labeled as incorrect.
如果高层级模型的预测 与 用户的命令和当前观测一致,评估员将其标记为正确的预测;否则,它将被标记为不正确。
The Instruction Accuracy for a trial is then computed as the proportion of correct predictions out of the total number of predictions.
一次试验的指令准确度是用正确预测数占预测总数的比例来计算的。
For flat baselines, which lack interpretable language predictions, scoring is based on the evaluator’s interpretation of the intent of the policy behavior.
对于缺乏可解释语言预测的 flat 基线,评分基于评估员对策略行为意图的解读。
Task Progress (TP).
Since all tasks we evaluate are complex and long-horizon, we record task progress to provide a granular view of task completion.
由于我们评估的所有任务都是复杂且 long-horizon,因此我们记录任务进度以提供任务完成情况的细粒度视图。
Task progress quantifies how closely the robot matches the intended goal and is computed by the proportion of objects that are successfully placed in their correct locations or configurations.
任务进度量化了机器人与预定目标的接近程度,并通过成功放置在正确位置或配置的物体的比例来计算。
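两个指标的计算本身很简单,按 5.2 节的定义可以写成(函数名与输入形式为笔者假设):

```python
# IA 与 TP 的计算示意(函数名与输入形式为笔者假设)
def instruction_accuracy(predictions):
    """predictions: 一次试验中每条高层级预测是否与用户命令及当前观测一致的布尔列表。"""
    return sum(predictions) / len(predictions)

def task_progress(num_correctly_placed, num_total_objects):
    """成功放置到正确位置/配置的物体,占全部目标物体的比例。"""
    return num_correctly_placed / num_total_objects

# 例如:一次试验中 5 条高层级预测有 4 条正确,8 个目标物体放对 6 个:
# instruction_accuracy([True, True, True, True, False]) -> 0.8;task_progress(6, 8) -> 0.75
```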
5.3 核心结果
We present results for our system and two key baselines: a GPT-4o policy and a flat VLA method.
我们介绍了我们的系统的结果和两个关键基线:GPT-4o 策略和 flat VLA 方法。
Quantitative and qualitative results are in Figure 5 and Figure 6, and we summarize our findings below.
定量和定性结果如图 5 和图 6 所示,我们在下面总结了我们的发现。
Figure 5: Comparisons to Prior Methods.
图 5:与先前方法的比较。
Hi Robot outperforms GPT-4o and flat VLA on Table Bussing, Sandwich Making, and Grocery Shopping.
Hi Robot 在餐桌收拾、三明治制作和杂货店购物方面的表现优于 GPT-4o 和 flat VLA。
Hi Robot averages over 40% higher instruction accuracy than GPT-4o, showing stronger alignment with user prompts and real-time observations, and approaches expert human guidance by leveraging its high-level policy.
Hi Robot 的平均指令准确率比 GPT-4o 高出 40% 以上,与用户提示和实时观测显示出更强的一致性,并通过利用其高层级策略接近人类专家指导。
Figure 6: Qualitative Command Comparisons.
图 6:定性命令比较。
GPT-4o often (a) misidentifies objects, (b) skips subtasks, or ( c) ignores user intent.
GPT-4o 经常(a) 误识物体,(b) 跳过子任务,或 ( c) 忽略用户意图。
Hi Robot consistently produces commands aligned with the robot’s ongoing actions and user requests.
Hi Robot 始终如一地生成与 机器人正在进行的动作和用户请求 相一致的命令。
Without synthetic data, the high-level policy aligns well with image observations but ignores user constraints.
如果没有合成数据,高层级策略可以很好地与图像观测保持一致,但忽略了用户约束。
(1) Hi Robot excels at open-ended instruction following.
Hi Robot 擅长遵循开放式指令。
Across all tasks, Hi Robot exhibits substantially higher Instruction Accuracy and Task Progress, compared to GPT-4o and the flat baseline.
在所有任务中,与 GPT-4o 和 flat 基线相比,Hi Robot 表现出更高的指令准确性和任务进度。
It properly identifies, picks up, and places the correct items - even when prompted to handle only certain objects or omit ingredients (e.g., “I’m allergic to pickles”).
它能正确地识别、拾取和放置正确的物品 —— 即使提示只要求处理某些物品或省略某些食材(例如,"我对泡菜过敏")。
In contrast, GPT-4o frequently loses context once physical interaction begins, issuing nonsensical commands (e.g., "pick up bermuda triangle”) or sometimes labeling everything as “plate” or “spoon,” which disrupts long-horizon planning.
相比之下,一旦物理交互开始,GPT-4o 就经常丢失上下文,发出无意义的命令(例如,"拿起百慕大三角"),或者有时把所有东西都标记为"盘子"或"勺子",这扰乱了 long-horizon 规划。
(2) Hi Robot shows strong situated reasoning and adaptation to feedback.
Hi Robot 表现出很强的 situated 推理能力和对反馈的适应能力。
When users modify requests mid-task(e.g., “leave the rest,” “I also want a KitKat”), Hi Robot updates low-level commands accordingly.
当用户在任务中修改请求时 (例如: “剩下的别管了”、“我还想要一块奇巧巧克力”),Hi Robot 会相应地更新低层级命令。
GPT-4o, however, often fails to maintain a coherent internal state, leading to commands like picking up new objects when the gripper is still occupied or prematurely switching tasks.
然而,GPT-4o 经常无法保持连贯的内部状态,导致诸如在夹爪仍被占用时就去拾取新物体、或过早切换任务之类的命令。
The flat baseline, on the other hand, does not react to real-time feedback.
另一方面,flat 基线不会对实时反馈做出反应。
(3) Hi Robot is effective across diverse tasks, robots, and user constraints.
Hi Robot 在各种任务、机器人和用户约束下均有效。
On single-arm, dual-arm, and mobile bimanual platforms, Hi Robot is able to handle distinct objects (from fragile cheese slices to tall bottles) while respecting dynamic constraints (e.g., "bus only yellowish items,” “don’t add tomatoes”).
在单臂、双臂和移动双臂平台上,Hi Robot 能够处理不同的物体(从易碎的奶酪片到高瓶),同时遵守动态约束(例如,“只处理淡黄色的物品”,“不要添加西红柿”)。
By contrast, the flat baseline and GPT-4o often revert to default behaviors (e.g., picking up every object in sight, or including almost all ingredients in a sandwich) when the prompt changes mid-episode.
相比之下,当提示在回合中发生变化时,flat 基线和 GPT-4o 通常会恢复到默认行为(例如,捡起视线中的每个物体,或者包括三明治中几乎所有的食材)。
(4) Expert human guidance reveals the low-level policy’s strengths but underscores the need for high-level reasoning.
人类专家的指导揭示了低层级策略的优势,但强调了高层级推理的必要性。
With human high-level instructions, the low-level policy executes nearly flawlessly, showing that failures stem more from reasoning than actuation.
有了人类的高层级指令,低层级策略几乎完美地执行,这表明失败更多地源于推理而不是驱动。
However, solely relying on human input is not scalable.
然而,仅仅依靠人类输入是不可扩展的。
Hi Robot bridges this gap via a high-level VLM that aligns with user prompts and real-time observations, whereas GPT-4o’s lack of physical grounding and the flat baseline’s lack of high-level reasoning hinder performance.
Hi Robot 通过与用户提示和实时观测对齐的高层级 VLM 弥补了这一差距,而 GPT-4o 缺乏 physical grounding 和 flat 基线 缺乏高层级推理 阻碍了性能。
5.4 消融研究
We conduct two key ablations to isolate the contributions of (1) synthetic data for high-level reasoning, and (2) hierarchical decomposition vs. a single "flat” policy.
我们进行了两个关键的消融,以单独考虑 (1) 用于高层级推理的合成数据的贡献,以及 (2) 分层分解 与 单一 “flat” 策略的对比。
(A) Synthetic data is critical for open-ended instruction following.
合成数据对于开放式指令遵循至关重要。
Comparing Hi Robot (trained on human-labeled + synthetic data) to a variant trained solely on human-labeled data shows that synthetic interactions significantly boost language flexibility (Figure 7).
将 Hi Robot(在人类标记+合成数据上训练)与仅在人类标记数据上训练的变体进行比较,可以发现合成交互显著提高了语言的灵活性(图 7)。
Without them, the ablated model ignores clarifications (e.g., "this is not trash”) or includes forbidden items (e.g., pickles), while Hi Robot smoothly adapts to such feedback, due to the broader coverage of compositional language in synthetic data.
如果没有它们,消融的模型就会忽略澄清(例如,“这不是垃圾”)或包含禁止的物品(例如,泡菜),而 Hi Robot 则会平滑地适应这种反馈,因为合成数据中组合语言的覆盖范围更广。
Figure 7: Ablation on synthetic data.
图 7:对合成数据的消融。
Synthetic data is essential for handling open-ended instructions, as the model trained without it struggles with user-driven deviations, failing to integrate clarifications and constraints, whereas Hi Robot adapts seamlessly by leveraging diverse, compositional language prompts. (IA = Instruction Accuracy, TP = Task Progress)
合成数据对于处理开放式指令至关重要:没有合成数据参与训练的模型难以应对用户驱动的偏离,无法整合澄清和约束,而 Hi Robot 通过利用多样化的组合语言提示无缝适应。(IA = 指令准确性,TP = 任务进度)
(B) Hierarchical structure outperforms a flat policy.
分层结构 优于 flat 策略。
We next compare Hi Robot to a flat policy trained on the same synthetic data but without a separate reasoning step (Figure 8).
接下来,我们将 Hi Robot 与在相同合成数据上训练但没有单独推理步骤的 flat 策略进行比较(图 8)。
The flat model often reverts to clearing all items or fails to handle partial instructions (“bus only the yellowish things”), whereas Hi Robot re-checks the prompt at each high-level step and responds coherently to mid-task updates.
flat 模型通常会恢复到清除所有物品或无法处理部分指令(“只收拾淡黄色的东西”),而 Hi Robot 则会在每个高层级步骤重新检查提示,并对任务中途的更新做出连贯的响应。
This suggests separating high-level reasoning from low-level control is beneficial for multi-step coherence and adapting to dynamic user inputs.
这表明将高层级推理与低层级控制分开有利于多步连贯和适应动态用户输入。
Figure 8: Hierarchical policy vs. flat policy.
图 8:分层策略与 flat 策略。
The hierarchical approach outperforms the flat variant trained on the same data, as it effectively integrates user feedback and partial instructions, whereas the flat model struggles with mid-task clarifications and nuanced task variations. (IA = Instruction Accuracy, TP = Task Progress)
分层方法优于在相同数据上训练的 flat 变体,因为它有效地整合了用户反馈和部分指令,而 flat 模型则在任务中期澄清和细微的任务变化中挣扎。(IA =指令准确性,TP =任务进度)
6. 讨论 和 未来工作
We presented Hi Robot, a system that uses vision-language models (VLMs) in a hierarchical structure, first reasoning over complex prompts, user feedback, and language interaction to deduce the most appropriate next step to fulfill the task, and then performing that step by directly outputting low-level action commands.
我们介绍了 Hi Robot,一个使用分层结构的视觉-语言模型(VLMs)的系统,首先对复杂的提示、用户反馈和语言交互进行推理,以推断出完成任务的最合适的下一步,然后通过直接输出低层级动作命令来执行该步骤。
Our system can be thought of as a VLM-based instantiation of the “System 1” and “System 2” architecture (Kahneman, 2011).
我们的系统可以被认为是"系统 1"和"系统 2"架构的一个基于 VLM 的实例(Kahneman, 2011)。
The deliberative “System 2” layer takes the form of a high-level VLM policy, which leverages semantic and visual knowledge from web-scale pre-training to reason through complex prompts and user interactions.
慎重的 “系统 2” 层采用 高层级 VLM 策略 的形式,利用 web 规模预训练的语义和视觉知识,通过复杂的提示和用户交互进行推理。
The physical, reactive "System 1" layer also takes the form of a VLM, trained to directly output robot actions in response to simpler commands that describe atomic behaviors.
物理的、反应性的 “系统 1” 层也采用 VLM 的形式,经过训练可以直接输出机器人动作,以响应描述 atomic 行为的简单命令。
The two VLMs have nearly identical architectures, with the only difference being that the low-level policy uses flow matching to output the actions.
这两个 VLMs 具有几乎相同的架构,唯一的区别是低层级策略使用流匹配来输出动作。
Indeed, the separation of roles at the model level is not fundamental to this design: a natural step for future work is to combine both systems into one model, and draw the “System 1” vs “System 2” distinction purely at inference time.
事实上,模型层面的角色分离并不是这种设计的根本:未来工作的一个自然步骤是将两个系统合并到一个模型中,而仅在推理时做出"系统 1"与"系统 2"的区分。
Future work could also interleave high-level and low-level processing more intricately - while our system simply runs high-level inference at a fixed but lower frequency, an adaptive system might simultaneously process inputs and language asynchronously at multiple different levels of abstraction, providing for a more flexible multi-level reasoning procedure.
未来的工作还可以更复杂地交错高层级和低层级处理 —— 虽然我们的系统只是以固定但较低的频率运行高层级推理,但自适应系统可能同时在多个不同的抽象层级异步处理输入和语言,提供更灵活的多层级推理过程。
Our system also has a number of limitations that could be studied in future work.
我们的系统也有一些局限性,可以在未来的工作中加以研究。
While we show that our high-level policy can often break down complex commands into low-level steps that the robot can perform physically, the training process for this high-level model relies on some amount of prompt engineering to produce synthetic training examples that induce this behavior.
虽然我们展示了高层级策略通常可以将复杂命令分解为机器人可以物理执行的低层级步骤,但这个高层级模型的训练过程依赖于一定程度的提示工程,以生成诱导这种行为的合成训练示例。
The training process decouples the high-level and low-level models, and they are not aware of one another’s capabilities except through the training examples.
训练过程将高层级模型和低层级模型解耦,它们只能通过训练示例来了解彼此的能力。
Coupling these two layers more directly,e.g. by allowing the high-level policy to be more aware of how successfully the low-level policy completes each command, would be an exciting direction for future work.
更直接地耦合这两层,例如:通过让高层级策略更加了解低层级策略如何成功地完成每个命令,这将是未来工作的一个令人兴奋的方向。
More generally, by instantiating both high-level and low-level reasoning via VLMs, we believe that this design opens the door for much more intricate integration of these components, such that future work might create robotic vision-language-action models that dynamically reason about inputs, feedback, and even their own capabilities to produce suitable situated response in complex open-world settings.
更一般地说,通过 由 VLMs 实例化高层级和低层级推理,我们相信这种设计为这些组件更复杂的整合打开了大门,这样未来的工作可能会创建机器人视觉-语言-动作模型,这些模型可以动态地推理输入、反馈,甚至它们自己的能力,以在复杂的开放世界环境中产生合适的 situated 响应。
致谢
We thank Ury Zhilinsky and Kevin Black for their help in setting up the data and training infrastructure.
我们感谢 Ury Zhilinsky 和 Kevin Black 在建立数据和训练基础设施方面的帮助。
We thank Karol Hausman for valuable feedback and discussions on video demonstration and language-following evaluation.
我们感谢 Karol Hausman 在视频演示和语言遵循评估方面提供的宝贵反馈和讨论。
We are also grateful to Noah Brown, Szymon Jakubczak, Adnan Esmail, Tim Jones, Mohith Mothukuri, and Devin LeBlanc for their support in robot maintenance.
我们也感谢 Noah Brown、Szymon Jakubczak、Adnan Esmail、Tim Jones、Mohith Mothukuri 和 Devin LeBlanc 在机器人维护方面的支持。
We appreciate Suraj Nair and Laura Smith for their insightful discussions that helped with policy debugging.
我们感谢 Suraj Nair 和 Laura Smith 的有深刻见解的讨论,这些讨论有助于策略调试。
We also thank Claudio Guglieri for help in creating visualizations used in this paper and on the project website.
我们还感谢 Claudio Guglieri 在本文和项目网站上帮助创建可视化。
Finally, we extend our deepest gratitude to the entire team of robot operators at Physical Intelligence for their immense contributions to data collection, annotation, and policy evaluations.
最后,我们向 Physical Intelligence 的整个机器人操作团队表示最深切的感谢,感谢他们在数据收集、标注和策略评估方面做出的巨大贡献。
A. 合成数据生成
A.1. 场景和响应分类
To ensure the quality and diversity of the synthetic data, we incorporate structured scenario classification and response categorization into the prompt design for $p^\text{gen}$, following (Stephan et al., 2024).
为了确保合成数据的质量和多样性,我们遵循 (Stephan et al., 2024),将结构化的场景分类和响应分类纳入 $p^\text{gen}$ 的提示设计中。
Specifically, we classify interactions into different scenario types, such as negative task (where the user instructs the robot what not to do), situated correction (where the user adjusts an earlier command based on the evolving task state), and specific constraint (where the user specifies particular constraints, such as dietary preferences).
具体来说,我们将交互分为不同的场景类型,例如否定式任务(用户指示机器人不要做什么),situated 修正(用户根据不断变化的任务状态调整先前的命令)和特定约束(用户指定特定约束,例如饮食偏好)。
In addition, we categorize the robot’s responses into types such as simple confirmations, clarifications, and error handling.
此外,我们将机器人的响应分为简单确认、澄清和错误处理等类型。
These classifications guide the generation process to ensure a broad range of user-robot interactions.
这些分类指导生成过程,以确保广泛的用户-机器人交互。
A.2. 情境 Grounding 的提示构建
In the prompt $\mathcal{P}$, we include a detailed description of the task (e.g., bussing a table, making a sandwich, grocery shopping) and instruct the model to ground responses in visual observations and prior context.
在提示 $\mathcal{P}$ 中,我们包含了任务的详细描述(例如,收拾桌子、做三明治、杂货店购物),并指导模型将响应 ground 在视觉观测和先前的情境中。
A key advantage of leveraging large pretrained VLMs is their ability to incorporate world knowledge when generating interactions.
利用大型预训练 VLMs 的一个关键优势是,它们能够在生成交互时整合世界知识。
For instance, the model can infer dietary constraints when generating prompts for sandwich-making, producing user commands such as “Can you make a sandwich for me? I’m lactose intolerant” and an appropriate robot response like “Sure, I won’t put cheese on it.”
例如,当生成做三明治的提示时,该模型可以推断饮食限制,生成用户命令,如“你能为我做一个三明治吗?我有乳糖不耐症”,还有一个合适的机器人响应,比如 “当然,我不会在上面放奶酪的。”
Similarly, it can reason over ambiguous or implicit requests, such as inferring that “I want something sweet” in a grocery shopping scenario should lead to suggestions like chocolate or candy.
类似地,它可以推理模棱两可或隐含的请求,例如在杂货店购物的场景中推断“我想要甜食”应该导致巧克力或糖果之类的建议。
To maintain consistency in multi-step tasks, we condition $p^\text{gen}$ on prior skill labels within an episode $\hat\ell_0,\cdots,\hat\ell_{t-1}$, allowing it to generate coherent user commands that account for past actions.
为了保持多步骤任务的一致性,我们将 $p^\text{gen}$ 以回合内先前的技能标签 $\hat\ell_0,\cdots,\hat\ell_{t-1}$ 为条件,让它生成考虑过去动作的连贯用户命令。
For instance, if the robot has already placed lettuce and tomato on a sandwich, the generated user prompt might request additional ingredients that logically follow.
例如,如果机器人已经在三明治上放了生菜和番茄,生成的用户提示可能会要求在逻辑上顺理成章的其它食材。
This ensures that the synthetic interactions reflect realistic task progression rather than isolated commands.
这确保了合成交互反映了实际的任务进度,而不是孤立的命令。
As such, we leverage $p^\text{gen}(\ell_t, u_t \mid \mathbf{I}_t^1,\cdots,\mathbf{I}_t^n,\hat\ell_0,\cdots,\hat\ell_{t-1},\hat\ell_t,\mathcal{P})$ to produce a richer, more diverse synthetic dataset $\mathcal{D}_\text{syn}$ that provides meaningful supervision for training our high-level policy.
因此,我们利用 $p^\text{gen}(\ell_t, u_t \mid \mathbf{I}_t^1,\cdots,\mathbf{I}_t^n,\hat\ell_0,\cdots,\hat\ell_{t-1},\hat\ell_t,\mathcal{P})$ 来生成更丰富、更多样化的合成数据集 $\mathcal{D}_\text{syn}$,为训练我们的高层级策略提供有意义的监督。
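结合该条件分布,附录 A.2 所述的提示构建大致可以示意如下(措辞为笔者假设,重点在于把任务描述 $\mathcal{P}$ 与回合内已完成的技能历史一并拼入提示):

```python
# 带回合历史条件的生成提示构建示意(提示词措辞为笔者假设)
def build_generation_prompt(task_description, prior_skills, target_skill):
    """task_description 对应提示 P,prior_skills 对应 \\hat{ℓ}_0..\\hat{ℓ}_{t-1},target_skill 对应 \\hat{ℓ}_t。"""
    history = "; ".join(prior_skills) if prior_skills else "(none yet)"
    return (
        f"Task context: {task_description}\n"
        f"Skills already performed in this episode: {history}\n"
        f"The robot will now perform: '{target_skill}'.\n"
        "Given the attached images, write a plausible user request that accounts for the past "
        "actions, plus a short robot reply. Keep both consistent with the scene."
    )
```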
While in this work we generate a separate $\mathcal{D}_\text{syn}$ and train a separate high-level policy for each task (e.g., sandwich making vs. table cleaning) for clarity and ease of benchmarking, the architecture is readily amenable to a unified multi-task formulation.
虽然在这项工作中,为了清晰和便于基准测试,我们为每个任务(例如,做三明治 与 收拾桌子)生成了单独的 $\mathcal{D}_\text{syn}$ 并训练了单独的高层级策略,但该架构很容易适应统一的多任务范式。
In principle, the same hierarchical approach could be used to train a single high-level policy across a multitude of tasks, facilitating knowledge transfer between task domains and more robust, open-ended robot behavior.
原则上,同样的分层方法可以用于跨多个任务训练单个高层级策略,促进任务域之间的知识迁移和更稳健、开放式的机器人行为。
B. 系统 和 机器人概述
Our system integrates speech-based interactions and real-time robotic control.
我们的系统整合了基于语音的交互和实时机器人控制。
Below, we detail the components of our system, including audio processing, GPU-based inference, and the robot configurations.
下面,我们详细介绍系统的组件,包括音频处理、基于 GPU 的推理和机器人配置。
B.1. 感知 和 语言处理
For speech-based interaction, we use a consumer-grade lavalier microphone for audio input.
对于基于语音的交互,我们使用消费级的领夹式(lavalier)麦克风进行音频输入。
Speech-to-text transcription is handled locally using Whisper large-v2 (Radford et al., 2023).
使用 Whisper large-v2 本地处理语音到文本的转录(Radford et al., 2023)。
For text-to-speech synthesis, we employ the Cartesia API to generate natural and expressive speech outputs.
对于文本到语音的合成,我们使用 Cartesia API 来生成自然且富有表现力的语音输出。
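语音前端的一个最小示意如下:语音转文本用的是 openai-whisper 包加载 large-v2 模型的标准用法;文本转语音仅留一个占位函数,因为论文使用的 Cartesia API 的具体调用方式未在文中给出:

```python
# 语音前端示意:本地 Whisper large-v2 转写用户语音;TTS 留作占位
import whisper

asr_model = whisper.load_model("large-v2")        # 本地加载 Whisper large-v2

def transcribe_user_speech(wav_path: str) -> str:
    """把录到的用户语音转写为文本,作为用户干预传给高层级策略。"""
    result = asr_model.transcribe(wav_path)
    return result["text"].strip()

def speak(text: str) -> None:
    """占位:把机器人话语 u_t 交给某个 TTS 服务播放(论文中为 Cartesia API,调用细节未给出)。"""
    raise NotImplementedError
```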
B.2. 推理硬件
To support real-time inference, we utilize one to two NVIDIA GeForce RTX 4090 consumer-grade GPUs.
为了支持实时推理,我们使用一到两个 NVIDIA GeForce RTX 4090 消费级 GPUs。
B.3. 机器人系统详述
We employ three different robot configurations with various manipulation and mobility capabilities.
我们采用三种不同的机器人配置,具有不同的操作和移动能力。
UR5e. This setup features a 6-DoF robotic arm equipped with a parallel jaw gripper.
这个装置的特点是一个 6 自由度的机械臂配备了一个平行夹爪。
It includes two cameras: a wrist-mounted camera and an over-the-shoulder camera.
它包括两个摄像头:一个腕部摄像头和一个肩部摄像头。
The system operates within a 7-dimensional configuration and action space.
该系统在一个 7 维配置和动作空间中运行。
Bimanual ARX. This configuration consists of two 6-DoF ARX arms.
该配置由两个 6-DoF 的 ARX 臂组成。
The system is equipped with three cameras: two wrist-mounted cameras and one base camera.
该系统配备了 3 个摄像头:两个腕部摄像头和一个底座摄像头。
The combined system has a 14-dimensional configuration and action space, enabling dextrous bimanual manipulation tasks.
该组合系统具有 14 维配置和动作空间,使灵巧的双手操作任务成为可能。
Mobile ARX. Built on the Mobile ALOHA (Fu et al., 2024) platform, this system integrates two 6-DoF ARX robotic arms mounted on a mobile base.
该系统基于移动 ALOHA (Fu et al., 2024)平台,集成了安装在移动底座上的两个 6-DoF ARX 机械臂。
The nonholonomic base introduces two additional action dimensions, resulting in a 14-dimensional configuration space and a 16-dimensional action space.
非完整底座引入了两个额外的动作维度,得到一个 14 维的配置空间和一个 16 维的动作空间。
Similar to the bimanual setup, it includes two wrist-mounted cameras and a base camera, providing robust visual feedback for navigation and manipulation.
与双手设置类似,它包括两个腕部摄像头和一个底座摄像头,为导航和操作提供强大的视觉反馈。