LeRobot SO-ARM100 Study Notes (1): ACT module

Preface

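While researching OpenVLA, I came across SmolVLA and its open-source LeRobot project, found it interesting, and spent some time trying it out. I will cover SmolVLA in a separate post. LeRobot is an open-source project from Hugging Face.
Repository: https://github.com/huggingface/lerobot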

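To avoid wasting time, I bought the kit from Seeed Studio. The 3D-printed parts are very rough and nearly did me in, but given the price the kit is still convenient for open-source research: everything comes prepared and it is relatively affordable. Below are the leader arm (black) and the follower arm (white) that I assembled.
[image]
My open-source project is at https://github.com/MexWayne/mexwayne_lerobot_0605. I hit many pitfalls along the way and have written up the problems I ran into together with tips. Feel free to get in touch.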

1 LeRobot

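1.1 LeRobot overview

As of June 30, LeRobot includes the ALOHA work and its associated ACT model.
For ACT, I highly recommend this CSDN article: https://blog.csdn.net/v_JULY_v/article/details/135454242
The overall structure of ACT is as follows:
[image]

The ACT configuration is as follows: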

    # Architecture.
    # Vision backbone.
    vision_backbone: str = "resnet18"
    pretrained_backbone_weights: str | None = "ResNet18_Weights.IMAGENET1K_V1"
    replace_final_stride_with_dilation: int = False
    # Transformer layers.
    pre_norm: bool = False
    dim_model: int = 512
    n_heads: int = 8
    dim_feedforward: int = 3200
    feedforward_activation: str = "relu"
    n_encoder_layers: int = 4
    # Note: Although the original ACT implementation has 7 for `n_decoder_layers`, there is a bug in the code
    # that means only the first layer is used. Here we match the original implementation by setting this to 1.
    # See this issue https://github.com/tonyzhaozh/act/issues/25#issue-2258740521.
    n_decoder_layers: int = 1
    # VAE.
    use_vae: bool = True
    latent_dim: int = 32
    n_vae_encoder_layers: int = 4

    # Inference.
    # Note: the value used in ACT when temporal ensembling is enabled is 0.01.
    temporal_ensemble_coeff: float | None = None

    # Training and loss computation.
    dropout: float = 0.1
    kl_weight: float = 10.0

    # Training preset
    optimizer_lr: float = 1e-5
    optimizer_weight_decay: float = 1e-4
    optimizer_lr_backbone: float = 1e-5

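1.2 LeRobot environment setup

The environment setup instructions on the official site are easy to follow and accurate; just execute them step by step.
The instructions use Miniconda, but Anaconda works just as well. Running the PushT example afterwards also works without problems: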

python -m lerobot.scripts.visualize_dataset \
    --repo-id lerobot/pusht \
    --episode-index 0

The website and the code do not list all of the datasets; I dug the following list out of the project's GitHub tracker:

repo_id                                 | v1.6   | v2.0  
--------------------------------------- | ------ | ------
lerobot/aloha_sim_insertion_human_image | 0.0036 | 0.0037
lerobot/aloha_sim_insertion_human       | 0.0029 | 0.0027
lerobot/pusht_image                     | 0.0003 | 0.0003
lerobot/pusht                           | 0.0011 | 0.0009
aliberts/koch_tutorial                  | 0.0111 | 0.0106
lerobot/aloha_mobile_cabinet            | 0.0104 | 0.0101
------------------------------------------------------------

For details, see: https://github.com/huggingface/lerobot/pull/461

1.3 LeRobot hardware setup

For assembly and wiring, see: https://wiki.seeedstudio.com/cn/lerobot_so100m/
Note that the USB port permissions need to be set to 777 rather than 666.
Then run:

python lerobot/scripts/control_robot.py \
  --robot.type=so100 \
  --robot.cameras='{}' \
  --control.type=teleoperate

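On the first run you will be asked to calibrate. The calibration files are shown in the screenshot below, and the linked guide explains the calibration procedure clearly.

[image]
Once calibration is done, the arms can be operated normally, with the leader arm controlling the follower arm.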

2 models

[image]
These are the models LeRobot has implemented so far.

2.1 ACT

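The core idea of ACT is to use a Transformer to encode the current observation (robot state and camera images) and then predict a chunk of future actions in parallel (for example, 5 future actions at once), which greatly improves inference efficiency on the robot. Note that the original ACT paper trains the policy as a conditional VAE, hence the name CVAE.
The code also carries a docstring (a very elegant one):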

    """Action Chunking Transformer: The underlying neural network for ACTPolicy.

    Note: In this code we use the terms `vae_encoder`, 'encoder', `decoder`. The meanings are as follows.
        - The `vae_encoder` is, as per the literature around variational auto-encoders (VAE), the part of the
          model that encodes the target data (a sequence of actions), and the condition (the robot
          joint-space).
        - A transformer with an `encoder` (not the VAE encoder) and `decoder` (not the VAE decoder) with
          cross-attention is used as the VAE decoder. For these terms, we drop the `vae_` prefix because we
          have an option to train this model without the variational objective (in which case we drop the
          `vae_encoder` altogether, and nothing about this model has anything to do with a VAE).

                                 Transformer
                                 Used alone for inference
                                 (acts as VAE decoder
                                  during training)
                                ┌───────────────────────┐
                                │             Outputs   │
                                │                ▲      │
                                │     ┌─────►┌───────┐  │
                   ┌──────┐     │     │      │Transf.│  │
                   │      │     │     ├─────►│decoder│  │
              ┌────┴────┐ │     │     │      │       │  │
              │         │ │     │ ┌───┴───┬─►│       │  │
              │ VAE     │ │     │ │       │  └───────┘  │
              │ encoder │ │     │ │Transf.│             │
              │         │ │     │ │encoder│             │
              └───▲─────┘ │     │ │       │             │
                  │       │     │ └▲──▲─▲─┘             │
                  │       │     │  │  │ │               │
                inputs    └─────┼──┘  │ image emb.      │
                                │    state emb.         │
                                └───────────────────────┘
    """

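2.1.1 ACTPolicy init

Next, let's walk through the ACT code.
The first thing we see is the ACT policy:
[image]
(1) ACTPolicy is the policy class for the whole ACT model. It inherits from PreTrainedPolicy, which means it supports common functionality such as loading and saving weights.
(2) config_class specifies the configuration structure it uses by default (ACTConfig).
(3) name = "act" is the key that the policy factory (make_policy) uses to identify this policy.

[image]
Here the inputs and the targets are normalized, which makes the loss easier to compute. Because the output ultimately has to become a robot action, the output is then unnormalized back to its original range.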
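
To make the normalize / unnormalize round trip concrete, here is a minimal sketch using mean/std statistics. The function names and the statistics are hypothetical and chosen only for illustration; this is not LeRobot's normalization code, just the idea behind it.

```python
def normalize(values, mean, std, eps=1e-8):
    """Map raw values to roughly zero mean and unit variance."""
    return [(v - m) / (s + eps) for v, m, s in zip(values, mean, std)]


def unnormalize(values, mean, std, eps=1e-8):
    """Invert normalize(): map normalized values back to the raw range."""
    return [v * (s + eps) + m for v, m, s in zip(values, mean, std)]


# Hypothetical per-joint statistics (for example, computed over a dataset).
mean = [10.0, -5.0, 0.5]
std = [2.0, 4.0, 0.25]

raw_action = [12.0, -9.0, 0.75]
normalized = normalize(raw_action, mean, std)   # the space the loss is computed in
restored = unnormalize(normalized, mean, std)   # the space the robot expects
```

The loss is computed on values like `normalized`, while anything sent to the robot has to look like `restored`.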

2.1.2 get_optim_params

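[image]
Line 77 in the screenshot uses temporal_ensemble_coeff, the decay factor for temporal ensembling. If temporal ensembling is enabled in the config (for evaluation or to make the policy more stable), an ACTTemporalEnsembler is created. It blends the actions that overlapping chunks predict for the same time step into a weighted average; this kind of temporal smoothing of predicted future actions is common in Diffusion/Transformer policies.

One detail here:
The parameters are split into a non-backbone group and a backbone group so that different parts of the model can use different learning rates, a very common training technique in deep learning.

Non-vision-backbone parameters (use the default lr):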

{
    "params": [
        p for n, p in self.named_parameters()
        if not n.startswith("model.backbone") and p.requires_grad
    ]
}

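Vision backbone (fine-tuned with its own lr, optimizer_lr_backbone; usually smaller, although the config above sets both to 1e-5):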

{
    "params": [
        p for n, p in self.named_parameters()
        if n.startswith("model.backbone") and p.requires_grad
    ],
    "lr": self.config.optimizer_lr_backbone
}
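
The two dictionaries above form a list of parameter groups, which is the shape PyTorch optimizers accept for per-group options. Below is a small dependency-free sketch of the split itself. The helper name `build_param_groups` and the parameter names are hypothetical, strings stand in for tensors, and the `requires_grad` filter is left out for brevity.

```python
def build_param_groups(named_params, backbone_prefix, backbone_lr):
    """Split (name, param) pairs into a default-lr group and a backbone group."""
    named_params = list(named_params)
    rest = [p for n, p in named_params if not n.startswith(backbone_prefix)]
    backbone = [p for n, p in named_params if n.startswith(backbone_prefix)]
    return [
        {"params": rest},                         # falls back to the optimizer's default lr
        {"params": backbone, "lr": backbone_lr},  # overrides lr for this group only
    ]


# Hypothetical parameter names; strings stand in for real tensors.
named = [
    ("model.backbone.conv1.weight", "w_conv1"),
    ("model.encoder.layers.0.linear1.weight", "w_enc"),
    ("model.action_head.weight", "w_head"),
]
groups = build_param_groups(named, "model.backbone", backbone_lr=1e-5)
```

With real tensors, a list like `groups` can be handed to an optimizer such as `torch.optim.AdamW(groups, lr=...)`, where the `lr` argument supplies the default for groups that do not set their own.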

2.1.3 select_action

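Continuing through the code: the red box is where temporal ensembling is used, which takes several sets of predicted actions into account.
If temporal ensembling is disabled, lines 134 to 142 simply output the actions in _action_queue one by one.
[image]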
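
The queue-based path (no temporal ensembling) can be sketched as follows. This is a toy under stated assumptions: `ChunkedActionSelector`, `fake_policy`, and the counter are hypothetical stand-ins, not the LeRobot implementation; the point is only that the model runs when the queue is empty and the queue is replayed otherwise.

```python
from collections import deque


class ChunkedActionSelector:
    """Toy selector: predict a chunk, then replay it one action per step."""

    def __init__(self, predict_chunk, n_action_steps):
        self.predict_chunk = predict_chunk   # hypothetical: observation -> list of actions
        self.n_action_steps = n_action_steps
        self.queue = deque(maxlen=n_action_steps)
        self.num_model_calls = 0

    def select_action(self, observation):
        # Only run the (expensive) model when the queue is empty.
        if len(self.queue) == 0:
            chunk = self.predict_chunk(observation)
            self.num_model_calls += 1
            self.queue.extend(chunk[: self.n_action_steps])
        return self.queue.popleft()


def fake_policy(observation):
    # Stand-in for the model: a chunk of 4 actions derived from the observation.
    return [observation * 10 + i for i in range(4)]


selector = ChunkedActionSelector(fake_policy, n_action_steps=3)
actions = [selector.select_action(obs) for obs in range(5)]
print(actions)                   # [0, 1, 2, 30, 31]
print(selector.num_model_calls)  # 2
```

Five control steps cost only two model calls here, which is where the inference savings of action chunking come from.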

2.1.4 forward

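Next is the forward function.
[image]
The forward function is straightforward:
(1) Normalize the input batch with the normalizers prepared in init.
(2) Feed the batch into the model. Here batch is a dict; the normalization steps write their results back into its observation and action entries.
For example:

batch = {
    "observation.state": Tensor,       # state vector fed to the model
    "observation.images.top": Tensor,  # image
    "action": Tensor,                  # ground-truth action (the training label)
    # ...
}

This line: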

actions_hat, (mu_hat, log_sigma_x2_hat) = self.model(batch)

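actions_hat is the predicted action sequence.
mu_hat and log_sigma_x2_hat are the parameters of the latent distribution, i.e. μ and log σ².

Variable           | Type                     | Meaning
------------------ | ------------------------ | ------------------------------------------------------------------------
`actions_hat`      | `Tensor (B, T, D)`       | Predicted action sequence; T is chunk_size, D is the action dimension
`mu_hat`           | `Tensor (B, latent_dim)` | Mean of the latent distribution output by the encoder
`log_sigma_x2_hat` | `Tensor (B, latent_dim)` | Log variance (log σ²) of the latent distribution output by the encoder

Both training and inference use these outputs:

Scenario                      | `actions_hat`                   | `(μ, logσ²)`
----------------------------- | ------------------------------- | ----------------------------------
Training (`forward()`)        | Used to compute the action loss | Used to compute the KL divergence
Inference (`select_action()`) | Used as the predicted actions   | The latent is usually not used

use_vae is enabled by default, so the loss is computed as follows: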

        loss_dict = {"l1_loss": l1_loss.item()}
        if self.config.use_vae:
            # Calculate Dₖₗ(latent_pdf || standard_normal). Note: After computing the KL-divergence for
            # each dimension independently, we sum over the latent dimension to get the total
            # KL-divergence per batch element, then take the mean over the batch.
            # (See App. B of https://arxiv.org/abs/1312.6114 for more details).
            mean_kld = (
                (-0.5 * (1 + log_sigma_x2_hat - mu_hat.pow(2) - (log_sigma_x2_hat).exp())).sum(-1).mean()
            )
            loss_dict["kld_loss"] = mean_kld.item()
            loss = l1_loss + mean_kld * self.config.kl_weight

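The comment says: See App. B of https://arxiv.org/abs/1312.6114 for more details. I skimmed the appendix, and its details match what the comment describes.
[image]
In other words, the code uses the closed-form KL divergence derived for this case.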
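
For a diagonal Gaussian against a standard normal, the closed form is $D_{KL} = -\frac{1}{2}\sum_j \left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$, which is what the `mean_kld` expression evaluates per batch element. Here is a small scalar sketch to sanity-check it; the function name is my own and the inputs are made up.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return sum(
        -0.5 * (1.0 + lv - m * m - math.exp(lv))
        for m, lv in zip(mu, log_var)
    )


# A latent that already equals the standard normal has zero divergence.
kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# Shifting the mean by 1 in one dimension costs 0.5 * 1^2 = 0.5.
kl_shifted = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])
```

The training loss then adds this term, scaled by kl_weight, to the L1 action loss.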

2.2 ACTTemporalEnsembler

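According to the code comment, this class implements the algorithm in the screenshot below; the paper is cited in the docstring:

      """Temporal ensembling as described in Algorithm 2 of https://arxiv.org/abs/2304.13705.

[image]

Algorithm step                                               | Intent                                               | Corresponding place in the LeRobot code
------------------------------------------------------------ | ---------------------------------------------------- | --------------------------------------------------------
① `π_θ` is the trained policy                                | Get ready for inference                              | `self.model(batch)`
② Initialize the buffers `B[0:T]`                            | A FIFO action buffer for each future time step       | `self._action_queue`, or the ensembler's internal state
③ Loop over each time step t                                 | Run inference / fetch an action once per frame       | `select_action()`
④ Predict future actions `â_{t:t+k}`                         | The transformer outputs a whole chunk at once        | `actions = self.model(batch)[0]`
⑤ Store the predicted actions in the matching buffers        | Needed for the ensemble step                         | `self.temporal_ensembler.update(actions)`
⑥ Get all candidate actions for this step, `A_t = B[t]`      | Weight the earlier predictions for this time step    | `A_t` is the action cache
⑦ Apply the weighted average $a_t = \sum w_i A_t[i] / \sum w_i$ | Temporal smoothing                                | `ensemble_weights` in `ACTTemporalEnsembler.update()`

2.2.1 init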

self.chunk_size = chunk_size
self.ensemble_weights = torch.exp(-temporal_ensemble_coeff * torch.arange(chunk_size))
self.ensemble_weights_cumsum = torch.cumsum(self.ensemble_weights, dim=0)
self.reset()

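In ACT, a single forward pass outputs 100 actions (one chunk), rather than one action per step as in a conventional policy. For example, at time step t = 0
the model outputs a chunk [a_0, a_1, …, a_99]; you execute only the first action a_0 and keep the other 99 in a buffer.

Variable                       | Meaning                                               | Purpose
------------------------------ | ----------------------------------------------------- | ------------------------------------------------------------------
`self.chunk_size`              | Length of each action chunk                           | Controls how many actions are handled
`self.ensemble_weights`        | Array of time weights, $w_i = \exp(-m \cdot i)$       | Weight of each position (older predictions get the larger weight)
`self.ensemble_weights_cumsum` | Cumulative sum of the weights                         | Used later for normalization
`self.reset()`                 | Resets the internal state                             | Initializes the caches (see below)

2.2.2 update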

        self.ensemble_weights = self.ensemble_weights.to(device=actions.device)
        self.ensemble_weights_cumsum = self.ensemble_weights_cumsum.to(device=actions.device)

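Note that besides the usual weights there is also ensemble_weights_cumsum.
ensemble_weights is the raw sequence of time weights, and ensemble_weights_cumsum is its running (cumulative) sum.
For example, with temporal_ensemble_coeff = 0.01 (values rounded):

ensemble_weights = [1.0, 0.9900, 0.9802, 0.9704]
ensemble_weights_cumsum = [1.0, 1.99, 2.97, 3.94]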
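
These numbers are easy to reproduce. The sketch below recomputes the weights and their cumulative sum with the standard library only; the coefficient 0.01 and the chunk size of 4 are assumptions picked for illustration.

```python
import math
from itertools import accumulate


def ensemble_weights(coeff, chunk_size):
    """w_i = exp(-coeff * i) for i = 0 .. chunk_size - 1."""
    return [math.exp(-coeff * i) for i in range(chunk_size)]


weights = ensemble_weights(coeff=0.01, chunk_size=4)
weights_cumsum = list(accumulate(weights))

print([round(w, 4) for w in weights])         # [1.0, 0.99, 0.9802, 0.9704]
print([round(c, 2) for c in weights_cumsum])  # [1.0, 1.99, 2.97, 3.94]
```

With a positive coefficient the weights decay with the index, so index 0 (the oldest prediction for a time step) counts the most.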

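If several chunks (say three) are fused over time, we get:

$\overline{a}=\frac{a_0w_0+a_1w_1+a_2w_2+\cdots}{w_0+w_1+w_2+\cdots}$

The code that follows therefore computes a weighted average, as in the screenshot.
On the first call of an episode, when nothing has been accumulated yet, the buffer is simply initialized with the predicted chunk: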

        if self.ensembled_actions is None:
            # Initializes `self._ensembled_action` to the sequence of actions predicted during the first
            # time step of the episode.
            self.ensembled_actions = actions.clone()
            # Note: The last dimension is unsqueeze to make sure we can broadcast properly for tensor
            # operations later.
            self.ensembled_actions_count = torch.ones(
                (self.chunk_size, 1), dtype=torch.long, device=self.ensembled_actions.device
            )

On later calls, the new chunk is blended into the running average:

            self.ensembled_actions *= self.ensemble_weights_cumsum[self.ensembled_actions_count - 1]
            self.ensembled_actions += actions[:, :-1] * self.ensemble_weights[self.ensembled_actions_count]
            self.ensembled_actions /= self.ensemble_weights_cumsum[self.ensembled_actions_count]

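These three lines do the following:
(1) multiply the running average by the previous cumulative weight, which recovers the weighted sum $\sum_i w_i a_i$;
(2) add the new prediction times its weight: $numerator += w_i * a_i$;
(3) divide by the updated cumulative weight to obtain $\overline{a}$.

Because ACT generates chunk_size actions each time but only one can be executed at the current step (here with chunk_size = 4):
First call to update(): outputs a₀ and keeps [a₁, a₂, a₃]
Second call to update(): outputs the blended a₁ and keeps [a₂, a₃, a₄], where a₄ is the last action of the newly predicted chunk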
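
Putting the pieces together, here is a scalar-action sketch of the online update. It mirrors the multiply / add / divide steps and the "consume one action per call" behaviour, but it is a simplified toy: the class name, the list-based buffers, and the choice of coeff = 0 (plain averaging, so the numbers are easy to check by hand) are my own assumptions.

```python
import math


class ToyTemporalEnsembler:
    """Scalar-action sketch of the online weighted average over overlapping chunks."""

    def __init__(self, coeff, chunk_size):
        self.weights = [math.exp(-coeff * i) for i in range(chunk_size)]
        self.cumsum = []
        total = 0.0
        for w in self.weights:
            total += w
            self.cumsum.append(total)
        self.avg = None    # running averages for the upcoming time steps
        self.count = None  # number of predictions already in each average

    def update(self, chunk):
        if self.avg is None:
            # First call of the episode: nothing to blend with yet.
            self.avg = list(chunk)
            self.count = [1] * len(chunk)
        else:
            for j in range(len(self.avg)):
                n = self.count[j]
                total = self.avg[j] * self.cumsum[n - 1]  # back to a weighted sum
                total += chunk[j] * self.weights[n]       # add the new prediction
                self.avg[j] = total / self.cumsum[n]      # renormalize
                self.count[j] = n + 1
            # The last action of the new chunk has no earlier prediction to blend with.
            self.avg.append(chunk[-1])
            self.count.append(1)
        # Consume the action for the current time step.
        self.count.pop(0)
        return self.avg.pop(0)


ens = ToyTemporalEnsembler(coeff=0.0, chunk_size=3)
first = ens.update([1.0, 2.0, 3.0])   # executes 1.0, keeps [2.0, 3.0]
second = ens.update([4.0, 5.0, 6.0])  # blends 2.0 with 4.0 and 3.0 with 5.0
print(first, second, ens.avg)         # 1.0 3.0 [4.0, 6.0]
```

Storing only the running average and a count per time step is what makes the cumulative-sum array useful: the weighted sum can be recovered without keeping every past prediction.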

2.3 ACT

2.3.1 init

Structure of the ACT module's init code

Part 1: VAE:

if self.config.use_vae:
    ...

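Its overall function is:
(1) It produces the (mu, log_sigma²) parameters of the latent distribution, i.e. the distribution $q(z \mid obs, action)$.
(2) Its input is [cls, robot_state, action_seq].
(3) Its output has size latent_dim × 2 (mean plus log variance).
(4) It is only enabled when use_vae=True (during training) and is not used at inference.

Here:

self.vae_encoder = ACTEncoder(config, is_vae_encoder=True)

defines the VAE encoder.

Here: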

            # Projection layer for joint-space configuration to hidden dimension.
            if self.config.robot_state_feature:
                self.vae_encoder_robot_state_input_proj = nn.Linear(
                    self.config.robot_state_feature.shape[0], config.dim_model
                )

defines how the robot state is projected.

Here:

            # Projection layer for action (joint-space target) to hidden dimension.
            self.vae_encoder_action_input_proj = nn.Linear(
                self.config.action_feature.shape[0],
                config.dim_model,
            )

defines how the action is projected.

Here:

            # Projection layer from the VAE encoder's output to the latent distribution's parameter space.
            self.vae_encoder_latent_output_proj = nn.Linear(config.dim_model, config.latent_dim * 2)

defines the projection from the VAE encoder output to the parameters of the latent distribution.
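
The factor of 2 in `config.latent_dim * 2` is there because one projected vector carries both the mean and the log variance. A minimal sketch of how such a vector could be split and sampled with the reparameterization trick follows; I am assuming the first half is the mean and the second half the log variance, and the helper names and numbers are hypothetical.

```python
import math
import random


def split_latent_params(projected, latent_dim):
    """Split a vector of size 2 * latent_dim into (mu, log_sigma_x2)."""
    assert len(projected) == 2 * latent_dim
    return projected[:latent_dim], projected[latent_dim:]


def sample_latent(mu, log_sigma_x2, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [
        m + math.exp(lv / 2.0) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_sigma_x2)
    ]


latent_dim = 2
projected = [0.5, -0.5, 0.0, -2.0]  # hypothetical output of the projection layer
mu, log_sigma_x2 = split_latent_params(projected, latent_dim)
z = sample_latent(mu, log_sigma_x2, random.Random(0))
```

Since exp(log σ² / 2) = σ, the sample has the requested mean and variance while the randomness stays in `eps`, which is what lets gradients flow through mu and log_sigma_x2.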

Here:

            # Fixed sinusoidal positional embedding for the input to the VAE encoder. Unsqueeze for batch
            # dimension.
            num_input_token_encoder = 1 + config.chunk_size
            if self.config.robot_state_feature:
                num_input_token_encoder += 1
            self.register_buffer(
                "vae_encoder_pos_enc",
                create_sinusoidal_pos_embedding(num_input_token_encoder, config.dim_model).unsqueeze(0),
            )

This uses a fixed sinusoidal positional encoding.
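
For intuition, here is the textbook sinusoidal table (sin on even columns, cos on odd ones) together with the token count from the snippet above. This is a sketch of the standard formula, not a copy of `create_sinusoidal_pos_embedding`, whose details may differ; the function name and the tiny `dim=8` are my own choices.

```python
import math


def sinusoidal_pos_embedding(num_positions, dim):
    """Return a num_positions x dim table: sin on even columns, cos on odd columns."""
    table = []
    for pos in range(num_positions):
        row = []
        for col in range(dim):
            angle = pos / (10000 ** (2 * (col // 2) / dim))
            row.append(math.sin(angle) if col % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


chunk_size = 100
has_robot_state = True
# One [cls] token, optionally one robot-state token, then one token per action.
num_tokens = 1 + chunk_size + (1 if has_robot_state else 0)
pos_enc = sinusoidal_pos_embedding(num_tokens, dim=8)
print(num_tokens, len(pos_enc), len(pos_enc[0]))  # 102 102 8
```

Because the table is fixed rather than learned, it makes sense to store it as a buffer instead of a parameter.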

Part 2: CNN backbone:
Note the earlier configuration:
[image]
So this part of the code just prepares the CNN.

Part 3: the classic encoder and decoder

        # Transformer (acts as VAE decoder when training with the variational objective).
        self.encoder = ACTEncoder(config)
        self.decoder = ACTDecoder(config)

Part 4: transformer inputs

        # Transformer encoder input projections. The tokens will be structured like
        # [latent, (robot_state), (env_state), (image_feature_map_pixels)].
        if self.config.robot_state_feature:
            self.encoder_robot_state_input_proj = nn.Linear(
                self.config.robot_state_feature.shape[0], config.dim_model
            )
        if self.config.env_state_feature:
            self.encoder_env_state_input_proj = nn.Linear(
                self.config.env_state_feature.shape[0], config.dim_model
            )
        self.encoder_latent_input_proj = nn.Linear(config.latent_dim, config.dim_model)
        if self.config.image_features:
            self.encoder_img_feat_input_proj = nn.Conv2d(
                backbone_model.fc.in_features, config.dim_model, kernel_size=1
            )
        # Transformer encoder positional embeddings.
        n_1d_tokens = 1  # for the latent
        if self.config.robot_state_feature:
            n_1d_tokens += 1
        if self.config.env_state_feature:
            n_1d_tokens += 1
        self.encoder_1d_feature_pos_embed = nn.Embedding(n_1d_tokens, config.dim_model)
        if self.config.image_features:
            self.encoder_cam_feat_pos_embed = ACTSinusoidalPositionEmbedding2d(config.dim_model // 2)

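Note that these are the inputs of the transformer, not of the VAE encoder. They include:
(1) the state (robot state and env state),
(2) the latent,
(3) the positional embeddings,
and, when image features are configured, the projected image feature map.
This matches the token layout given in the code comment: [latent, (robot_state), (env_state), (image_feature_map_pixels)].

Part 5: decoder positional embedding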

        # Transformer decoder.
        # Learnable positional embedding for the transformer's decoder (in the style of DETR object queries).
        self.decoder_pos_embed = nn.Embedding(config.chunk_size, config.dim_model)

Part 6: output head:

        # Final action regression head on the output of the transformer's decoder.
        self.action_head = nn.Linear(config.dim_model, self.config.action_feature.shape[0])

It maps each output token to one action.

2.3.2 forward

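forward has a training branch and an inference branch: during training the VAE encoder is active; otherwise only the encoder-decoder path runs.
[image]
(1) For ease of drawing, the diagram lists the outputs in the order log_sigma_x2, mu, robot action; the actual return order is actions, (mu, log_sigma_x2).
(2) The CNN is ResNet-18.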

2.3.3 ACTEncoder

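[image]
I specifically went and looked at Tony Zhao's ACT.

Its configuration is 6 & 6, whereas ACT in LeRobot uses 4 encoder layers and 1 decoder layer. The authors also explain this choice (if you are interested, try the 6 & 6 setting).

[image]

[image]
The code here is fairly clear: the 4 encoder layers are executed one after another.

2.3.4 ACTEncoderLayer

[image]
(1) MLP: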

x = self.linear2(self.dropout(self.activation(self.linear1(x))))

This is the feedforward structure of the ACT model. In a Transformer, MLP (multi-layer perceptron) and Feed Forward Network (FFN) mean the same thing.

(2) pre_norm:

   pre_norm: bool = False

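If pre_norm is True, the code path is the one in the red box:

That is, the residual (skip) path does not pass through normalization (LayerNorm).

If pre_norm is False, the code is:
[image]
That is, the residual (skip) path also passes through LayerNorm, so some information is lost and gradients can explode more easily.

pre_norm is False in the config. In theory pre_norm is there to keep a very deep network from suffering gradient explosion and becoming hard to train, but there are only 4 layers here, so defaulting to False is acceptable.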
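
The difference between the two orderings is easiest to see with a sublayer that contributes nothing. The sketch below uses a toy LayerNorm without learned scale and shift and a zero sublayer; all names are hypothetical and this is not the LeRobot layer code, only the ordering it switches between.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_block(x, sublayer, pre_norm):
    if pre_norm:
        # Pre-norm: normalize only the sublayer input; the skip path carries x unchanged.
        return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]
    # Post-norm: add first, then normalize the sum, so the skip path is normalized too.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def zero_sublayer(x):
    # Stand-in for attention or feed-forward that contributes nothing.
    return [0.0 for _ in x]


x = [1.0, 2.0, 6.0]
pre = residual_block(x, zero_sublayer, pre_norm=True)    # x passes through untouched
post = residual_block(x, zero_sublayer, pre_norm=False)  # x comes out normalized
```

With pre-norm the input reaches the output unchanged through the skip connection, whereas with post-norm even an identity block rescales it.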

2.3.5 ACTDecoder & ACTDecoderLayer

As with the encoder, there is a pre_norm distinction.
One idiom worth noting:

        x = self.multihead_attn(
            query=self.maybe_add_pos_embed(x, decoder_pos_embed),
            key=self.maybe_add_pos_embed(encoder_out, encoder_pos_embed),
            value=encoder_out,
        )[0]

is the same as the following; the two differ only in the name of the result variable:

        output,_ = self.multihead_attn(
            query=self.maybe_add_pos_embed(x, decoder_pos_embed),
            key=self.maybe_add_pos_embed(encoder_out, encoder_pos_embed),
            value=encoder_out,
        )
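
Both forms work because the attention call returns a pair (the output and the attention weights), and only the first element is needed. A tiny stand-in shows the equivalence; `fake_attention` is a hypothetical placeholder, not the real attention module.

```python
def fake_attention(query, key, value):
    """Stand-in returning (output, weights), like a two-element attention result."""
    return "attn_output", "attn_weights"


by_index = fake_attention(query="q", key="k", value="v")[0]
by_unpack, _ = fake_attention(query="q", key="k", value="v")
print(by_index == by_unpack)  # True
```

Indexing with `[0]` keeps the call on one expression, while unpacking makes it explicit that a second return value is being discarded.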
