🧠 VLA Models: A Deep Dive¶

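⚠️ Timeliness note: this chapter covers frontier models, pricing, leaderboards, and similar details that can change quickly between releases. Treat the original papers, official release pages, and API documentation as authoritative.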

Study time: 5 hours | Difficulty: ⭐⭐⭐⭐ upper-intermediate | Prerequisites: Transformers, multimodal large-model fundamentals

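Chapter Objectives¶

Understand the design philosophy behind VLA (Vision-Language-Action) models
Learn the architectural details of RT-2, Octo, OpenVLA, and π0
Understand action tokenization and Diffusion Policy
Learn how to collect data for, and fine-tune, VLA models
Survey recent work: GR-2, RDT-1B, EmbodiedGPT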

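1. The Evolution of VLA Models¶

1.1 From Task-Specific to General-Purpose¶

Text Only

Generation 1: task-specific models (2015-2020)
  - A separate CNN+MLP is trained for each task
  - No generalization to new tasks or new objects

Generation 2: language-conditioned models (2020-2022)
  - Language instructions are fed in as a conditioning input
  - CLIPort, SayCan, BC-Z
  - Still require large amounts of robot data

Generation 3: foundation VLA models (2023-2024)
  - Reuse the vision-language understanding of a pretrained VLM
  - RT-2, Octo, OpenVLA
  - Can generalize zero-shot to new objects and new scenes

Generation 4: diffusion VLAs + world models (2024-2026)
  - π0, GR-2, RDT-1B
  - Generate continuous actions with diffusion models
  - Begin to predict how the physical world will evolve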

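1.2 VLA vs VLM¶

Text Only

VLM (vision-language model):
  Input: image + text → Output: text
  Examples: GPT-4V, LLaVA, Qwen-VL

VLA (vision-language-action model):
  Input: image + language instruction → Output: robot action
  Examples: RT-2, OpenVLA, π0

  Key challenge: how do we ground the VLM's semantic understanding in physical actions?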

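2. Core Models in Detail¶

2.1 RT-2 (Robotic Transformer 2)¶

Text Only

Architecture: PaLI-X (55B) or PaLM-E (12B) as the backbone

  Input: [image tokens] + [language instruction tokens]
  Output: [action tokens] (continuous actions discretized into tokens)

Action representation:
  7-DoF action = [Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper]
  Each dimension is quantized into 256 bins → 7 tokens
  Illustrative example: a 5 cm forward move might map to token_137 (bin 137 along x);
  the actual bin depends on how the action range is normalized

Key innovations:
  1. Actions are learned by the VLM as "just another language"
  2. Knowledge from web-scale pretraining helps with unseen concepts
     → "Move the object near Taylor Swift to the trash can"
     → The model knows who Taylor Swift is (from web knowledge) and can locate her photo
  3. Chain-of-thought reasoning: output the reasoning first, then the action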

Python

# RT-2 structure in pseudocode (`vlm_backbone` is a stand-in, not a released API)
import torch
import torch.nn as nn


class RT2(nn.Module):  # subclass nn.Module to define the network
    def __init__(self, vlm_backbone, action_bins=256, action_dim=7):
        super().__init__()  # initialize the nn.Module base class
        self.vlm = vlm_backbone  # PaLI-X / PaLM-E
        self.action_bins = action_bins
        self.action_dim = action_dim

        # Uniformly quantize the continuous action space:
        # each dimension in [-1, 1] is split into 256 bins.
        # Registered as a buffer so it follows the module across devices.
        self.register_buffer("bin_edges", torch.linspace(-1, 1, action_bins + 1))

    def tokenize_action(self, continuous_action):
        """Continuous action -> discrete tokens"""
        # Bucketizing against the upper bin edges yields the index of the bin
        # containing each value; this handles all action dimensions at once.
        bin_idx = torch.bucketize(continuous_action, self.bin_edges[1:])
        return bin_idx.clamp(0, self.action_bins - 1)  # (B, 7)

    def detokenize_action(self, tokens):
        """Discrete tokens -> continuous action (bin centers)"""
        bin_centers = (self.bin_edges[:-1] + self.bin_edges[1:]) / 2
        return bin_centers[tokens]  # (B, 7)

    def forward(self, images, text_instruction):
        """
        images: (B, T, C, H, W) history of image frames
        text_instruction: language instruction
        """
        # Image tokens + text tokens go through the VLM
        output_tokens = self.vlm(images, text_instruction)

        # The last 7 tokens are the action. This assumes the backbone returns
        # bin indices in [0, action_bins); a real model would first map
        # vocabulary ids to bin indices.
        action_tokens = output_tokens[:, -self.action_dim:]
        action = self.detokenize_action(action_tokens)

        return action  # (B, 7)
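
To make the 256-bin scheme concrete, here is a small pure-Python sketch of the round trip for a single action dimension. It assumes the value is already normalized to [-1, 1]; the helper names `tokenize`, `detokenize`, and `max_round_trip_error` are illustrative and not taken from RT-2. With 256 uniform bins the bin width is 2/256, so decoding to the bin center can be off by at most half a bin, 1/256 ≈ 0.0039 in normalized units.

```python
NUM_BINS = 256
LOW, HIGH = -1.0, 1.0
BIN_WIDTH = (HIGH - LOW) / NUM_BINS


def tokenize(value: float) -> int:
    """Map a normalized action value to a bin index in [0, NUM_BINS - 1]."""
    clipped = min(max(value, LOW), HIGH)
    index = int((clipped - LOW) / BIN_WIDTH)
    return min(index, NUM_BINS - 1)  # value == HIGH falls into the last bin


def detokenize(index: int) -> float:
    """Map a bin index back to the center of its bin."""
    return LOW + (index + 0.5) * BIN_WIDTH


def max_round_trip_error(samples: int = 10_001) -> float:
    """Largest |value - detokenize(tokenize(value))| over an even grid."""
    worst = 0.0
    for i in range(samples):
        value = LOW + (HIGH - LOW) * i / (samples - 1)
        worst = max(worst, abs(value - detokenize(tokenize(value))))
    return worst
```

Whether an error of this size matters depends on the physical range each dimension is normalized from: half a bin over a ±10 cm range is about 0.4 mm, while over a ±1 m range it is about 4 mm.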

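2.2 OpenVLA (the open-source VLA workhorse)¶

Text Only

Architecture: Llama-2 7B + dual vision encoders (DINOv2 and SigLIP)
  - Llama-2: language backbone (7B parameters)
  - DINOv2: self-supervised visual features (strong at spatial understanding)
  - SigLIP: vision-language aligned features (strong at semantic understanding)

Training data: Open X-Embodiment (970K robot trajectories, 22 robot embodiments)

Highlights:
  1. Fully open (weights + code + data)
  2. The two vision encoders complement each other: spatial precision + semantic understanding
  3. At 7B parameters it runs inference on a single GPU
     (the paper reports roughly 6 Hz on an RTX 4090 in bfloat16)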
Python

# OpenVLA inference example
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch

class OpenVLAInference:
    """Thin wrapper around OpenVLA inference"""

    def __init__(self, model_path="openvla/openvla-7b"):
        self.processor = AutoProcessor.from_pretrained(
            model_path, trust_remote_code=True
        )
        self.model = AutoModelForVision2Seq.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16,
            trust_remote_code=True
        ).cuda()

    def predict_action(self, image: Image.Image, instruction: str):
        """
        Takes one image frame and a language instruction, returns a 7-DoF action
        """
        # Build the prompt
        prompt = f"In: What action should the robot take to {instruction}?\nOut:"

        # Encode the inputs and move them to the GPU in bfloat16
        inputs = self.processor(prompt, image).to("cuda", dtype=torch.bfloat16)

        # Generate action tokens and un-normalize them with the statistics
        # of the dataset selected by unnorm_key
        action = self.model.predict_action(**inputs, unnorm_key="bridge_orig")

        # action: [Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper_open]
        return action

    def run_episode(self, env, instruction, max_steps=200):
        """Run one full episode.

        `env` is assumed to follow the classic Gym API: reset() returns an
        observation dict with an 'image' array and step() returns a 4-tuple.
        """
        obs = env.reset()

        for step in range(max_steps):
            image = Image.fromarray(obs['image'])
            action = self.predict_action(image, instruction)

            obs, reward, done, info = env.step(action)

            if done:
                return info.get('success', False)

        return False
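
The `unnorm_key` argument in the inference code selects per-dataset action statistics that map the model's normalized output back to robot units. The sketch below shows that kind of rescaling, assuming each dimension was normalized to [-1, 1] using a lower and an upper percentile bound. The statistics are made-up placeholders (not values from `bridge_orig`), and `unnormalize` is a hypothetical helper, not the library function.

```python
from typing import Sequence


def unnormalize(normalized: Sequence[float],
                low: Sequence[float],
                high: Sequence[float]) -> list[float]:
    """Map each value from [-1, 1] back to [low, high] for its dimension."""
    if not (len(normalized) == len(low) == len(high)):
        raise ValueError("all inputs must have the same length")
    return [0.5 * (a + 1.0) * (hi - lo) + lo
            for a, lo, hi in zip(normalized, low, high)]


# Placeholder statistics for a 3-dimensional translation action (meters).
LOW_BOUND = [-0.02, -0.02, -0.01]
HIGH_BOUND = [0.02, 0.02, 0.01]

# +1 maps to the upper bound, -1 to the lower bound, 0 to the midpoint.
action = unnormalize([1.0, -1.0, 0.0], LOW_BOUND, HIGH_BOUND)
```

Because the bounds come from the training data, picking a key that does not match the target robot's action scale can silently produce motions that are too large or too small.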

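2.3 π0 (diffusion-style VLA, Physical Intelligence)¶

Text Only

Core innovation: replace autoregressive token prediction with a diffusion-style generative action head

  Why use diffusion?
  - Autoregressive (RT-2/OpenVLA): actions are quantized into discrete bins (256 levels), which costs precision
  - Diffusion-style (π0): continuous actions are generated directly, so there is no quantization error
  - Diffusion naturally fits multimodal distributions (one instruction can have several reasonable actions)

  Architecture:
  - Vision: SigLIP
  - Language: Gemma (Google's open LLM), paired with SigLIP in the PaliGemma VLM
  - Action: flow matching, which typically needs fewer sampling steps than DDPM

  Training data: 10,000+ hours of real-robot data (one of the largest robot demonstration datasets at release)

  Results: dexterous tasks such as folding laundry and clearing tables; the harder tasks rely on task-specific fine-tuning rather than pure zero-shot execution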

Python

import torch
import torch.nn as nn

class FlowMatchingActionHead(nn.Module):
    """
    The idea at the core of π0: a flow-matching action head
    (simplified here to an MLP; π0 itself uses a transformer action expert)

    Basic idea:
    - Training: learn a vector field that transports noise to the action distribution
    - Inference: start from noise and integrate along the vector field to get actions

    Compared with DDPM: flow matching needs only ~10 integration steps here
    (DDPM samplers commonly use 50-100 denoising steps)
    """

    def __init__(self, context_dim, action_dim, action_horizon=16):
        super().__init__()
        self.action_dim = action_dim        # per-step action dimension (7)
        self.action_horizon = action_horizon # predict the next 16 actions
        flat_dim = action_dim * action_horizon

        # Denoising network: predicts the vector field v(x_t, t, context)
        self.denoiser = nn.Sequential(
            nn.Linear(flat_dim + context_dim + 1, 512),
            nn.SiLU(),
            nn.Linear(512, 512),
            nn.SiLU(),
            nn.Linear(512, 512),
            nn.SiLU(),
            nn.Linear(512, flat_dim)
        )

    def forward(self, noisy_actions, timestep, context):
        """Predict the vector field v"""
        t_embed = timestep.unsqueeze(-1)  # (B,) -> (B, 1)
        # Flatten (B, horizon, action_dim) to (B, horizon * action_dim) and
        # concatenate with the context and the time value along the last dim
        x = torch.cat([noisy_actions.flatten(-2), context, t_embed], dim=-1)
        return self.denoiser(x).view_as(noisy_actions)

    def training_loss(self, actions, context):
        """
        Flow-matching training loss
        x_0: noise (sampled from a standard normal)
        x_1: ground-truth actions
        x_t = (1-t)*x_0 + t*x_1  (linear interpolation)
        v_target = x_1 - x_0     (target vector field)
        """
        B = actions.shape[0]
        x_1 = actions  # ground-truth actions
        x_0 = torch.randn_like(x_1)  # noise

        # Random time in [0, 1) per sample
        t = torch.rand(B, device=actions.device)

        # Linear interpolation; (B, 1, 1) broadcasts over horizon and action dims
        t_expand = t.view(B, 1, 1)
        x_t = (1 - t_expand) * x_0 + t_expand * x_1

        # Target vector field
        v_target = x_1 - x_0

        # Predicted vector field
        v_pred = self.forward(x_t, t, context)

        # MSE loss
        loss = ((v_pred - v_target) ** 2).mean()
        return loss

    @torch.no_grad()  # no gradients needed at inference time
    def sample(self, context, num_steps=10):
        """
        Inference: generate actions by ODE integration
        x_0 → x_1 (10 Euler steps)
        """
        B = context.shape[0]
        x = torch.randn(B, self.action_horizon, self.action_dim,
                         device=context.device)

        dt = 1.0 / num_steps
        for i in range(num_steps):
            t = torch.full((B,), i * dt, device=context.device)
            v = self.forward(x, t, context)
            x = x + v * dt

        return x  # (B, horizon, action_dim)
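
A closed-form case helps check the sign and time conventions used above (t = 0 is noise, t = 1 is data). If every training action were the same point a, then x_t = (1-t)*x_0 + t*a, so the velocity that minimizes the flow-matching loss is v(x, t) = (a - x) / (1 - t), and Euler integration should land on a from any starting noise. The pure-Python sketch below verifies this. It is a toy check with hypothetical helper names, not part of π0.

```python
import random


def ideal_velocity(x: float, t: float, target: float) -> float:
    """Optimal flow-matching velocity when all data equals `target`."""
    return (target - x) / (1.0 - t)


def euler_sample(x0: float, target: float, num_steps: int = 10) -> float:
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed steps."""
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # evaluated at the start of each step, so t < 1 always
        x = x + ideal_velocity(x, t, target) * dt
    return x


random.seed(0)
samples = [euler_sample(random.gauss(0.0, 1.0), target=0.3) for _ in range(5)]
```

Every entry of `samples` ends at the target up to floating-point error, because the final Euler step (at t = 1 - dt) moves the point the full remaining distance. With real, spread-out data the learned field is no longer this simple, and the number of integration steps becomes a speed/accuracy trade-off.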

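3. Diffusion Policy¶

3.1 Why Does Diffusion Suit Robotics?¶

Text Only

Problem: human demonstration data is multimodal
  - For the same instruction, "put the cup on the shelf", different people choose different spots
  - Plain MSE regression → averages the demonstrations → the action lands between the modes (not a valid action)
  - Diffusion model → can sample different modes → each rollout commits to one reasonable plan

Comparison:
  MSE regression:     ▓░░░░░░░░░░░▓  ← the average of the two modes (invalid action)
  Diffusion sampling: ▓            or            ▓  ← a sample lands on one of the modes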
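
The averaging effect can be reproduced with a few lines of standard-library Python. The sketch uses a made-up 1-D placement task with two demonstration modes at -1 and +1; drawing a value from the dataset stands in for a trained generative policy, which is an assumption made purely for illustration.

```python
import random
import statistics

random.seed(0)

# Two equally valid demonstration modes for a 1-D placement target:
# demonstrators put the cup near -1.0 or near +1.0 (plus small noise).
demos = [random.choice([-1.0, 1.0]) + random.gauss(0.0, 0.05) for _ in range(1000)]

# The constant prediction that minimizes mean squared error is the sample mean.
mse_prediction = statistics.fmean(demos)

# A sampler that matches the data distribution returns one demonstration-like
# value per call; here a random draw from the dataset plays that role.
sampled_action = random.choice(demos)


def distance_to_nearest_mode(x: float) -> float:
    """Distance from x to the closer of the two mode centers."""
    return min(abs(x - 1.0), abs(x + 1.0))
```

The MSE prediction sits near 0, about one unit away from either valid placement, while the sampled action stays within the noise band of one mode.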

3.2 Implementing Diffusion Policy¶

Python

import torch
import torch.nn as nn


class DiffusionPolicy(nn.Module):
    """
    Diffusion Policy (Chi et al., RSS 2023)
    A conditional diffusion model for behavior cloning
    """

    def __init__(self, obs_encoder, noise_pred_net,
                 obs_horizon=2, pred_horizon=16, action_dim=7,
                 num_diffusion_steps=100):
        super().__init__()
        self.obs_encoder = obs_encoder
        self.noise_net = noise_pred_net
        self.obs_horizon = obs_horizon
        self.pred_horizon = pred_horizon
        self.action_dim = action_dim
        self.num_steps = num_diffusion_steps

        # DDPM noise schedule. The linear 1e-4..0.02 range was designed for
        # 1000 steps, so it is rescaled by 1000/T here; without rescaling,
        # T=100 leaves alphas_cumprod[-1] around 0.36 and the last training
        # step is far from the pure noise that sampling starts from.
        scale = 1000.0 / num_diffusion_steps
        betas = torch.linspace(1e-4 * scale, 0.02 * scale,
                               num_diffusion_steps).clamp(max=0.999)
        alphas = 1.0 - betas
        alphas_cumprod = torch.cumprod(alphas, dim=0)
        self.register_buffer('alphas_cumprod', alphas_cumprod)
        self.register_buffer('sqrt_alphas_cumprod', alphas_cumprod.sqrt())
        self.register_buffer('sqrt_one_minus_alphas_cumprod',
                            (1 - alphas_cumprod).sqrt())
        self.register_buffer('betas', betas)

    def forward(self, obs_seq, actions=None):
        """
        obs_seq: (B, obs_horizon, obs_dim) observation history
        actions: (B, pred_horizon, action_dim), only passed during training
        """
        # Encode the observations
        obs_features = self.obs_encoder(obs_seq)  # (B, feat_dim)

        if actions is not None:
            return self._compute_loss(obs_features, actions)
        else:
            return self._sample(obs_features)

    def _compute_loss(self, obs_features, actions):
        """Training: predict the injected noise"""
        B = actions.shape[0]
        noise = torch.randn_like(actions)

        # Random timestep per sample
        t = torch.randint(0, self.num_steps, (B,), device=actions.device)

        # Add noise. view(B, 1, 1) turns each sample's scalar coefficient into
        # shape (B, 1, 1) so it broadcasts over (B, pred_horizon, action_dim).
        sqrt_alpha = self.sqrt_alphas_cumprod[t].view(B, 1, 1)
        sqrt_one_minus_alpha = self.sqrt_one_minus_alphas_cumprod[t].view(B, 1, 1)
        noisy_actions = sqrt_alpha * actions + sqrt_one_minus_alpha * noise

        # Predict the noise
        noise_pred = self.noise_net(noisy_actions, t, obs_features)

        return ((noise_pred - noise) ** 2).mean()

    @torch.no_grad()
    def _sample(self, obs_features, use_ddim=True, ddim_steps=10):
        """Inference: denoise from pure noise with deterministic DDIM updates"""
        B = obs_features.shape[0]
        device = obs_features.device
        x = torch.randn(B, self.pred_horizon, self.action_dim, device=device)

        # use_ddim=True strides the schedule (100 steps → 10 steps); otherwise
        # every timestep is visited with the same deterministic update
        # (this is not ancestral DDPM sampling).
        n = ddim_steps if use_ddim else self.num_steps
        step_indices = torch.linspace(0, self.num_steps - 1, n,
                                      device=device).round().long()
        for i in reversed(range(n)):
            t = step_indices[i].expand(B)
            noise_pred = self.noise_net(x, t, obs_features)

            # 0-dim schedule values broadcast against (B, horizon, action_dim)
            alpha = self.alphas_cumprod[step_indices[i]]
            alpha_prev = (self.alphas_cumprod[step_indices[i - 1]]
                          if i > 0 else torch.ones((), device=device))

            x0_pred = (x - (1 - alpha).sqrt() * noise_pred) / alpha.sqrt()
            x = alpha_prev.sqrt() * x0_pred + \
                (1 - alpha_prev).sqrt() * noise_pred

        return x  # (B, pred_horizon, action_dim)
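
A diffusion schedule only matches sampling-from-pure-noise if almost no signal is left at the final step, that is, if the cumulative product of (1 - beta_t) is close to zero. The linear beta range 1e-4 to 0.02 was originally paired with 1000 steps, so it is worth checking what happens when the same range is spread over fewer steps. The following standard-library sketch does that; the helper names are illustrative.

```python
import math


def linear_betas(num_steps: int, start: float = 1e-4, end: float = 0.02) -> list[float]:
    """Evenly spaced betas from `start` to `end`."""
    if num_steps == 1:
        return [start]
    return [start + (end - start) * i / (num_steps - 1) for i in range(num_steps)]


def final_alpha_bar(betas: list[float]) -> float:
    """Product of (1 - beta_t): fraction of signal variance left at the last step."""
    return math.prod(1.0 - b for b in betas)


for steps in (100, 1000):
    print(steps, f"{final_alpha_bar(linear_betas(steps)):.5f}")
```

With 100 steps roughly a third of the signal variance survives, while with 1000 steps it is effectively gone. When using a short schedule, either rescale the betas or switch to a schedule designed for few steps (for example a cosine schedule) so that training covers the noise level sampling starts from.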

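4. Data Collection and Training¶

4.1 Robot Data Collection Methods¶

Method	Efficiency	Quality	Examples
Teleoperation (VR controllers)	Low (1:1 real time)	High	DROID
Teleoperation (leader-follower arms)	Low	Very high	ALOHA, Mobile ALOHA, GELLO; UMI (handheld-gripper variant)
Learning from human video	Very high (e.g. YouTube)	Low (no action labels)	GR-1, GR-2
Simulation data	Very high	Medium (sim2real gap)	RoboGen, GenSim
Autonomous exploration	High	Low, improving over time	RoboAgent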

4.2 Fine-Tuning a VLA¶

Python

from transformers import AutoModelForVision2Seq, AutoProcessor
from torch.utils.data import DataLoader
import torch

class VLAFineTuner:
    """VLA fine-tuning (using OpenVLA as the example)"""

    def __init__(self, model_path, learning_rate=2e-5, use_lora=True):
        self.processor = AutoProcessor.from_pretrained(
            model_path, trust_remote_code=True
        )
        self.model = AutoModelForVision2Seq.from_pretrained(
            model_path, torch_dtype=torch.bfloat16, trust_remote_code=True
        ).cuda()

        if use_lora:
            self._apply_lora()

        self.optimizer = torch.optim.AdamW(
            # Keep only parameters with requires_grad=True (this excludes the
            # frozen vision encoders), so only LoRA and other trainable
            # weights are optimized
            filter(lambda p: p.requires_grad, self.model.parameters()),
            lr=learning_rate, weight_decay=0.01
        )

    def _apply_lora(self, rank=32, alpha=64):
        """Apply LoRA to the Llama part (vision encoders stay frozen)"""
        from peft import LoraConfig, get_peft_model

        lora_config = LoraConfig(
            r=rank,
            lora_alpha=alpha,
            target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
            lora_dropout=0.05,
            bias="none"
        )
        self.model = get_peft_model(self.model, lora_config)

        # Freeze the vision encoders (assumes their parameter names
        # contain "vision")
        for name, param in self.model.named_parameters():
            if "vision" in name.lower():
                param.requires_grad = False

        trainable = sum(p.numel() for p in self.model.parameters() if p.requires_grad)
        total = sum(p.numel() for p in self.model.parameters())
        print(f"Trainable parameters: {trainable/1e6:.1f}M / {total/1e6:.1f}M "
              f"({100*trainable/total:.1f}%)")

    def train_epoch(self, dataloader):
        """Train for one epoch"""
        self.model.train()  # switch to training mode
        total_loss = 0

        for batch in dataloader:
            # Assumes the collate function has already run the processor and
            # built labels in which the action tokens are the prediction target
            outputs = self.model(
                input_ids=batch["input_ids"].cuda(),
                attention_mask=batch["attention_mask"].cuda(),
                pixel_values=batch["pixel_values"].to("cuda", dtype=torch.bfloat16),
                labels=batch["labels"].cuda(),  # action tokens as labels
            )

            loss = outputs.loss

            # Backward pass
            self.optimizer.zero_grad()  # clear old gradients
            loss.backward()  # compute gradients
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
            self.optimizer.step()  # update parameters

            total_loss += loss.item()  # single-element tensor -> Python float

        return total_loss / len(dataloader)

# Suggested fine-tuning hyperparameters (rules of thumb; tune for your setup)
# - Learning rate: 2e-5 (full fine-tuning) / 5e-5 (LoRA)
# - Batch size: 16-64 (depends on GPU memory)
# - Epochs: 50-200 (depends on dataset size)
# - Data: at least 50 valid trajectories for a small-scale task, 500+ for robustness
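
The LoRA settings above (rank 32 on the four attention projections) imply a trainable-parameter budget that can be estimated by hand. The sketch assumes the Llama-2 7B shape (32 decoder layers, hidden size 4096, square 4096×4096 attention projections) and counts only the LoRA matrices attached to the language model; `lora_params` is an illustrative helper.

```python
def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) next to a frozen weight."""
    return rank * (in_features + out_features)


HIDDEN = 4096
NUM_LAYERS = 32
TARGETS_PER_LAYER = 4  # q_proj, k_proj, v_proj, o_proj
RANK = 32

per_module = lora_params(HIDDEN, HIDDEN, RANK)
total = per_module * TARGETS_PER_LAYER * NUM_LAYERS
print(f"{per_module:,} per projection, {total / 1e6:.1f}M in total")
```

That is about 33.6M trainable parameters, roughly half a percent of a 7B-parameter model, which is why LoRA fine-tuning fits on far smaller hardware than full fine-tuning. The figure printed by `_apply_lora` may differ if other modules match the target names.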

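5. Frontier VLA Models¶

5.1 GR-2 (ByteDance, 2024)¶

Text Only

Features:
  - Trains a world model and a policy within one framework
  - First learns to predict future video frames, then learns to generate actions
  - Pre-trained on roughly 38 million internet video clips; the model predicts future frames alongside actions
    (the "38" refers to millions of clips, not to 38B parameters)

Architecture:
  Video data (large-scale) ──→ video-prediction pretraining ──→ world model
                                                                    ↓
  Robot data (small) ──→ action fine-tuning ──→ policy model

Significance: a world model lets the robot "imagine" the consequences of its actions, reducing the dependence on real robot data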

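5.2 RDT-1B (Tsinghua University, 2024)¶

Text Only

Features:
  - A VLA built on DiT (Diffusion Transformer)
  - 1.2B parameters, efficient to run
  - One model controls single-arm and bimanual robots as well as dexterous hands
  - Supports multiple action spaces

Key design choices:
  - A physically interpretable, unified action representation
  - A DiT backbone adapted to robot data, with image and language conditions injected into the denoiser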
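
One way to picture a physically interpretable, unified action representation is a fixed-length vector in which every physical quantity owns a fixed slot, with a mask marking which slots a given robot actually uses. The sketch below illustrates that idea only. The slot names, the 16-dimensional size, and the `to_unified` helper are hypothetical and do not reproduce RDT-1B's actual layout.

```python
from typing import Dict, List, Tuple

# Hypothetical slot layout: each physical quantity owns a fixed index range.
SLOTS: Dict[str, Tuple[int, int]] = {
    "right_arm_joint_pos": (0, 7),
    "right_gripper": (7, 8),
    "left_arm_joint_pos": (8, 15),
    "left_gripper": (15, 16),
}
UNIFIED_DIM = 16


def to_unified(action: Dict[str, List[float]]) -> Tuple[List[float], List[int]]:
    """Place a robot-specific action into the shared vector and build its mask."""
    vector = [0.0] * UNIFIED_DIM
    mask = [0] * UNIFIED_DIM
    for name, values in action.items():
        start, end = SLOTS[name]
        if len(values) != end - start:
            raise ValueError(f"{name} expects {end - start} values, got {len(values)}")
        vector[start:end] = values
        mask[start:end] = [1] * (end - start)
    return vector, mask


# A single-arm robot fills only the right-arm slots; the rest stay padded.
single_arm = {"right_arm_joint_pos": [0.1] * 7, "right_gripper": [1.0]}
vector, mask = to_unified(single_arm)
```

Because the same index always means the same physical quantity, data from single-arm and bimanual robots can share one model, and the mask lets the loss ignore padded dimensions.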

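5.3 Model Comparison¶

Model	Parameters	Action type	Vision encoder	Open source	Highlights
RT-2	55B	Discrete tokens	ViT	✗	Web-knowledge transfer
Octo	93M	Diffusion head	Shallow CNN patch encoder	✓	Lightweight, multi-robot
OpenVLA	7B	Discrete tokens	DINOv2+SigLIP	✓	Open-source workhorse
π0	~3B	Flow matching	SigLIP	✓ (openpi)	Dexterous manipulation, SOTA at release
GR-2	<1B (see paper for variants)	Continuous	Video-MAE	✗	World model
RDT-1B	1.2B	Diffusion	SigLIP	✓	Efficient DiT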

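6. Exercises¶

Conceptual Questions¶

RT-2 tokenizes actions into discrete bins. What are the advantages and drawbacks?
Why does π0 use flow matching instead of DDPM?
What do dual vision encoders (DINOv2+SigLIP) offer over a single encoder?
How does a world model (such as GR-2) reduce the dependence on robot data?

Coding Practice¶

Beginner: implement a simplified action tokenizer (256-bin quantization and de-quantization)
Intermediate: implement a conditional diffusion policy with DDPM (2D box-pushing environment)
Advanced: fine-tune OpenVLA in the SIMPLER environment and evaluate zero-shot generalization

Common Interview Questions¶

What is the fundamental difference between a VLA and classical visual servoing?
Which scenarios suit an autoregressive VLA (RT-2), and which suit a diffusion VLA (π0)?
How would you evaluate the generalization of a VLA model? Design an evaluation protocol.
How much data does it take to fine-tune OpenVLA for your specific robot, and how can you improve data efficiency?
How can VLA latency be reduced? (model quantization, action buffering, asynchronous inference)

Last updated: February 2026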