---
id: learning-and-ai
title: "Learning & AI"
status: established
source_sections: "reference/sources/github-unitree-rl-gym.md, reference/sources/github-unitree-rl-lab.md, reference/sources/github-xr-teleoperate.md, reference/sources/paper-bfm-zero.md, reference/sources/paper-gait-conditioned-rl.md"
related_topics: [simulation, locomotion-control, manipulation, sdk-programming, whole-body-control, motion-retargeting, push-recovery-balance]
key_equations: []
key_terms: [gait_conditioned_rl, curriculum_learning, sim_to_real, lerobot, xr_teleoperate, teleoperation]
images: []
examples: []
open_questions:
  - "Optimal reward function design for G1 locomotion"
  - "Training time estimates for different policy types"
  - "How to fine-tune the stock locomotion policy"
  - "LLM-based task planning integration status (firmware v3.2+)"
---

# Learning & AI

Reinforcement learning, imitation learning, and AI-based control for the G1.

## 1. Reinforcement Learning

### Official RL Frameworks

| Framework          | Repository                                | Base Library        | Sim Engine    | G1 Support | Tier |
|-------------------|------------------------------------------|--------------------|--------------|-----------:|------|
| unitree_rl_gym     | unitreerobotics/unitree_rl_gym           | legged_gym + rsl_rl | Isaac Gym    | Yes        | T0   |
| unitree_rl_lab     | unitreerobotics/unitree_rl_lab           | Isaac Lab          | Isaac Lab    | G1-29dof   | T0   |

### unitree_rl_gym — Complete RL Pipeline
The primary framework for training locomotion policies: [T0]

- **Supported robots:** Go2, H1, H1_2, G1
- **Algorithm:** PPO (via rsl_rl)
- **Training:** Parallel environments, GPU/CPU device selection, checkpoint management
- **Pipeline:** Train → Play → Sim2Sim (MuJoCo validation) → Sim2Real (unitree_sdk2_python)
- **Deployment:** Python scripts and C++ binaries with network interface configuration

### unitree_rl_lab — Isaac Lab Integration
Advanced RL training on NVIDIA Isaac Lab: [T0]

- **Supported robots:** Go2, H1, G1-29dof
- **Simulation backends:** Isaac Lab (NVIDIA) and MuJoCo (cross-sim validation)
- **Deployment:** Simulation → Sim-to-sim → Real robot via unitree_sdk2
- **Language mix:** Python 65.1%, C++ 31.3%

### Key RL Research on G1

| Paper | Contribution | Validated on G1? | Tier |
|-------|-------------|-----------------|------|
| Gait-Conditioned RL (arXiv:2505.20619) | Multi-phase curriculum, gait-specific reward routing | Yes | T1 |
| Getting-Up Policies (arXiv:2502.12152) | Two-stage fall recovery via RL | Yes | T1 |
| HoST (arXiv:2502.08378) | Multi-critic RL for diverse posture recovery | Yes | T1 |
| Fall-Safety (arXiv:2511.07407) | Unified prevention + mitigation + recovery | Yes (zero-shot) | T1 |
| Vision Locomotion (arXiv:2602.06382) | End-to-end depth-based locomotion | Yes | T1 |
| Safe Control (arXiv:2502.02858) | Projected Safe Set for collision avoidance | Yes | T1 |
| ASAP (sim-to-real correction) | Aligning Simulation and Real-World Physics: a delta (residual) action model learned from real-world data captures the dynamics mismatch, and the sim-trained policy is fine-tuned against it. 52.7% tracking error reduction on G1. | Yes | T1 |

### WBC-AGILE — Open-Source Training Framework [T1]

NVIDIA's **WBC-AGILE** (`nvidia-isaac/WBC-AGILE`) provides the training framework for GR00T-WBC policies:

- **Repository:** nvidia-isaac/WBC-AGILE (GitHub)
- **Purpose:** Train locomotion + WBC policies for G1 and other humanoids
- **Framework:** Isaac Lab + RSL-RL (PPO)
- **G1 support:** Built-in G1 configuration
- **Deployment:** Exports to ONNX, drop-in replacement for GR00T-WBC pre-trained policies
- **Use cases:** Retraining with corrected dynamics, fine-tuning PD gains, adding push recovery curriculum
- **GB10 compatible:** Isaac Sim/Lab officially supported on GB10/DGX Spark

**Note:** The GR00T-WBC repository is inference-only — it does NOT contain training code. WBC-AGILE is the separate training framework.

## 2. Imitation Learning

### Data Collection — Teleoperation

| System               | Device                    | Repository                            | Features                     |
|---------------------|---------------------------|---------------------------------------|------------------------------|
| XR Teleoperate       | Vision Pro, PICO 4, Quest 3 | unitreerobotics/xr_teleoperate      | Hand tracking, data recording |
| Kinect Teleoperate   | Azure Kinect DK           | unitreerobotics/kinect_teleoperate   | Body tracking, safety wake-up |

### Training Frameworks

| Framework            | Repository                                | Purpose                              |
|---------------------|------------------------------------------|--------------------------------------|
| unitree_IL_lerobot    | unitreerobotics/unitree_IL_lerobot       | Modified LeRobot for G1 dual-arm training |
| HuggingFace LeRobot   | huggingface.co/docs/lerobot/en/unitree_g1 | Standard LeRobot with G1 config      |

**LeRobot G1 integration:** Supports both the 29-DOF and 23-DOF versions and includes gr00t_wbc locomotion integration for whole-body control during manipulation tasks. [T1]

### Imitation Learning Workflow
```
1. Teleoperate (XR/Kinect) → record episodes
2. Process data → extract observation-action pairs
3. Train policy (LeRobot / custom) → behavior cloning or diffusion policy
4. Deploy → unitree_sdk2 on real robot
```
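
Steps 2 and 3 of the workflow can be illustrated with a deliberately tiny behavior-cloning sketch: recorded episodes are flattened into observation-action pairs and a policy is fit by supervised regression. This is not LeRobot's API; the scalar linear policy, the function names, and the toy episode data are all assumptions chosen so the example runs with the standard library alone.

```python
def fit_linear_policy(observations, actions):
    """Least-squares fit of action = w * obs + b for scalar obs and action."""
    n = len(observations)
    mean_o = sum(observations) / n
    mean_a = sum(actions) / n
    cov = sum((o - mean_o) * (a - mean_a) for o, a in zip(observations, actions))
    var = sum((o - mean_o) ** 2 for o in observations)
    w = cov / var
    b = mean_a - w * mean_o
    return w, b


# Hypothetical recorded episode: (observation, action) pairs from teleoperation.
episode = [(0.0, 0.1), (0.5, 0.6), (1.0, 1.1), (1.5, 1.6)]
obs = [o for o, _ in episode]
act = [a for _, a in episode]

w, b = fit_linear_policy(obs, act)


def cloned_policy(observation):
    """Policy that imitates the demonstrator on the recorded data."""
    return w * observation + b
```

Real pipelines replace the linear model with a neural network (or a diffusion policy) and the scalars with image and joint-state tensors, but the supervised structure is the same.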

## 3. Policy Deployment

### Deployment Options

| Method               | Language | Latency    | Use Case                      |
|---------------------|----------|------------|-------------------------------|
| unitree_sdk2_python  | Python   | Higher     | Prototyping, research          |
| unitree_sdk2 (C++)   | C++      | Lower      | Production, real-time control  |

### Deployment Checklist
1. **Validate in simulation** — Run policy in unitree_mujoco or Isaac Lab
2. **Cross-sim validate** — Test in a second simulator (Sim2Sim)
3. **Low-gain start** — Deploy with reduced gains initially
4. **Tethered testing** — Support robot with a safety harness for first real-world tests
5. **Gradual ramp-up** — Increase to full gains after verifying stability
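
Items 3 and 5 of the checklist can be sketched as a gain schedule. The function below is a minimal illustration, assuming a linear ramp; the name `ramp_gain`, the 30% starting fraction, and the 10 s ramp time are hypothetical values, not Unitree recommendations.

```python
def ramp_gain(nominal_kp, elapsed_s, ramp_s=10.0, start_fraction=0.3):
    """Interpolate linearly from start_fraction * nominal_kp up to nominal_kp.

    ramp_s and start_fraction are placeholder values for illustration only.
    """
    progress = min(max(elapsed_s / ramp_s, 0.0), 1.0)
    fraction = start_fraction + (1.0 - start_fraction) * progress
    return nominal_kp * fraction
```

In practice each increase would be gated on an operator confirming stability rather than on elapsed time alone; the time-based ramp only shows the shape of the schedule.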

### Safety Wrappers
When deploying custom policies, add safety layers: [T2 — Best practice]
- Joint limit clamping (see [[equations-and-bounds]])
- Torque saturation limits
- Fall detection with emergency stop
- Velocity bounds for safe walking speeds
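
The four layers listed above can be combined into a single wrapper that sits between the policy and the SDK command call. The sketch below is an assumption-laden illustration: `SafetyLimits`, `apply_safety`, and all numeric limits are hypothetical, and the real joint and torque bounds should come from [[equations-and-bounds]].

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SafetyLimits:
    q_min: List[float]    # lower joint position limits [rad]
    q_max: List[float]    # upper joint position limits [rad]
    tau_max: List[float]  # absolute torque limits [N*m]
    max_tilt_rad: float   # torso tilt treated as a fall [rad]
    max_speed: float      # walking speed bound [m/s]


def clamp(value, low, high):
    return max(low, min(high, value))


def apply_safety(q_target, tau_cmd, speed_cmd, tilt_rad, limits):
    """Clamp a policy's outputs and report whether a fall was detected."""
    q_safe = [clamp(q, lo, hi)
              for q, lo, hi in zip(q_target, limits.q_min, limits.q_max)]
    tau_safe = [clamp(t, -m, m) for t, m in zip(tau_cmd, limits.tau_max)]
    speed_safe = clamp(speed_cmd, -limits.max_speed, limits.max_speed)
    fall_detected = abs(tilt_rad) > limits.max_tilt_rad
    return q_safe, tau_safe, speed_safe, fall_detected


# Placeholder limits for a 2-joint toy example; these are not G1 values.
limits = SafetyLimits(q_min=[-1.0, -0.5], q_max=[1.0, 0.5],
                      tau_max=[20.0, 20.0], max_tilt_rad=0.6, max_speed=0.8)
q, tau, v, fall = apply_safety([1.4, 0.2], [-35.0, 5.0], 1.5, 0.1, limits)
```

The wrapper only flags a fall; what the emergency stop actually does (for example, switching to a damping mode) is left to the caller.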

## 4. Foundation Models

### BFM-Zero (arXiv:2511.04131)
First behavioral foundation model for real humanoids: [T1]
- **Key innovation:** Promptable control without retraining (reward optimization, pose reaching, motion tracking)
- **Training:** Motion capture data regularization + online off-policy unsupervised RL
- **Validation:** Deployed on G1 hardware
- **Significance:** Enables flexible task specification without policy retraining

### Behavior Foundation Model (arXiv:2509.13780)
- Uses masked online distillation with Conditional Variational Autoencoder (CVAE)
- Models behavioral distributions from large-scale datasets
- Tested on G1 (1.3 m, 29-DOF) [T1]

### LLM Integration (Firmware v3.2+)
- Preliminary LLM integration support on EDU models [T2]
- Natural language task commands via Jetson Orin [T2]
- Status and capabilities not yet fully documented — see open questions

## 5. Motion Tracking Policies

RL policies trained to imitate reference motions (from mocap) while maintaining balance: [T1 — Research papers]

| Framework | Paper | Approach | G1 Validated? |
|---|---|---|---|
| BFM-Zero | arXiv:2511.04131 | Foundation model with motion tracking mode | Yes |
| H2O | arXiv:2403.04436 | Real-time human-to-humanoid tracking | Humanoid (not G1 specifically) |
| OmniH2O | arXiv:2406.08858 | Multi-modal input tracking | Humanoid |
| HumanPlus | arXiv:2406.10454 | Shadowing of human motion from a single RGB camera, then imitation learning on the shadowed data | Humanoid |

**BFM-Zero** is the most directly G1-relevant: it provides a "motion tracking" mode in which the policy receives a reference pose and tracks it while maintaining balance. It generalizes zero-shot to unseen motions and is open-source. See [[motion-retargeting]] for the full retargeting pipeline.

**Key insight:** These policies learn to simultaneously track the reference motion AND maintain balance. Push recovery is implicit — the same policy handles both. Training with perturbation curriculum further enhances robustness. See [[push-recovery-balance]].
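
One common way to express "track the reference and stay balanced" as a reward is an exponential tracking kernel combined with an uprightness term. The sketch below follows that general pattern from the motion-imitation literature; it is not taken from any of the papers in the table, and `sigma`, the weights, and the function names are assumptions.

```python
import math


def tracking_reward(q, q_ref, sigma=0.5):
    """exp(-||q - q_ref||^2 / sigma^2): 1.0 at perfect tracking, decays with error."""
    sq_err = sum((a - b) ** 2 for a, b in zip(q, q_ref))
    return math.exp(-sq_err / sigma ** 2)


def total_reward(q, q_ref, upright_cos, w_track=0.7, w_balance=0.3):
    """Weighted sum of joint tracking and an uprightness term in [0, 1].

    upright_cos is the cosine of the angle between the torso z-axis and
    world up; the weights are hypothetical.
    """
    balance = max(0.0, upright_cos)
    return w_track * tracking_reward(q, q_ref) + w_balance * balance
```

Because both terms are optimized by the same policy, a push that tilts the torso lowers the reward just as a tracking error does, which is one way to read the "implicit push recovery" point above.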

## 6. Residual Policy Learning

Training a small correction policy on top of an existing base controller: [T1 — Established technique]

```
a_final = a_base + α * a_residual     (α ∈ [0, 1] for safety scaling)
```

- **Base policy:** Stock G1 controller or a pre-trained locomotion policy
- **Residual policy:** Small network trained to improve specific behavior (e.g., push recovery)
- **Scaling factor α:** Limits maximum deviation from base behavior

**Use case for G1:** Enhance the stock controller's push recovery without replacing it entirely. Train the residual in simulation with perturbation curriculum, deploy as an overlay. See [[push-recovery-balance]] §3b.
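
The blending rule above translates directly into code. This is a minimal sketch with a hypothetical function name; it assumes both policies output actions of the same dimension and in the same units.

```python
def blend_actions(a_base, a_residual, alpha):
    """Return a_base + alpha * a_residual, element-wise, with alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(a_base) != len(a_residual):
        raise ValueError("base and residual actions must have the same length")
    return [b + alpha * r for b, r in zip(a_base, a_residual)]
```

Setting `alpha = 0.0` recovers the base controller exactly, which gives a simple fallback if the residual misbehaves on hardware.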

## 7. Perturbation Curriculum

Training RL policies with progressively increasing external disturbances: [T1 — Multiple G1 papers]

```
Stage 1: No perturbations (learn basic locomotion)
Stage 2: Small random pushes (10-30 N, occasional)
Stage 3: Medium pushes (30-80 N, more frequent)
Stage 4: Large pushes (80-200 N) + terrain variation
Stage 5: Large pushes + concurrent upper-body task
```

This is the primary method for achieving the "always-on balance" goal. Papers arXiv:2505.20619 and arXiv:2511.07407 demonstrate this approach on real G1 hardware. See [[push-recovery-balance]] §3a for detailed parameters.
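
The stage list can be sketched as a force sampler plus a promotion rule. The force ranges are copied from the stages above, with stage 5 assumed to reuse the stage 4 range; the success-rate threshold and the function names are hypothetical and do not come from the cited papers.

```python
import random

# (min_force_N, max_force_N) per stage. Stage 5 reusing the stage 4
# range is an assumption; the stage list only says "large pushes".
PUSH_RANGES_N = {
    1: (0.0, 0.0),
    2: (10.0, 30.0),
    3: (30.0, 80.0),
    4: (80.0, 200.0),
    5: (80.0, 200.0),
}
FINAL_STAGE = 5


def sample_push_force(stage, rng):
    """Draw a push magnitude in newtons for the given curriculum stage."""
    low, high = PUSH_RANGES_N[stage]
    return rng.uniform(low, high)


def next_stage(stage, success_rate, threshold=0.8):
    """Advance one stage once the success rate reaches a (hypothetical) threshold."""
    if success_rate >= threshold and stage < FINAL_STAGE:
        return stage + 1
    return stage
```

A training loop would call `next_stage` after each evaluation window and `sample_push_force` whenever a perturbation is scheduled.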

## 8. MuJoCo Playground Training Pipeline — Verified (2026-02-15) [T1]

GPU-parallelized RL training for G1 locomotion using MuJoCo Playground (Google DeepMind) on the Dell Pro Max GB10.

### Setup
- **Framework:** MuJoCo Playground (`playground` package from GitHub, not PyPI)
- **Environment:** `G1JoystickFlatTerrain` (29-DOF, 103-dim obs, velocity tracking with phase-based gait)
- **Training:** Brax PPO, JAX + CUDA 12 on Blackwell GPU, 8192 parallel MJX environments
- **Throughput:** ~17K steps/sec on GB10 Blackwell

### G1 Environment Details (from source inspection)
- **Observation (103 dims):** linvel(3) + gyro(3) + gravity(3) + command(3) + joint_pos-default(29) + joint_vel(29) + last_act(29) + phase(4)
- **Privileged state (165 dims):** state(103) + clean sensors + actuator force + contact + feet velocity
- **Actions:** 29 joint position targets (all DOF), residual from default pose, scaled by 0.25
- **Control rate:** 50 Hz (0.02s ctrl_dt), physics at 500 Hz (0.002s sim_dt)
- **Push perturbations:** Enabled by default (0.1-2.0 m/s velocity impulse, every 5-10 s)
- **Rewards:** 23 terms, including velocity tracking, gait phase, orientation, foot slip, joint deviation
- **Domain randomization:** Friction (0.4-1.0), mass (±10%), torso mass offset (±1 kg), armature (1.0-1.05x)
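
The dimensions and rates listed above can be cross-checked with a few lines of arithmetic. The constant and function names below are illustrative and are not identifiers from the MuJoCo Playground source; only the numbers come from the list.

```python
# Observation components and their sizes, as listed above.
OBS_LAYOUT = {
    "linvel": 3,
    "gyro": 3,
    "gravity": 3,
    "command": 3,
    "joint_pos_minus_default": 29,
    "joint_vel": 29,
    "last_act": 29,
    "phase": 4,
}
OBS_DIM = sum(OBS_LAYOUT.values())

CTRL_DT = 0.02   # 50 Hz policy rate
SIM_DT = 0.002   # 500 Hz physics rate
PHYSICS_STEPS_PER_CTRL = round(CTRL_DT / SIM_DT)

ACTION_SCALE = 0.25


def joint_targets(default_pose, action):
    """Position targets are the default pose plus the scaled policy output."""
    return [q0 + ACTION_SCALE * a for q0, a in zip(default_pose, action)]
```

The components sum to 103, matching the stated observation size, and each policy step spans 10 physics steps.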

### Also Available
- `G1JoystickRoughTerrain` — same env with procedural terrain
- H1 gait tracking environments — reference pattern for extending G1 with tracking rewards
- No existing whole-body tracking env for G1 (only H1 and Spot have gait tracking variants)

### Training Results (locomotion-only baseline)
- 5M steps (tiny): 6 min 41 sec, reward -6.4 → -2.8
- 200M steps (full): reward progression -6.4 → +8.8 at 117M steps (training in progress)
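
As a rough sanity check, the figures above can be turned into throughput and wall-clock estimates. The helper names are hypothetical. The short run works out to roughly 12.5K steps/sec, below the ~17K steady-state figure; one plausible explanation, stated here as an assumption, is that its wall-clock time includes one-time JIT compilation and evaluation overhead.

```python
def steps_per_second(total_steps, minutes, seconds):
    """Average throughput over a run of the given wall-clock duration."""
    return total_steps / (minutes * 60 + seconds)


def hours_at_rate(total_steps, rate_steps_per_s):
    """Wall-clock hours needed for total_steps at a constant rate."""
    return total_steps / rate_steps_per_s / 3600


# 5M-step run: 6 min 41 sec of wall-clock time.
short_run_rate = steps_per_second(5_000_000, 6, 41)

# 200M-step run at the reported ~17K steps/sec steady-state throughput.
full_run_hours = hours_at_rate(200_000_000, 17_000)
```

At the steady-state rate the full 200M-step run would take a little over three hours, excluding compilation and evaluation.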

### Planned: Unified Whole-Body Control Training
Research direction: fork `G1JoystickFlatTerrain` to add upper-body pose tracking for telepresence (Apple Vision Pro mocap plus joystick locomotion). See `plans/eager-shimmying-raccoon.md` for the full plan. The approach follows the ExBody/ExBody2 paradigm: decouple velocity tracking (lower body) from keypoint tracking (upper body), using a 4-stage curriculum and roughly 400M training steps.

### Key Open-Source Repos for G1 Whole-Body RL
| Repo | Approach | G1 Validated? |
|------|----------|---------------|
| MuJoCo Playground | GPU-parallelized MJX training, native G1 env | Yes [T1] |
| BFM-Zero (LeCAR-Lab) | Foundation model, motion tracking mode | Yes [T1] |
| BeyondMimic (HybridRobotics) | Whole-body tracking from LAFAN1 | Yes (claimed) |
| H2O / OmniH2O (LeCAR-Lab) | Real-time teleoperation | Humanoid (not G1-specific) |
| ExBody2 (UC San Diego) | Expressive whole-body with velocity decoupling | Humanoid |

## Key Relationships
- Trains in: [[simulation]] (MuJoCo, Isaac Lab, Isaac Gym)
- Deploys via: [[sdk-programming]] (unitree_sdk2 DDS interface)
- Controls: [[locomotion-control]] (RL-trained gait policies)
- Controls: [[manipulation]] (learned manipulation policies)
- Data from: [[manipulation]] (teleoperation → imitation learning)
- Enables: [[motion-retargeting]] (RL-based motion tracking policies)
- Enables: [[push-recovery-balance]] (perturbation curriculum, residual policies)
- Coordinated by: [[whole-body-control]] (WBC training frameworks)