---
id: motion-retargeting
title: "Motion Capture & Retargeting"
status: established
source_sections: "reference/sources/paper-bfm-zero.md, reference/sources/paper-h2o.md, reference/sources/paper-omnih2o.md, reference/sources/paper-humanplus.md, reference/sources/dataset-amass-g1.md, reference/sources/github-groot-wbc.md, reference/sources/community-mocap-retarget-tools.md"
related_topics: [whole-body-control, joint-configuration, simulation, learning-and-ai, equations-and-bounds, push-recovery-balance]
key_equations: [inverse_kinematics, kinematic_scaling]
key_terms: [motion_retargeting, mocap, amass, smpl, kinematic_scaling, inverse_kinematics]
images: []
examples: []
open_questions:
  - "What AMASS motions have been successfully replayed on physical G1?"
  - "What is the end-to-end latency from mocap capture to robot execution?"
  - "Which retargeting approach gives best visual fidelity on G1 (IK vs. RL)?"
  - "Can video-based pose estimation (MediaPipe/OpenPose) provide sufficient accuracy for G1 retargeting?"
---

# Motion Capture & Retargeting

Capturing human motion and replaying it on the G1, including the kinematic mapping problem, data sources, and execution approaches.

## 1. The Retargeting Problem

A human has ~200+ degrees of freedom (skeleton + soft tissue). The G1 has 23-43 DOF.
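The limb-proportion mismatch is typically handled by per-limb kinematic scaling of keypoints before any IK is solved. A minimal sketch (the human limb lengths here are illustrative assumptions; the G1 figures follow this note's ~0.6 m legs and ~0.45 m arms):

```python
import numpy as np

# Per-limb kinematic scaling: shrink human end-effector targets into the G1's
# smaller workspace before solving IK. Ratios below are illustrative:
# ~0.9 m human leg vs. 0.6 m G1 leg, ~0.7 m human arm vs. ~0.45 m G1 arm.
LIMB_SCALE = {
    "foot": 0.6 / 0.9,   # G1 leg length / assumed human leg length
    "hand": 0.45 / 0.7,  # G1 arm length / assumed human arm length
}

def scale_keypoint(p_world, p_root, limb):
    """Scale a keypoint's offset from its kinematic root (pelvis or shoulder)."""
    p_world = np.asarray(p_world, dtype=float)
    p_root = np.asarray(p_root, dtype=float)
    return p_root + LIMB_SCALE[limb] * (p_world - p_root)

# Example: a human foot 0.9 m below the pelvis maps to 0.6 m below the G1 pelvis.
pelvis = [0.0, 0.0, 0.9]
human_foot = [0.0, 0.1, 0.0]
g1_foot = scale_keypoint(human_foot, pelvis, "foot")  # → z = 0.3 m above ground
```

The scaled keypoints then feed the per-frame IK described in §2a; without this step, a G1-sized IK solve against human-sized targets saturates at the workspace boundary.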
Retargeting must solve three mismatches: [T1 — Established robotics problem]

| Mismatch | Human | G1 (29-DOF) | Challenge |
|---|---|---|---|
| DOF count | ~200+ | 29 | Many human motions have no G1 equivalent |
| Limb proportions | Variable | Fixed (1.32m height, 0.6m legs, ~0.45m arms) | Workspace scaling needed |
| Joint ranges | Very flexible | Constrained (e.g., knee 0-165°, hip pitch ±154°) | Motions may exceed limits |
| Dynamics | ~70kg average | ~35kg, different mass distribution | Forces/torques don't scale linearly |

### What Works Well on G1

- Walking, standing, stepping motions
- Upper-body gestures (waving, pointing, reaching)
- Pick-and-place style manipulation
- Simple dance or expressive motions

### What's Difficult or Impossible

- Motions requiring finger dexterity (without hands attached)
- Deep squats or ground-level motions (joint limit violations)
- Fast acrobatic motions (torque/speed limits)
- Motions requiring more DOF than available (e.g., spine articulation with 1-DOF waist)

## 2. Retargeting Approaches

### 2a. IK-Based Retargeting (Classical)

Solve inverse kinematics to map human end-effector positions to G1 joint angles: [T1]

```
Pipeline:
Mocap data (human skeleton)
  → Extract key points (hands, feet, head, pelvis)
  → Scale to G1 proportions
  → Solve IK per frame
  → Smooth trajectory
  → Check joint limits
  → Execute or reject
```

**Tools:**

- **Pinocchio:** C++/Python rigid body dynamics with fast IK solver (see [[whole-body-control]])
- **MuJoCo IK:** Built-in inverse kinematics in MuJoCo simulator
- **Drake:** MIT's robotics toolbox with optimization-based IK
- **IKPy / ikflow:** Lightweight Python IK libraries

**Pros:** Fast, interpretable, no training required, deterministic
**Cons:** Frame-by-frame IK can produce jerky motions, doesn't account for dynamics/balance, may violate torque limits even if joint limits are satisfied

### 2b. Optimization-Based Retargeting

Solve a trajectory optimization over the full motion: [T1]

```
minimize    Σ_t || FK(q_t) - x_human_t ||^2   (tracking error)
          + Σ_t || q_t - q_{t-1} ||^2         (smoothness)
subject to  q_min ≤ q_t ≤ q_max               (joint limits)
            CoM_t ∈ support_polygon_t         (balance)
            || tau_t || ≤ tau_max             (torque limits)
            no self-collision                 (collision avoidance)
```

**Tools:** CasADi, Pinocchio + ProxQP, Drake, Crocoddyl

**Pros:** Globally smooth, respects all constraints, can enforce balance
**Cons:** Slow (offline only), requires an accurate dynamics model, complex problem formulation

### 2c. RL-Based Motion Tracking (Recommended for G1)

Train an RL policy that imitates reference motions while maintaining balance: [T1 — Multiple papers validated on G1]

```
Pipeline:
Mocap data
  → Retarget to G1 skeleton (rough IK)
  → Use as reference
  → Train RL policy in sim: reward = tracking + balance + energy
  → Deploy on real G1 via sim-to-real transfer
```

This is the approach used by BFM-Zero, H2O, OmniH2O, and HumanPlus. The RL policy learns to:

- Track the reference motion as closely as possible
- Maintain balance even when the reference motion would be unstable
- Respect joint and torque limits naturally (they're part of the sim environment)
- Recover from perturbations (if trained with a perturbation curriculum)

**Key advantage:** Balance is baked into the policy — you don't need a separate balance controller.

### Key RL Motion Tracking Frameworks

| Framework | Paper | G1 Validated? | Key Feature |
|---|---|---|---|
| BFM-Zero | arXiv:2511.04131 | Yes | Zero-shot generalization to unseen motions, open-source |
| H2O | arXiv:2403.01623 | On humanoid (not G1 specifically) | Real-time teleoperation |
| OmniH2O | arXiv:2406.08858 | On humanoid | Multi-modal input (VR, RGB, mocap) |
| HumanPlus | arXiv:2406.10454 | On humanoid | RGB camera → shadow → imitate |
| GMT | Generic Motion Tracking | In sim | Tracks diverse AMASS motions |

### 2d. Hybrid Approach: IK + WBC

Use IK for the upper body, WBC for balance: [T1 — GR00T-WBC approach]

```
Mocap data
  → IK retarget (upper body only: arms, waist)
  → Feed to GR00T-WBC as upper-body targets
  → WBC locomotion policy handles legs/balance automatically
  → Execute on G1
```

This is likely the most practical near-term approach for the G1, using GR00T-WBC as the coordination layer. See [[whole-body-control]] for details.

## 3. Motion Capture Sources

### 3a. AMASS — Archive of Motion Capture as Surface Shapes

The largest publicly available human motion dataset: [T1]

| Property | Value |
|---|---|
| Motions | 11,000+ sequences from 15 mocap datasets |
| Format | SMPL body model parameters |
| G1 retarget | Available on HuggingFace (unitree) — pre-retargeted |
| License | Research use (check individual sub-datasets) |

**G1-specific:** Unitree has published AMASS motions retargeted to the G1 skeleton on HuggingFace. This provides ready-to-use reference trajectories for RL training or direct playback.

### 3b. CMU Motion Capture Database

Classic academic motion capture archive: [T1]

| Property | Value |
|---|---|
| Subjects | 144 subjects |
| Motions | 2,500+ sequences |
| Categories | Walking, running, sports, dance, interaction, etc. |
| Formats | BVH, C3D, ASF+AMC |
| License | Free for research |
| URL | mocap.cs.cmu.edu |

### 3c. Real-Time Sources (Live Mocap)

| Source | Device | Latency | Accuracy | G1 Integration |
|---|---|---|---|---|
| XR Teleoperate | Vision Pro, Quest 3, PICO 4 | Low (~50ms) | High (VR tracking) | Official (unitreerobotics/xr_teleoperate) |
| Kinect | Azure Kinect DK | Medium (~100ms) | Medium | Official (kinect_teleoperate) |
| MediaPipe | RGB camera | Low (~30ms) | Low-Medium | Community, needs retarget code |
| OpenPose | RGB camera | Medium | Medium | Community, needs retarget code |
| OptiTrack/Vicon | Marker-based system | Very low (~5ms) | Very high | Custom integration needed |

For real-time mocap → robot execution, the XR teleoperation system is the most direct path; AMASS provides offline motion libraries.

### 3d. Video-Based Pose Estimation

Extract human pose from standard RGB video without mocap hardware: [T2]

- **MediaPipe Pose:** 33 landmarks, real-time on CPU, Google
- **OpenPose:** 25 body keypoints, GPU required
- **HMR2.0 / 4DHumans:** SMPL mesh recovery from single image — richer than keypoints
- **MotionBERT:** Temporal pose estimation from video sequences

These are lower fidelity than marker-based mocap but require only a webcam. HumanPlus (arXiv:2406.10454) uses RGB camera input specifically for humanoid shadowing.

## 4. The Retargeting Pipeline

End-to-end pipeline from human motion to G1 execution:

```
┌───────────────┐     ┌────────────────┐     ┌────────────────┐
│ Motion        │     │ Skeleton       │     │ Kinematic      │
│ Source        │────►│ Extraction     │────►│ Retargeting    │
│ (mocap/video) │     │ (SMPL/joints)  │     │ (scale + IK)   │
└───────────────┘     └────────────────┘     └───────┬────────┘
                                                     │
                                                     ▼
┌───────────────┐     ┌────────────────┐     ┌────────────────┐
│ Execute on    │     │ WBC / RL       │     │ Feasibility    │
│ Real G1       │◄────│ Policy         │◄────│ Check          │
│ (sdk2)        │     │ (balance +     │     │ (joint limits, │
└───────────────┘     │ tracking)      │     │ stability)     │
                      └────────────────┘     └────────────────┘
```

### Step 1: Motion Source

- Offline: AMASS dataset, CMU mocap, recorded demonstrations
- Real-time: XR headset, Kinect, RGB camera

### Step 2: Skeleton Extraction

- AMASS: Already in SMPL format, extract joint angles
- BVH/C3D: Parse standard mocap formats
- Video: Run pose estimator (MediaPipe, OpenPose, HMR2.0)
- Output: Human joint positions/rotations per frame

### Step 3: Kinematic Retargeting

- Map human skeleton to G1 skeleton (limb length scaling)
- Solve IK for each frame or use direct joint angle mapping
- Handle DOF mismatch (project higher-DOF human motion to G1 subspace)
- Clamp to G1 joint limits (see [[equations-and-bounds]])

### Step 4: Feasibility Check

- Verify all joint angles are within limits
- Check CoM remains within the support polygon (static stability)
- Estimate required torques (inverse dynamics) — reject if exceeding actuator limits
- Check for self-collisions

### Step 5: Execution Policy

- **Direct playback:** Send retargeted joint angles via rt/lowcmd (no balance guarantee)
- **WBC execution:** Feed to GR00T-WBC as upper-body targets, let the locomotion policy handle balance
- **RL tracking:** Use a trained motion tracking policy (BFM-Zero style) that simultaneously tracks and balances

### Step 6: Deploy on Real G1

- Via unitree_sdk2_python (prototyping) or unitree_sdk2 C++ (production)
- 500 Hz control loop, 2ms DDS latency
- Always validate in simulation first (see [[simulation]])

## 5. SMPL Body Model

SMPL (Skinned Multi-Person Linear model) is the standard representation for human body shape and pose in mocap datasets: [T1]

- **Parameters:** 72 pose parameters (24 joints × 3 rotations) + 10 shape parameters
- **Output:** 6,890-vertex mesh + joint locations
- **Extensions:** SMPL-X (hands + face), SMPL+H (hands)
- **Relevance:** AMASS uses SMPL, so retargeting from AMASS means mapping SMPL joints → G1 joints

### SMPL to G1 Joint Mapping (Approximate)

| SMPL Joint | G1 Joint(s) | Notes |
|---|---|---|
| Pelvis | Waist (yaw) | G1 has 1-3 waist DOF vs. SMPL's 3 |
| L/R Hip | left/right_hip_pitch/roll/yaw | Direct mapping, 3-DOF each |
| L/R Knee | left/right_knee | Direct mapping, 1-DOF |
| L/R Ankle | left/right_ankle_pitch/roll | Direct mapping, 2-DOF |
| L/R Shoulder | left/right_shoulder_pitch/roll/yaw | Direct mapping, 3-DOF |
| L/R Elbow | left/right_elbow | Direct mapping, 1-DOF |
| L/R Wrist | left/right_wrist_yaw(+pitch+roll) | 1-DOF (23-DOF) or 3-DOF (29-DOF) |
| Spine | Waist (limited) | SMPL has 3 spine joints, G1 has 1-3 waist |
| Head/Neck | — | G1 has no head/neck DOF |
| Fingers | Hand joints (if equipped) | Only with Dex3-1 or INSPIRE |

## 6. Key Software & Repositories

| Tool | Purpose | Language | License |
|---|---|---|---|
| GR00T-WBC | End-to-end WBC + retargeting for G1 | Python/C++ | Apache 2.0 |
| Pinocchio | Rigid body dynamics, IK, Jacobians | C++/Python | BSD-2 |
| xr_teleoperate | Real-time VR mocap → G1 | Python | Unitree |
| unitree_mujoco | Simulate retargeted motions | C++/Python | BSD-3 |
| smplx (Python) | SMPL body model processing | Python | MIT |
| rofunc | Robot learning from human demos + retargeting | Python | MIT |
| MuJoCo Menagerie | G1 model (g1.xml) for IK/simulation | MJCF | BSD-3 |

## 7. Apple Vision Pro Telepresence Paths (Researched 2026-02-15) [T1/T2]

### Available Integration Options

| Path | Approach | App Required? | GR00T-WBC Compatible? | Retargeting |
|------|----------|:---:|:---:|---|
| xr_teleoperate | WebXR via Safari | No (browser) | No (uses stock SDK) | Pinocchio IK |
| VisionProTeleop | Native visionOS app | Yes (App Store / open-source) | Yes (via bridge) | Custom (flexible) |
| iPhone streamer | Socket.IO protocol | Custom visionOS app | Yes (built-in) | Pinocchio IK in GR00T-WBC |

### xr_teleoperate (Unitree Official)

- Vision Pro connects via Safari to `https://:8012` (WebXR)
- TeleVuer (Python, built on Vuer) serves the 3D interface
- WebSocket for tracking data, WebRTC for video feedback
- Pinocchio IK solves wrist poses → G1 arm joint angles
- Supports G1_29 and G1_23 variants
- **Limitation:** Bypasses GR00T-WBC — sends motor commands directly via DDS

### VisionProTeleop (MIT, Open-Source)

- Native visionOS app "Tracking Streamer" — on App Store + source on GitHub
- Python library `avp_stream` receives data via gRPC
- 25 finger joints/hand, head pose, wrist positions (native ARKit, better than WebXR)
- Robot-agnostic — needs a bridge to publish to GR00T-WBC's `ControlPolicy/upper_body_pose` ROS2 topic
- **Best path for GR00T-WBC integration with RL-based balance**

### GR00T-WBC Integration Point

The single integration point is the `ControlPolicy/upper_body_pose` ROS2 topic. Any source that publishes `target_upper_body_pose` (17 joint angles: 3 waist + 7 left arm + 7 right arm) and optionally `navigate_cmd` (velocity `[vx, vy, wz]`) can drive the robot. The `InterpolationPolicy` smooths targets before execution.
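A bridge only needs to produce that 17-value vector in the layout above. A minimal sketch of the packing step (the message type used on the wire, e.g. `Float64MultiArray`, and exact field names are assumptions to verify against the GR00T-WBC repository):

```python
import math

# Pack an upper-body target for GR00T-WBC's ControlPolicy/upper_body_pose topic.
# Layout (3 waist + 7 left arm + 7 right arm = 17 values) follows this note.
N_WAIST, N_ARM = 3, 7

def pack_upper_body_pose(waist, left_arm, right_arm):
    """Concatenate joint targets into the 17-value vector, with sanity checks."""
    if len(waist) != N_WAIST or len(left_arm) != N_ARM or len(right_arm) != N_ARM:
        raise ValueError("expected 3 waist + 7 left-arm + 7 right-arm angles")
    pose = list(waist) + list(left_arm) + list(right_arm)
    if any(not math.isfinite(q) for q in pose):
        raise ValueError("non-finite joint target")
    return pose

# Example: neutral waist, left shoulder pitched 0.5 rad, right arm at zero.
pose = pack_upper_body_pose([0.0] * 3, [0.5] + [0.0] * 6, [0.0] * 7)
# A ROS2 node would publish `pose` at the teleoperation rate; the
# InterpolationPolicy then smooths it before execution.
```

Validating here (length, finiteness) keeps malformed teleoperation frames from ever reaching the interpolation layer.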
## Key Relationships

- Requires: [[joint-configuration]] (target skeleton — DOF, joint limits, link lengths)
- Executed via: [[whole-body-control]] (WBC provides balance during playback)
- Stabilized by: [[push-recovery-balance]] (perturbation robustness during execution)
- Trained in: [[simulation]] (RL tracking policies trained in MuJoCo/Isaac)
- Training methods: [[learning-and-ai]] (RL, imitation learning frameworks)
- Bounded by: [[equations-and-bounds]] (joint limits, torque limits for feasibility)