---
id: motion-retargeting
title: "Motion Capture & Retargeting"
status: established
source_sections: "reference/sources/paper-bfm-zero.md, reference/sources/paper-h2o.md, reference/sources/paper-omnih2o.md, reference/sources/paper-humanplus.md, reference/sources/dataset-amass-g1.md, reference/sources/github-groot-wbc.md, reference/sources/community-mocap-retarget-tools.md"
related_topics: [whole-body-control, joint-configuration, simulation, learning-and-ai, equations-and-bounds, push-recovery-balance]
key_equations: [inverse_kinematics, kinematic_scaling]
key_terms: [motion_retargeting, mocap, amass, smpl, kinematic_scaling, inverse_kinematics]
images: []
examples: []
open_questions:
  - "What AMASS motions have been successfully replayed on physical G1?"
  - "What is the end-to-end latency from mocap capture to robot execution?"
  - "Which retargeting approach gives best visual fidelity on G1 (IK vs. RL)?"
  - "Can video-based pose estimation (MediaPipe/OpenPose) provide sufficient accuracy for G1 retargeting?"
---

# Motion Capture & Retargeting

Capturing human motion and replaying it on the G1, including the kinematic mapping problem, data sources, and execution approaches.

## 1. The Retargeting Problem

A human has ~200+ degrees of freedom (skeleton + soft tissue). The G1 has 23-43 DOF.
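The limb-proportion mismatch is typically handled by per-limb kinematic scaling of keypoints before any IK is solved. A minimal sketch (the human limb lengths here are illustrative assumptions; the G1 figures follow this note's ~0.6 m legs and ~0.45 m arms):

```python
import numpy as np

# Per-limb kinematic scaling: shrink human end-effector targets into the G1's
# smaller workspace before solving IK. Ratios below are illustrative:
# ~0.9 m human leg vs. 0.6 m G1 leg, ~0.7 m human arm vs. ~0.45 m G1 arm.
LIMB_SCALE = {
    "foot": 0.6 / 0.9,   # G1 leg length / assumed human leg length
    "hand": 0.45 / 0.7,  # G1 arm length / assumed human arm length
}

def scale_keypoint(p_world, p_root, limb):
    """Scale a keypoint's offset from its kinematic root (pelvis or shoulder)."""
    p_world = np.asarray(p_world, dtype=float)
    p_root = np.asarray(p_root, dtype=float)
    return p_root + LIMB_SCALE[limb] * (p_world - p_root)

# Example: a human foot 0.9 m below the pelvis maps to 0.6 m below the G1 pelvis.
pelvis = [0.0, 0.0, 0.9]
human_foot = [0.0, 0.1, 0.0]
g1_foot = scale_keypoint(human_foot, pelvis, "foot")  # → z = 0.3 m above ground
```

The scaled keypoints then feed the per-frame IK described in §2a; without this step, a G1-sized IK solve against human-sized targets saturates at the workspace boundary.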
Retargeting must solve three mismatches: [T1 — Established robotics problem]

| Mismatch | Human | G1 (29-DOF) | Challenge |
|---|---|---|---|
| DOF count | ~200+ | 29 | Many human motions have no G1 equivalent |
| Limb proportions | Variable | Fixed (1.32m height, 0.6m legs, ~0.45m arms) | Workspace scaling needed |
| Joint ranges | Very flexible | Constrained (e.g., knee 0-165°, hip pitch ±154°) | Motions may exceed limits |
| Dynamics | ~70kg average | ~35kg, different mass distribution | Forces/torques don't scale linearly |

### What Works Well on G1

- Walking, standing, stepping motions
- Upper-body gestures (waving, pointing, reaching)
- Pick-and-place style manipulation
- Simple dance or expressive motions

### What's Difficult or Impossible

- Motions requiring finger dexterity (without hands attached)
- Deep squats or ground-level motions (joint limit violations)
- Fast acrobatic motions (torque/speed limits)
- Motions requiring more DOF than available (e.g., spine articulation with 1-DOF waist)

## 2. Retargeting Approaches

### 2a. IK-Based Retargeting (Classical)

Solve inverse kinematics to map human end-effector positions to G1 joint angles: [T1]

```
Pipeline:
Mocap data (human skeleton)
  → Extract key points (hands, feet, head, pelvis)
  → Scale to G1 proportions
  → Solve IK per frame
  → Smooth trajectory
  → Check joint limits
  → Execute or reject
```

**Tools:**

- **Pinocchio:** C++/Python rigid body dynamics with fast IK solver (see [[whole-body-control]])
- **MuJoCo IK:** Built-in inverse kinematics in MuJoCo simulator
- **Drake:** MIT's robotics toolbox with optimization-based IK
- **IKPy / ikflow:** Lightweight Python IK libraries

**Pros:** Fast, interpretable, no training required, deterministic
**Cons:** Frame-by-frame IK can produce jerky motions, doesn't account for dynamics/balance, may violate torque limits even if joint limits are satisfied

### 2b. Optimization-Based Retargeting

Solve a trajectory optimization over the full motion: [T1]

```
minimize    Σ_t || FK(q_t) - x_human_t ||^2   (tracking error)
          + Σ_t || q_t - q_{t-1} ||^2         (smoothness)
subject to  q_min ≤ q_t ≤ q_max               (joint limits)
            CoM_t ∈ support_polygon_t         (balance)
            || tau_t || ≤ tau_max             (torque limits)
            no self-collision                 (collision avoidance)
```

**Tools:** CasADi, Pinocchio + ProxQP, Drake, Crocoddyl

**Pros:** Globally smooth, respects all constraints, can enforce balance
**Cons:** Slow (offline only), requires an accurate dynamics model, complex problem formulation

### 2c. RL-Based Motion Tracking (Recommended for G1)

Train an RL policy that imitates reference motions while maintaining balance: [T1 — Multiple papers validated on G1]

```
Pipeline:
Mocap data
  → Retarget to G1 skeleton (rough IK)
  → Use as reference
  → Train RL policy in sim: reward = tracking + balance + energy
  → Deploy on real G1 via sim-to-real transfer
```

This is the approach used by BFM-Zero, H2O, OmniH2O, and HumanPlus. The RL policy learns to:

- Track the reference motion as closely as possible
- Maintain balance even when the reference motion would be unstable
- Respect joint and torque limits naturally (they're part of the sim environment)
- Recover from perturbations (if trained with a perturbation curriculum)

**Key advantage:** Balance is baked into the policy — you don't need a separate balance controller.

### Key RL Motion Tracking Frameworks

| Framework | Paper | G1 Validated? | Key Feature |
|---|---|---|---|
| BFM-Zero | arXiv:2511.04131 | Yes | Zero-shot generalization to unseen motions, open-source |
| H2O | arXiv:2403.01623 | On humanoid (not G1 specifically) | Real-time teleoperation |
| OmniH2O | arXiv:2406.08858 | On humanoid | Multi-modal input (VR, RGB, mocap) |
| HumanPlus | arXiv:2406.10454 | On humanoid | RGB camera → shadow → imitate |
| GMT | Generic Motion Tracking | In sim | Tracks diverse AMASS motions |

### 2d. Hybrid Approach: IK + WBC

Use IK for the upper body, WBC for balance: [T1 — GR00T-WBC approach]

```
Mocap data
  → IK retarget (upper body only: arms, waist)
  → Feed to GR00T-WBC as upper-body targets
  → WBC locomotion policy handles legs/balance automatically
  → Execute on G1
```

This is likely the most practical near-term approach for the G1, using GR00T-WBC as the coordination layer. See [[whole-body-control]] for details.

## 3. Motion Capture Sources

### 3a. AMASS — Archive of Motion Capture as Surface Shapes

The largest publicly available human motion dataset: [T1]

| Property | Value |
|---|---|
| Motions | 11,000+ sequences from 15 mocap datasets |
| Format | SMPL body model parameters |
| G1 retarget | Available on HuggingFace (unitree) — pre-retargeted |
| License | Research use (check individual sub-datasets) |

**G1-specific:** Unitree has published AMASS motions retargeted to the G1 skeleton on HuggingFace. This provides ready-to-use reference trajectories for RL training or direct playback.

### 3b. CMU Motion Capture Database

Classic academic motion capture archive: [T1]

| Property | Value |
|---|---|
| Subjects | 144 subjects |
| Motions | 2,500+ sequences |
| Categories | Walking, running, sports, dance, interaction, etc. |
| Formats | BVH, C3D, ASF+AMC |
| License | Free for research |
| URL | mocap.cs.cmu.edu |

### 3c. Real-Time Sources (Live Mocap)

| Source | Device | Latency | Accuracy | G1 Integration |
|---|---|---|---|---|
| XR Teleoperate | Vision Pro, Quest 3, PICO 4 | Low (~50ms) | High (VR tracking) | Official (unitreerobotics/xr_teleoperate) |
| Kinect | Azure Kinect DK | Medium (~100ms) | Medium | Official (kinect_teleoperate) |
| MediaPipe | RGB camera | Low (~30ms) | Low-Medium | Community, needs retarget code |
| OpenPose | RGB camera | Medium | Medium | Community, needs retarget code |
| OptiTrack/Vicon | Marker-based system | Very low (~5ms) | Very high | Custom integration needed |

For real-time mocap → robot execution, the XR teleoperation system is the most direct path; AMASS provides offline motion libraries.

### 3d. Video-Based Pose Estimation

Extract human pose from standard RGB video without mocap hardware: [T2]

- **MediaPipe Pose:** 33 landmarks, real-time on CPU, Google
- **OpenPose:** 25 body keypoints, GPU required
- **HMR2.0 / 4DHumans:** SMPL mesh recovery from single image — richer than keypoints
- **MotionBERT:** Temporal pose estimation from video sequences

These are lower fidelity than marker-based mocap but require only a webcam. HumanPlus (arXiv:2406.10454) uses RGB camera input specifically for humanoid shadowing.

## 4. The Retargeting Pipeline

End-to-end pipeline from human motion to G1 execution:

```
┌───────────────┐     ┌────────────────┐     ┌────────────────┐
│ Motion        │     │ Skeleton       │     │ Kinematic      │
│ Source        │────►│ Extraction     │────►│ Retargeting    │
│ (mocap/video) │     │ (SMPL/joints)  │     │ (scale + IK)   │
└───────────────┘     └────────────────┘     └───────┬────────┘
                                                     │
                                                     ▼
┌───────────────┐     ┌────────────────┐     ┌────────────────┐
│ Execute on    │     │ WBC / RL       │     │ Feasibility    │
│ Real G1       │◄────│ Policy         │◄────│ Check          │
│ (sdk2)        │     │ (balance +     │     │ (joint limits, │
└───────────────┘     │ tracking)      │     │ stability)     │
                      └────────────────┘     └────────────────┘
```

### Step 1: Motion Source

- Offline: AMASS dataset, CMU mocap, recorded demonstrations
- Real-time: XR headset, Kinect, RGB camera

### Step 2: Skeleton Extraction

- AMASS: Already in SMPL format, extract joint angles
- BVH/C3D: Parse standard mocap formats
- Video: Run pose estimator (MediaPipe, OpenPose, HMR2.0)
- Output: Human joint positions/rotations per frame

### Step 3: Kinematic Retargeting

- Map human skeleton to G1 skeleton (limb length scaling)
- Solve IK for each frame or use direct joint angle mapping
- Handle DOF mismatch (project higher-DOF human motion to G1 subspace)
- Clamp to G1 joint limits (see [[equations-and-bounds]])

### Step 4: Feasibility Check

- Verify all joint angles are within limits
- Check CoM remains within the support polygon (static stability)
- Estimate required torques (inverse dynamics) — reject if exceeding actuator limits
- Check for self-collisions

### Step 5: Execution Policy

- **Direct playback:** Send retargeted joint angles via rt/lowcmd (no balance guarantee)
- **WBC execution:** Feed to GR00T-WBC as upper-body targets, let the locomotion policy handle balance
- **RL tracking:** Use a trained motion tracking policy (BFM-Zero style) that simultaneously tracks and balances

### Step 6: Deploy on Real G1

- Via unitree_sdk2_python (prototyping) or unitree_sdk2 C++ (production)
- 500 Hz control loop, 2ms DDS latency
- Always validate in simulation first (see [[simulation]])

## 5. SMPL Body Model

SMPL (Skinned Multi-Person Linear model) is the standard representation for human body shape and pose in mocap datasets: [T1]

- **Parameters:** 72 pose parameters (24 joints × 3 rotations) + 10 shape parameters
- **Output:** 6,890-vertex mesh + joint locations
- **Extensions:** SMPL-X (hands + face), SMPL+H (hands)
- **Relevance:** AMASS uses SMPL, so retargeting from AMASS means mapping SMPL joints → G1 joints

### SMPL to G1 Joint Mapping (Approximate)

| SMPL Joint | G1 Joint(s) | Notes |
|---|---|---|
| Pelvis | Waist (yaw) | G1 has 1-3 waist DOF vs. SMPL's 3 |
| L/R Hip | left/right_hip_pitch/roll/yaw | Direct mapping, 3-DOF each |
| L/R Knee | left/right_knee | Direct mapping, 1-DOF |
| L/R Ankle | left/right_ankle_pitch/roll | Direct mapping, 2-DOF |
| L/R Shoulder | left/right_shoulder_pitch/roll/yaw | Direct mapping, 3-DOF |
| L/R Elbow | left/right_elbow | Direct mapping, 1-DOF |
| L/R Wrist | left/right_wrist_yaw(+pitch+roll) | 1-DOF (23-DOF) or 3-DOF (29-DOF) |
| Spine | Waist (limited) | SMPL has 3 spine joints, G1 has 1-3 waist |
| Head/Neck | — | G1 has no head/neck DOF |
| Fingers | Hand joints (if equipped) | Only with Dex3-1 or INSPIRE |

## 6. Key Software & Repositories

| Tool | Purpose | Language | License |
|---|---|---|---|
| GR00T-WBC | End-to-end WBC + retargeting for G1 | Python/C++ | Apache 2.0 |
| Pinocchio | Rigid body dynamics, IK, Jacobians | C++/Python | BSD-2 |
| xr_teleoperate | Real-time VR mocap → G1 | Python | Unitree |
| unitree_mujoco | Simulate retargeted motions | C++/Python | BSD-3 |
| smplx (Python) | SMPL body model processing | Python | MIT |
| rofunc | Robot learning from human demos + retargeting | Python | MIT |
| MuJoCo Menagerie | G1 model (g1.xml) for IK/simulation | MJCF | BSD-3 |

## 7. Apple Vision Pro Telepresence Paths (Researched 2026-02-15) [T1/T2]

### Available Integration Options

| Path | Approach | App Required? | GR00T-WBC Compatible? | Retargeting |
|------|----------|:---:|:---:|---|
| xr_teleoperate | WebXR via Safari | No (browser) | No (uses stock SDK) | Pinocchio IK |
| VisionProTeleop | Native visionOS app | Yes (App Store / open-source) | Yes (via bridge) | Custom (flexible) |
| iPhone streamer | Socket.IO protocol | Custom visionOS app | Yes (built-in) | Pinocchio IK in GR00T-WBC |

### xr_teleoperate (Unitree Official)

- Vision Pro connects via Safari to `https://:8012` (WebXR)
- TeleVuer (Python, built on Vuer) serves the 3D interface
- WebSocket for tracking data, WebRTC for video feedback
- Pinocchio IK solves wrist poses → G1 arm joint angles
- Supports G1_29 and G1_23 variants
- **Limitation:** Bypasses GR00T-WBC — sends motor commands directly via DDS

### VisionProTeleop (MIT, Open-Source)

- Native visionOS app "Tracking Streamer" — on App Store + source on GitHub
- Python library `avp_stream` receives data via gRPC
- 25 finger joints/hand, head pose, wrist positions (native ARKit, better than WebXR)
- Robot-agnostic — needs a bridge to publish to GR00T-WBC's `ControlPolicy/upper_body_pose` ROS2 topic
- **Best path for GR00T-WBC integration with RL-based balance**

### GR00T-WBC Integration Point

The single integration point is the `ControlPolicy/upper_body_pose` ROS2 topic. Any source that publishes `target_upper_body_pose` (17 joint angles: 3 waist + 7 left arm + 7 right arm) and optionally `navigate_cmd` (velocity `[vx, vy, wz]`) can drive the robot. The `InterpolationPolicy` smooths targets before execution.
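A bridge only needs to produce that 17-value vector in the layout above. A minimal sketch of the packing step (the message type used on the wire, e.g. `Float64MultiArray`, and exact field names are assumptions to verify against the GR00T-WBC repository):

```python
import math

# Pack an upper-body target for GR00T-WBC's ControlPolicy/upper_body_pose topic.
# Layout (3 waist + 7 left arm + 7 right arm = 17 values) follows this note.
N_WAIST, N_ARM = 3, 7

def pack_upper_body_pose(waist, left_arm, right_arm):
    """Concatenate joint targets into the 17-value vector, with sanity checks."""
    if len(waist) != N_WAIST or len(left_arm) != N_ARM or len(right_arm) != N_ARM:
        raise ValueError("expected 3 waist + 7 left-arm + 7 right-arm angles")
    pose = list(waist) + list(left_arm) + list(right_arm)
    if any(not math.isfinite(q) for q in pose):
        raise ValueError("non-finite joint target")
    return pose

# Example: neutral waist, left shoulder pitched 0.5 rad, right arm at zero.
pose = pack_upper_body_pose([0.0] * 3, [0.5] + [0.0] * 6, [0.0] * 7)
# A ROS2 node would publish `pose` at the teleoperation rate; the
# InterpolationPolicy then smooths it before execution.
```

Validating here (length, finiteness) keeps malformed teleoperation frames from ever reaching the interpolation layer.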
## Key Relationships

- Requires: [[joint-configuration]] (target skeleton — DOF, joint limits, link lengths)
- Executed via: [[whole-body-control]] (WBC provides balance during playback)
- Stabilized by: [[push-recovery-balance]] (perturbation robustness during execution)
- Trained in: [[simulation]] (RL tracking policies trained in MuJoCo/Isaac)
- Training methods: [[learning-and-ai]] (RL, imitation learning frameworks)
- Bounded by: [[equations-and-bounds]] (joint limits, torque limits for feasibility)