Now You See That: Learning End-to-End Humanoid Locomotion from Raw Pixels
Source: https://arxiv.org/abs/2602.06382 | Fetched: 2026-02-13 | Type: Research Paper
Paper Information
- arXiv ID: 2602.06382
- Authors: Wandong Sun, Yongbo Su, Leoric Huang, Alex Zhang, Dwyane Wei, Mu San, Daniel Tian, Ellie Cao, Finn Yan, Ethan Xie, Zongwu Xie
- Submission Date: February 6, 2026
Abstract
The researchers present "an end-to-end framework for vision-driven humanoid locomotion" that addresses two key challenges: perception noise that complicates sim-to-real transfer, and conflicting learning objectives across diverse terrains.
Core Contribution
This paper proposes an end-to-end approach to humanoid locomotion that operates directly on raw depth pixels, eliminating the need for separate perception and control modules.
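As a rough illustration of what an end-to-end pixels-to-actions policy can look like, the sketch below fuses a small convolutional depth encoder with proprioceptive state and outputs joint targets. The layer sizes, input shapes, and module names are assumptions for illustration only; the paper's actual architecture is not reproduced here.

```python
# Minimal sketch of an end-to-end depth-to-actions policy.
# Layer sizes, input shapes, and names are illustrative assumptions,
# not the architecture reported in the paper.
import torch
import torch.nn as nn

class DepthLocomotionPolicy(nn.Module):
    def __init__(self, proprio_dim=48, num_joints=12, latent_dim=64):
        super().__init__()
        # Small CNN over a single-channel depth image (e.g. 64x64).
        self.depth_encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=2), nn.ELU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ELU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2), nn.ELU(),
            nn.Flatten(),
            nn.LazyLinear(latent_dim), nn.ELU(),
        )
        # MLP head fuses the visual latent with proprioception.
        self.actor = nn.Sequential(
            nn.Linear(latent_dim + proprio_dim, 256), nn.ELU(),
            nn.Linear(256, 128), nn.ELU(),
            nn.Linear(128, num_joints),  # joint position targets
        )

    def forward(self, depth, proprio):
        z = self.depth_encoder(depth)                       # (B, latent_dim)
        return self.actor(torch.cat([z, proprio], dim=-1))

policy = DepthLocomotionPolicy()
actions = policy(torch.zeros(1, 1, 64, 64), torch.zeros(1, 48))
```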
Technical Approach
Perception Realism
The team developed high-fidelity depth simulation capturing "stereo matching artifacts and calibration uncertainties inherent in real-world sensing."
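The summary does not spell out the noise model, but the kind of corruption such a simulation targets can be sketched as follows: stereo-style holes (invalid pixels), range-dependent noise, and a small per-frame calibration offset applied to a clean rendered depth map. All probabilities and magnitudes below are hypothetical.

```python
# Hypothetical depth-corruption sketch: inject artifacts resembling real
# stereo depth (holes, range-dependent noise, calibration offset).
# The paper's actual noise model is not reproduced here.
import numpy as np

def corrupt_depth(depth_m, rng, hole_prob=0.03, calib_std_m=0.01):
    """depth_m: (H, W) clean rendered depth in meters."""
    noisy = depth_m.copy()

    # Stereo depth error grows roughly with the square of distance.
    noisy += rng.normal(0.0, 0.001, size=depth_m.shape) * depth_m ** 2

    # Stereo-matching holes: random invalid pixels report zero (no return).
    holes = rng.random(depth_m.shape) < hole_prob
    noisy[holes] = 0.0

    # Calibration uncertainty: a small constant bias per frame.
    noisy += rng.normal(0.0, calib_std_m)

    return np.clip(noisy, 0.0, None)

rng = np.random.default_rng(0)
clean = np.full((64, 64), 2.0)     # flat surface 2 m away
noisy = corrupt_depth(clean, rng)
```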
Knowledge Transfer: Vision-Aware Behavior Distillation
They propose "vision-aware behavior distillation" combining latent space alignment with noise-invariant auxiliary tasks to transfer knowledge from privileged height maps to noisy depth observations.
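A minimal sketch of how such an objective could be assembled is shown below, assuming a frozen teacher that observes privileged height maps and a student that observes noisy depth: the student's latent is aligned with the teacher's, its actions imitate the teacher's, and a noise-invariance term pushes two differently corrupted views of the same frame toward the same latent. The agent interface, loss weights, and observation sizes are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of a vision-aware distillation objective: align the student's depth
# latent with a privileged height-map teacher, imitate the teacher's actions,
# and make two differently corrupted views of the same frame map to the same
# latent (noise invariance). All names and weights are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyAgent(nn.Module):
    """Stub agent with an encoder and an action head (illustrative only)."""
    def __init__(self, obs_dim, proprio_dim=48, latent_dim=64, num_joints=12):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ELU(),
                                     nn.Linear(128, latent_dim))
        self.head = nn.Sequential(nn.Linear(latent_dim + proprio_dim, 128),
                                  nn.ELU(), nn.Linear(128, num_joints))

    def encode(self, obs):
        return self.encoder(obs)

    def act(self, z, proprio):
        return self.head(torch.cat([z, proprio], dim=-1))

def distillation_loss(student, teacher, depth_a, depth_b, height_map, proprio,
                      w_latent=1.0, w_action=1.0, w_invariance=0.5):
    with torch.no_grad():                          # teacher is frozen
        z_teacher = teacher.encode(height_map)     # privileged latent
        a_teacher = teacher.act(z_teacher, proprio)

    z_a = student.encode(depth_a)                  # noisy depth, corruption 1
    z_b = student.encode(depth_b)                  # same frame, corruption 2
    a_student = student.act(z_a, proprio)

    loss_latent = F.mse_loss(z_a, z_teacher)        # latent-space alignment
    loss_action = F.mse_loss(a_student, a_teacher)  # behavior cloning
    loss_invariance = F.mse_loss(z_a, z_b)          # noise-invariant auxiliary

    return (w_latent * loss_latent + w_action * loss_action
            + w_invariance * loss_invariance)

teacher = TinyAgent(obs_dim=187)   # e.g. flattened height-map scan
student = TinyAgent(obs_dim=4096)  # e.g. flattened 64x64 depth image
```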
Terrain Versatility
The approach integrates "terrain-specific reward shaping" with multi-critic and multi-discriminator learning to handle distinct dynamics across different terrain types.
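To make the multi-critic idea concrete, the sketch below keeps one value head per terrain-specific reward group and normalizes each group's advantage separately before combining them, so that no single terrain's shaping reward dominates the policy gradient. The group names and the combination rule are assumptions for illustration.

```python
# Sketch of multi-critic value estimation: one value head per terrain-specific
# reward group, with per-group advantage normalization before summation so no
# single reward group dominates. Group names and the combination rule are
# illustrative assumptions.
import torch
import torch.nn as nn

class MultiCritic(nn.Module):
    def __init__(self, obs_dim, groups=("flat", "stairs", "gaps")):
        super().__init__()
        self.groups = groups
        self.heads = nn.ModuleDict({
            g: nn.Sequential(nn.Linear(obs_dim, 256), nn.ELU(),
                             nn.Linear(256, 1))
            for g in groups
        })

    def values(self, obs):
        # One scalar value estimate per reward group.
        return {g: self.heads[g](obs).squeeze(-1) for g in self.groups}

def combined_advantage(per_group_advantages):
    """per_group_advantages: dict of same-shaped advantage tensors."""
    total = 0.0
    for adv in per_group_advantages.values():
        # Normalizing each group separately keeps terrains with large
        # shaping rewards from drowning out the others.
        total = total + (adv - adv.mean()) / (adv.std() + 1e-8)
    return total
```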
Validation
The policy was tested on humanoid platforms equipped with stereo depth cameras, demonstrating capability on both extreme challenges (high platforms, wide gaps) and fine-grained tasks such as bidirectional staircase traversal.
Significance
This work advances vision-based locomotion by directly bridging the sim-to-real gap for depth-based perception, enabling humanoid robots to traverse challenging terrains without hand-crafted perception pipelines.