Now You See That: Learning End-to-End Humanoid Locomotion from Raw Pixels
Source: https://arxiv.org/abs/2602.06382 | Fetched: 2026-02-13 | Type: Research Paper
Paper Information
- arXiv ID: 2602.06382
- Authors: Wandong Sun, Yongbo Su, Leoric Huang, Alex Zhang, Dwyane Wei, Mu San, Daniel Tian, Ellie Cao, Finn Yan, Ethan Xie, Zongwu Xie
- Submission Date: February 6, 2026
Abstract
The researchers present "an end-to-end framework for vision-driven humanoid locomotion" that addresses two key challenges: perception noise that complicates sim-to-real transfer, and conflicting learning objectives across diverse terrains.
Core Contribution
This paper proposes an end-to-end approach to humanoid locomotion that operates directly on raw depth pixels, eliminating the need for separate perception and control modules.
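As a rough illustration of what an end-to-end pixels-to-actions policy can look like, the sketch below fuses a small convolutional depth encoder with proprioceptive state and outputs joint targets. The layer sizes, input shapes, and module names are assumptions for illustration only; the paper's actual architecture is not reproduced here.

```python
# Minimal sketch of an end-to-end depth-to-actions policy.
# Layer sizes, input shapes, and names are illustrative assumptions,
# not the architecture reported in the paper.
import torch
import torch.nn as nn

class DepthLocomotionPolicy(nn.Module):
    def __init__(self, proprio_dim=48, num_joints=12, latent_dim=64):
        super().__init__()
        # Small CNN over a single-channel depth image (e.g. 64x64).
        self.depth_encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=2), nn.ELU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ELU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2), nn.ELU(),
            nn.Flatten(),
            nn.LazyLinear(latent_dim), nn.ELU(),
        )
        # MLP head fuses the visual latent with proprioception.
        self.actor = nn.Sequential(
            nn.Linear(latent_dim + proprio_dim, 256), nn.ELU(),
            nn.Linear(256, 128), nn.ELU(),
            nn.Linear(128, num_joints),  # joint position targets
        )

    def forward(self, depth, proprio):
        z = self.depth_encoder(depth)                       # (B, latent_dim)
        return self.actor(torch.cat([z, proprio], dim=-1))

policy = DepthLocomotionPolicy()
actions = policy(torch.zeros(1, 1, 64, 64), torch.zeros(1, 48))
```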
Technical Approach
Perception Realism
The team developed high-fidelity depth simulation capturing "stereo matching artifacts and calibration uncertainties inherent in real-world sensing."
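The summary does not spell out the noise model, but the kind of corruption such a simulation targets can be sketched as follows: stereo-style holes (invalid pixels), range-dependent noise, and a small per-frame calibration offset applied to a clean rendered depth map. All probabilities and magnitudes below are hypothetical.

```python
# Hypothetical depth-corruption sketch: inject artifacts resembling real
# stereo depth (holes, range-dependent noise, calibration offset).
# The paper's actual noise model is not reproduced here.
import numpy as np

def corrupt_depth(depth_m, rng, hole_prob=0.03, calib_std_m=0.01):
    """depth_m: (H, W) clean rendered depth in meters."""
    noisy = depth_m.copy()

    # Stereo depth error grows roughly with the square of distance.
    noisy += rng.normal(0.0, 0.001, size=depth_m.shape) * depth_m ** 2

    # Stereo-matching holes: random invalid pixels report zero (no return).
    holes = rng.random(depth_m.shape) < hole_prob
    noisy[holes] = 0.0

    # Calibration uncertainty: a small constant bias per frame.
    noisy += rng.normal(0.0, calib_std_m)

    return np.clip(noisy, 0.0, None)

rng = np.random.default_rng(0)
clean = np.full((64, 64), 2.0)     # flat surface 2 m away
noisy = corrupt_depth(clean, rng)
```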
Knowledge Transfer: Vision-Aware Behavior Distillation
They propose "vision-aware behavior distillation" combining latent space alignment with noise-invariant auxiliary tasks to transfer knowledge from privileged height maps to noisy depth observations.
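A minimal sketch of how such an objective could be assembled is shown below, assuming a frozen teacher that observes privileged height maps and a student that observes noisy depth: the student's latent is aligned with the teacher's, its actions imitate the teacher's, and a noise-invariance term pushes two differently corrupted views of the same frame toward the same latent. The agent interface, loss weights, and observation sizes are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of a vision-aware distillation objective: align the student's depth
# latent with a privileged height-map teacher, imitate the teacher's actions,
# and make two differently corrupted views of the same frame map to the same
# latent (noise invariance). All names and weights are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyAgent(nn.Module):
    """Stub agent with an encoder and an action head (illustrative only)."""
    def __init__(self, obs_dim, proprio_dim=48, latent_dim=64, num_joints=12):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ELU(),
                                     nn.Linear(128, latent_dim))
        self.head = nn.Sequential(nn.Linear(latent_dim + proprio_dim, 128),
                                  nn.ELU(), nn.Linear(128, num_joints))

    def encode(self, obs):
        return self.encoder(obs)

    def act(self, z, proprio):
        return self.head(torch.cat([z, proprio], dim=-1))

def distillation_loss(student, teacher, depth_a, depth_b, height_map, proprio,
                      w_latent=1.0, w_action=1.0, w_invariance=0.5):
    with torch.no_grad():                          # teacher is frozen
        z_teacher = teacher.encode(height_map)     # privileged latent
        a_teacher = teacher.act(z_teacher, proprio)

    z_a = student.encode(depth_a)                  # noisy depth, corruption 1
    z_b = student.encode(depth_b)                  # same frame, corruption 2
    a_student = student.act(z_a, proprio)

    loss_latent = F.mse_loss(z_a, z_teacher)        # latent-space alignment
    loss_action = F.mse_loss(a_student, a_teacher)  # behavior cloning
    loss_invariance = F.mse_loss(z_a, z_b)          # noise-invariant auxiliary

    return (w_latent * loss_latent + w_action * loss_action
            + w_invariance * loss_invariance)

teacher = TinyAgent(obs_dim=187)   # e.g. flattened height-map scan
student = TinyAgent(obs_dim=4096)  # e.g. flattened 64x64 depth image
```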
Terrain Versatility
The approach integrates "terrain-specific reward shaping" with multi-critic and multi-discriminator learning to handle distinct dynamics across different terrain types.
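To make the multi-critic idea concrete, the sketch below keeps one value head per terrain-specific reward group and normalizes each group's advantage separately before combining them, so that no single terrain's shaping reward dominates the policy gradient. The group names and the combination rule are assumptions for illustration.

```python
# Sketch of multi-critic value estimation: one value head per terrain-specific
# reward group, with per-group advantage normalization before summation so no
# single reward group dominates. Group names and the combination rule are
# illustrative assumptions.
import torch
import torch.nn as nn

class MultiCritic(nn.Module):
    def __init__(self, obs_dim, groups=("flat", "stairs", "gaps")):
        super().__init__()
        self.groups = groups
        self.heads = nn.ModuleDict({
            g: nn.Sequential(nn.Linear(obs_dim, 256), nn.ELU(),
                             nn.Linear(256, 1))
            for g in groups
        })

    def values(self, obs):
        # One scalar value estimate per reward group.
        return {g: self.heads[g](obs).squeeze(-1) for g in self.groups}

def combined_advantage(per_group_advantages):
    """per_group_advantages: dict of same-shaped advantage tensors."""
    total = 0.0
    for adv in per_group_advantages.values():
        # Normalizing each group separately keeps terrains with large
        # shaping rewards from drowning out the others.
        total = total + (adv - adv.mean()) / (adv.std() + 1e-8)
    return total
```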
Validation
The policy was tested on humanoid platforms equipped with stereo depth cameras, demonstrating capability on both extreme challenges (high platforms, wide gaps) and fine-grained tasks such as bidirectional staircase traversal.
Significance
This work advances vision-based locomotion by directly bridging the sim-to-real gap for depth-based perception, enabling humanoid robots to traverse challenging terrains without hand-crafted perception pipelines.