Now You See That: Learning End-to-End Humanoid Locomotion from Raw Pixels

Source: https://arxiv.org/abs/2602.06382
Fetched: 2026-02-13
Type: Research Paper


Paper Information

  • arXiv ID: 2602.06382
  • Authors: Wandong Sun, Yongbo Su, Leoric Huang, Alex Zhang, Dwyane Wei, Mu San, Daniel Tian, Ellie Cao, Finn Yan, Ethan Xie, Zongwu Xie
  • Submission Date: February 6, 2026

Abstract

The researchers present "an end-to-end framework for vision-driven humanoid locomotion" addressing two key challenges: perception noise from sim-to-real transfer and conflicting learning objectives across diverse terrains.

Core Contribution

This paper proposes an end-to-end approach for humanoid locomotion that operates directly from raw depth pixel input, eliminating the need for separate perception and control modules.

Technical Approach

Perception Realism

The team developed high-fidelity depth simulation capturing "stereo matching artifacts and calibration uncertainties inherent in real-world sensing."
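To make the idea of stereo-style depth corruption concrete, here is a minimal sketch in plain Python. It is not the paper's simulator: the function name `simulate_stereo_depth`, the noise model (Gaussian disparity error, 1/8-pixel quantization, a per-frame baseline error, random dropout), and every default value are illustrative assumptions. It relies only on the standard rectified-stereo relation d = f * B / Z between disparity, focal length, baseline, and depth.

```python
import random

def simulate_stereo_depth(depth_rows, focal_px=400.0, baseline_m=0.05,
                          disparity_noise_px=0.25, baseline_jitter=0.02,
                          dropout_prob=0.05, rng=None):
    """Corrupt a clean depth image (rows of positive depths in metres).

    Every default here is a hypothetical value chosen for illustration,
    not a parameter reported in the paper.
    """
    rng = rng or random.Random()
    # Calibration uncertainty (assumed model): the baseline used to decode
    # depth is off by a small relative error, drawn once per frame.
    decoded_baseline = baseline_m * (1.0 + rng.gauss(0.0, baseline_jitter))
    noisy_rows = []
    for row in depth_rows:
        noisy_row = []
        for z in row:
            # Ideal disparity of a rectified stereo pair: d = f * B / Z.
            disparity = focal_px * baseline_m / z
            # Matching error (assumed Gaussian), then 1/8-pixel quantization.
            disparity += rng.gauss(0.0, disparity_noise_px)
            disparity = round(disparity * 8.0) / 8.0
            # Failed matches are reported as 0.0 (a hole in the image).
            failed = rng.random() < dropout_prob
            if failed or disparity <= 0.0:
                noisy_row.append(0.0)
            else:
                noisy_row.append(focal_px * decoded_baseline / disparity)
        noisy_rows.append(noisy_row)
    return noisy_rows
```

Because the noise is applied in disparity space, the resulting depth error grows with distance, which is one characteristic of real stereo sensors that additive depth noise would miss.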

Knowledge Transfer: Vision-Aware Behavior Distillation

They propose "vision-aware behavior distillation" combining latent space alignment with noise-invariant auxiliary tasks to transfer knowledge from privileged height maps to noisy depth observations.
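One way to picture how latent alignment and an auxiliary task could be combined into a single training objective is the sketch below. It is an assumption-laden illustration, not the paper's loss: the action-imitation term, the choice of mean squared error, the idea that the auxiliary head predicts some noise-free target (for example a height-map patch), and the weights `w_latent`, `w_action`, and `w_aux` are all hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences of floats."""
    assert len(a) == len(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(student_latent, teacher_latent,
                      student_action, teacher_action,
                      aux_prediction, aux_target,
                      w_latent=1.0, w_action=1.0, w_aux=0.5):
    """Weighted sum of three distillation terms (weights are hypothetical).

    - latent: pulls the depth-based student's latent toward the latent of
      the teacher, which sees privileged height maps.
    - action: behavior cloning of the teacher's action (an assumed term).
    - aux:    auxiliary prediction of a noise-free target from the student
              latent (the concrete target is an assumption here).
    """
    terms = {
        "latent": mse(student_latent, teacher_latent),
        "action": mse(student_action, teacher_action),
        "aux": mse(aux_prediction, aux_target),
    }
    total = (w_latent * terms["latent"]
             + w_action * terms["action"]
             + w_aux * terms["aux"])
    return total, terms
```

Returning the individual terms alongside the total makes it easy to log each component and see which one dominates during training.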

Terrain Versatility

The approach integrates "terrain-specific reward shaping" with multi-critic and multi-discriminator learning to handle distinct dynamics across different terrain types.
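The multi-critic idea can be sketched as follows, under stated assumptions: each reward group gets its own value estimates, advantages are computed per group with generalized advantage estimation (GAE), normalized separately, and then combined with weights. The grouping of rewards, the weights, and the helper names are hypothetical, the multi-discriminator part is omitted, and the trajectory is assumed to be a single segment with no episode termination inside it.

```python
import statistics

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one non-terminating segment."""
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

def normalize(xs, eps=1e-8):
    """Zero-mean, unit-variance scaling so no reward group dominates."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)
    return [(x - mean) / (std + eps) for x in xs]

def combine_advantages(groups, weights):
    """Combine per-critic advantages into one policy-gradient signal.

    groups:  name -> (rewards, values, last_value), one entry per critic
    weights: name -> float (hypothetical per-group weights)
    """
    horizon = len(next(iter(groups.values()))[0])
    total = [0.0] * horizon
    for name, (rewards, values, last_value) in groups.items():
        adv = normalize(gae(rewards, values, last_value))
        for t in range(horizon):
            total[t] += weights[name] * adv[t]
    return total
```

Normalizing each group before summing is the design choice that matters here: a reward term with a large raw scale then contributes no more to the policy update than one with a small scale, unless its weight says so.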

Validation

The policy was tested on humanoid platforms equipped with stereo depth cameras. It handled both extreme obstacles (high platforms, wide gaps) and fine-grained tasks such as bidirectional staircase traversal.

Significance

This work advances vision-based locomotion by directly bridging the sim-to-real gap for depth-based perception, enabling humanoid robots to traverse challenging terrains without hand-crafted perception pipelines.