Towards Physical Intent-Driven General Motion Imitation in Humanoid Teleoperation
Teleoperation with Sparse IMU-based MoCap
All videos are real-time teleoperation with no speed-up.
Teleoperated Expressive Motions
System Overview
Dynamic Motions
Balancing
Squat
"Catch"
Tennis
Lightweight and Camera-free
Boxing in the Dark
Behind Occlusion 1
Behind Occlusion 2
| MoCap Source | Portable | Lightweight | Unbounded Area | Lighting Independent | Cost |
|---|---|---|---|---|---|
| Optical Marker | ✘ | ✘ | ✘ | ✘ | $50k–100k+ |
| RGB Camera | ✔ | ✔ | ✘ | ✘ | $100–250+ |
| VR System | ✔ | ✘ | ✘ | ✘ | $1k–4.5k+ |
| Dense IMU | ✔ | ✘ | ✔ | ✔ | $3k–12k+ |
| Sparse IMU (Ours) | ✔ | ✔ | ✔ | ✔ | $200–1000+ |
Sim2Sim Validation
Miserable
Sad
Locomotion
Abstract
Current physics-based humanoid teleoperation frameworks predominantly treat the task as superficial kinematic mimicry, attempting to strictly track reference joint positions. However, the theoretical ceiling of this paradigm is merely reproducing the exact reference motion, which inherently suffers from retargeting artifacts. We argue that the ultimate goal of human-to-robot teleoperation is to transfer semantic intent across morphological gaps, rather than blindly replicating joint trajectories. Achieving this paradigm shift unlocks high-quality teleoperation without perfectly clean kinematic data.
While recent methods attempt to learn implicit intent through massive model scaling or token embeddings, they remain computationally heavy and indirect. In this paper, we bridge this gap by introducing a lightweight, RL-level framework that explicitly shifts the imitation learning paradigm from kinematic tracking to intent decoding. We approach this from two angles: (1) reconstructing the imitation objective to prioritize physical intent over joint fidelity, and (2) introducing structured data degradation to force the policy to learn the underlying intent from corrupted inputs. Experiments show that our method yields substantial performance gains purely at the RL level. Furthermore, to validate this paradigm shift under extreme conditions, we deploy our framework to build a minimalist 5-IMU humanoid teleoperation system.
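As a rough illustration of what "structured data degradation" could look like in practice, the sketch below corrupts a reference motion clip before it is fed to a policy. This is not the authors' implementation: the helper name `degrade_reference`, the parameters `noise_std` and `mask_prob`, and the specific corruption scheme (dropping whole joint channels for the entire clip, plus Gaussian noise on the rest) are all assumptions chosen for illustration.

```python
import random
from typing import List, Optional, Sequence

Frame = List[Optional[float]]


def degrade_reference(
    frames: Sequence[Sequence[float]],
    noise_std: float = 0.05,
    mask_prob: float = 0.2,
    seed: Optional[int] = None,
) -> List[Frame]:
    """Corrupt a reference clip (hypothetical helper, not the paper's code).

    frames: one sequence of joint values per time step, all the same length.
    Each joint channel is masked (set to None) for the whole clip with
    probability mask_prob, loosely mimicking a missing sensor. Unmasked
    values receive zero-mean Gaussian noise with standard deviation noise_std.
    """
    rng = random.Random(seed)
    if not frames:
        return []

    # Decide once per clip which joint channels are dropped, so the
    # corruption is correlated over time rather than independent per frame.
    num_joints = len(frames[0])
    masked = [rng.random() < mask_prob for _ in range(num_joints)]

    degraded: List[Frame] = []
    for frame in frames:
        degraded.append([
            None if masked[j] else value + rng.gauss(0.0, noise_std)
            for j, value in enumerate(frame)
        ])
    return degraded
```

A policy trained on such inputs cannot rely on exact joint targets, which is one plausible way to push it toward the underlying intent; the actual degradation used in the paper may differ.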
Citation
If you find this work useful, please consider citing: