Under Review

Towards Physical Intent-Driven General Motion Imitation in Humanoid Teleoperation

Anonymous Authors


Teleoperation with Sparse IMU-based MoCap

All videos are real-time teleoperation with no speed-up.

Teleoperated Expressive Motions

System Overview

Dynamic Motions

Balancing

Squat

"Catch"

Tennis

Lightweight and Camera-free

Boxing in the Dark

Behind Occlusion 1

Behind Occlusion 2

Table 1. Comparison of MoCap systems for humanoid teleoperation. Among the compared systems, only our sparse IMU setup is portable, lightweight, lighting-independent, and usable in an unbounded area at the same time.

MoCap Source	Portable	Lightweight	Unbounded Area	Lighting Independent	Cost
Optical Marker	✘	✘	✘	✘	$50k–100k+
RGB Camera	✔	✔	✘	✘	$100–250+
VR System	✔	✘	✘	✘	$1k–4.5k+
Dense IMU	✔	✘	✔	✔	$3k–12k+
Sparse IMU (Ours)	✔	✔	✔	✔	$200–1000+

Sim2Sim Validation

Miserable

Sad

Locomotion

Abstract

Current physics-based humanoid teleoperation frameworks predominantly treat the task as superficial kinematic mimicry and attempt to track reference joint positions strictly. However, the best this paradigm can achieve is an exact reproduction of the reference motion, and that reference itself carries retargeting artifacts. We argue that the ultimate goal of human-to-robot teleoperation is to transfer semantic intent across morphological gaps rather than to blindly replicate joint trajectories. Making this shift enables high-quality teleoperation without perfectly clean kinematic data.

While recent methods attempt to learn implicit intent through massive model scaling or token embeddings, they remain computationally heavy and indirect. In this paper, we address this gap by introducing a lightweight, RL-level framework that explicitly shifts the imitation learning paradigm from kinematic tracking to intent decoding. We approach this from two angles: (1) reformulating the imitation objective to prioritize physical intent over joint fidelity, and (2) introducing structured data degradation that forces the policy to recover the underlying intent from corrupted inputs. Experiments show that these changes, made purely at the RL level, yield large performance gains. Furthermore, to validate this paradigm shift under extreme conditions, we use our framework to build a minimalist 5-IMU humanoid teleoperation system.
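
As a rough illustration of what structured data degradation of a reference clip could look like, the sketch below corrupts a sequence of joint-angle frames in three ways: masking whole joint channels, adding Gaussian noise, and holding frames. This is not the authors' implementation. The function name `degrade_reference`, the three corruption modes, and every default rate are assumptions chosen for illustration only.

```python
import random


def degrade_reference(frames, drop_prob=0.3, noise_std=0.05,
                      hold_prob=0.1, seed=None):
    """Return (corrupted_frames, dropped_mask) for a reference clip.

    frames: list of frames, each a list of joint angles in radians.
    The corruption modes and default rates are illustrative assumptions,
    not values taken from the paper.
    """
    rng = random.Random(seed)
    num_joints = len(frames[0])
    # Structured dropout: a joint channel is masked for the whole clip,
    # mimicking a body part that has no sensor attached.
    dropped = [rng.random() < drop_prob for _ in range(num_joints)]
    corrupted = []
    for t, frame in enumerate(frames):
        if t > 0 and rng.random() < hold_prob:
            # Frame hold: repeat the previous corrupted frame,
            # mimicking a lost or delayed sensor packet.
            corrupted.append(list(corrupted[-1]))
            continue
        noisy = []
        for j, angle in enumerate(frame):
            if dropped[j]:
                noisy.append(0.0)  # masked channel carries no signal
            else:
                noisy.append(angle + rng.gauss(0.0, noise_std))
        corrupted.append(noisy)
    return corrupted, dropped


# Example: a 10-frame clip with 4 joints.
clip = [[0.1 * t + j for j in range(4)] for t in range(10)]
corrupted_clip, mask = degrade_reference(clip, seed=0)
```

In a training loop of this kind, the policy would observe the corrupted clip while the reward is still computed against the clean reference, so the policy cannot succeed by copying its input.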

Take-away

Fig. 1. (Top) Cross-morphology retargeting inevitably produces flawed, physically infeasible reference motions. (Middle) Prior works that rely on strict kinematic tracking fail catastrophically when forced to mimic these unattainable postures. (Bottom) Our training paradigm conditions the policy to focus on physical intent rather than rigid joint matching, resulting in robust general motion tracking despite reference artifacts.
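
To make the contrast in Fig. 1 concrete, the sketch below blends a joint-fidelity term with task-space terms and weights the latter more heavily. It is a hypothetical example, not the paper's objective. Treating root velocity and hand positions as the "intent" signals, the exponential tracking kernel, and all weights and scales are assumptions made for illustration.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tracking_kernel(error_sq, sigma):
    """Map a squared error to (0, 1]; equals 1 when the error is zero."""
    return math.exp(-error_sq / (sigma ** 2))


def imitation_reward(robot, ref, w_joint=0.2, w_intent=0.8):
    """Weighted blend of joint fidelity and task-space 'intent' terms.

    robot, ref: dicts with keys 'joints', 'root_vel', and 'hands'
    (flat lists of floats). The keys, weights, and sigmas are
    hypothetical choices, not values from the paper.
    """
    r_joint = tracking_kernel(sq_dist(robot["joints"], ref["joints"]), 0.5)
    r_vel = tracking_kernel(sq_dist(robot["root_vel"], ref["root_vel"]), 0.5)
    r_hands = tracking_kernel(sq_dist(robot["hands"], ref["hands"]), 0.2)
    r_intent = 0.5 * (r_vel + r_hands)
    return w_joint * r_joint + w_intent * r_intent


ref = {"joints": [0.0, 0.0], "root_vel": [1.0, 0.0], "hands": [0.3, 0.2, 1.0]}
# Joints deviate from the (possibly infeasible) reference, intent matches.
intent_ok = {"joints": [1.0, 0.0], "root_vel": [1.0, 0.0], "hands": [0.3, 0.2, 1.0]}
# Joints match the reference exactly, but the task-space intent is missed.
joints_ok = {"joints": [0.0, 0.0], "root_vel": [0.0, 0.0], "hands": [0.7, 0.2, 1.0]}
```

Under these assumed weights, `imitation_reward(intent_ok, ref)` is larger than `imitation_reward(joints_ok, ref)`, so a policy is rewarded more for reproducing the task-space behavior than for matching joint angles it may not be able to reach.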

Citation

If you find this work useful, please consider citing it. A BibTeX entry is not yet available.