Human2LocoMan: Learning Versatile Quadrupedal Manipulation with Human Pre-Training

Yaru Niu1,*, Yunzhe Zhang1,*, Mingyang Yu1, Changyi Lin1, Chenhao Li1, Yikai Wang1, Yuxiang Yang2, Wenhao Yu2, Tingnan Zhang2, Bingqing Chen3, Jonathan Francis1,3, Zhenzhen Li3, Jie Tan2, Ding Zhao1
1Carnegie Mellon University, 2Google DeepMind, 3Bosch Center for Artificial Intelligence

Abstract

Quadrupedal robots have demonstrated impressive locomotion capabilities in complex environments, but equipping them with autonomous versatile manipulation skills in a scalable way remains a significant challenge. In this work, we introduce a system that integrates data collection and imitation learning from both humans and LocoMan, a quadrupedal robot with multiple manipulation modes. Specifically, we introduce a teleoperation and data collection pipeline, supported by dedicated hardware, which unifies and modularizes the observation and action spaces of the human and the robot. To effectively leverage the collected data, we propose an efficient learning architecture that supports co-training and pre-training with multimodal data across different embodiments. Additionally, we construct the first manipulation dataset for the LocoMan robot, covering various household tasks in both unimanual and bimanual modes, supplemented by a corresponding human dataset. Experimental results demonstrate that our data collection and training framework significantly improves the efficiency and effectiveness of imitation learning, enabling more versatile quadrupedal manipulation capabilities.

Learning Framework

Teleoperation & Data Collection System

We develop a unified system for both robot teleoperation and human data collection. Our teleoperation system controls the 6D poses of the end-effectors, the gripper actions, and the rigid body on which the stereo camera is mounted (the torso of LocoMan, or the head of the human). Coordinating these whole-body motions expands the workspace and the camera's sensing range. The system is responsive, stable, and user-friendly: it interpolates commands during fast movements and handles and recovers from self-collisions, singularities, ground collisions, and joint limits.
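As a rough illustration of the interpolation and limit handling described above (a minimal sketch, not the authors' implementation; the function name, step size, and limit handling are our own assumptions), the snippet below spreads large teleoperation jumps over several control ticks and clamps the resulting joint targets to their bounds:

```python
import numpy as np

def safe_joint_targets(q_current, q_target, joint_limits, max_step=0.05):
    """Interpolate toward a teleoperation target and clamp to joint limits.

    q_current, q_target: arrays of joint angles (rad).
    joint_limits: (lower, upper) arrays of per-joint bounds (rad).
    max_step: maximum per-joint change per control tick (rad); hypothetical value.
    """
    delta = q_target - q_current
    # Interpolate fast movements: cap the per-tick change so a large VR jump
    # is executed over several control steps instead of all at once.
    delta = np.clip(delta, -max_step, max_step)
    q_next = q_current + delta
    # Keep the commanded configuration within the joint limits.
    lower, upper = joint_limits
    return np.clip(q_next, lower, upper)
```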

We mount a stereo camera on the VR headset to collect first-person-view human videos and to capture the operator's hand and head motions in a coordinate frame unified with the robot embodiment. Through the simple interface we designed, a human operator without access to the robot can easily collect dozens of trajectories within minutes.
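To make the unified observation and action spaces concrete, here is a minimal sketch of a per-timestep record that both the human and robot pipelines could emit under this convention; the field names and shapes are our own illustration, not the released data format:

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class UnifiedStep:
    """One timestep of teleoperation or human data in a shared convention.

    All poses are expressed relative to the camera-bearing body (LocoMan's
    torso or the human's head), so the same policy interface applies to
    either embodiment. Field names are illustrative.
    """
    embodiment: str         # "locoman" or "human"
    stereo_rgb: np.ndarray  # (2, H, W, 3) left/right images
    eef_pose: np.ndarray    # (n_arms, 7) position + quaternion per end-effector
    gripper: np.ndarray     # (n_arms,) gripper open/close commands
    body_pose: np.ndarray   # (7,) torso or head pose
    proprio: np.ndarray = field(default_factory=lambda: np.zeros(0))  # robot-only joint states
```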

Modularized Cross-Embodiment Transformer (MXT)

Our learning framework is designed to use data from both human and robot sources efficiently. We propose a modularized architecture, the Modularized Cross-Embodiment Transformer (MXT). MXT consists of three groups of modules: tokenizers, a transformer trunk, and detokenizers. The tokenizers act as encoders, mapping embodiment-specific observations to tokens in the latent space, and the detokenizers translate the output tokens from the trunk into actions in each embodiment's action space. The tokenizers and detokenizers are specific to one embodiment and are reinitialized for each new embodiment, while the trunk is shared across all embodiments and reused when transferring the policy between embodiments.
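The sketch below illustrates this modular layout in PyTorch, assuming flat observation and action vectors and placeholder hyperparameters; it is not the paper's implementation, only a minimal instance of embodiment-specific tokenizers and detokenizers wrapped around a shared trunk:

```python
import torch
import torch.nn as nn

class MXTSketch(nn.Module):
    """Illustrative layout: per-embodiment tokenizers/detokenizers around a
    shared transformer trunk. Dimensions and depths are placeholders."""

    def __init__(self, obs_dims, act_dims, d_model=256, n_tokens=16):
        super().__init__()
        self.n_tokens = n_tokens
        # Embodiment-specific encoders: observation vector -> token sequence.
        self.tokenizers = nn.ModuleDict({
            name: nn.Linear(dim, d_model * n_tokens) for name, dim in obs_dims.items()
        })
        # Shared trunk, reused (and transferred) across embodiments.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=4)
        # Embodiment-specific decoders: output tokens -> action vector.
        self.detokenizers = nn.ModuleDict({
            name: nn.Linear(d_model * n_tokens, dim) for name, dim in act_dims.items()
        })

    def forward(self, obs, embodiment):
        b = obs.shape[0]
        tokens = self.tokenizers[embodiment](obs).view(b, self.n_tokens, -1)
        out = self.trunk(tokens)
        return self.detokenizers[embodiment](out.flatten(1))

# Pre-train with human data, then reinitialize the robot-specific tokenizer and
# detokenizer while keeping the shared trunk for transfer to the robot.
model = MXTSketch(obs_dims={"human": 128, "locoman": 96},
                  act_dims={"human": 20, "locoman": 18})
action = model(torch.randn(4, 96), embodiment="locoman")
```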


Autonomous Policy

Robust to OOD Objects

[Videos: MXT vs. MXT (not pretrained); MXT vs. HIT]

Data-Efficient Finetuning

[Videos: MXT finetuned with 2 minutes of robot data (5/5 successes) vs. MXT (not pretrained) trained with 4 minutes of robot data (2/5 successes)]