Quadrupedal robots have demonstrated impressive locomotion capabilities in complex environments, but equipping them with autonomous versatile manipulation skills in a scalable way remains a significant challenge. In this work, we introduce a system that integrates data collection and imitation learning from both humans and LocoMan, a quadrupedal robot with multiple manipulation modes. Specifically, we introduce a teleoperation and data collection pipeline, supported by dedicated hardware, which unifies and modularizes the observation and action spaces of the human and the robot. To effectively leverage the collected data, we propose an efficient learning architecture that supports co-training and pre-training with multimodal data across different embodiments. Additionally, we construct the first manipulation dataset for the LocoMan robot, covering various household tasks in both unimanual and bimanual modes, supplemented by a corresponding human dataset. Experimental results demonstrate that our data collection and training framework significantly improves the efficiency and effectiveness of imitation learning, enabling more versatile quadrupedal manipulation capabilities.
We develop a unified system for both robot teleoperation and human data collection. Our teleoperation system controls the 6D poses of the end-effectors, the gripper actions, and the rigid body on which the stereo camera is mounted (the torso of LocoMan, or the head of the human). Coordinating these whole-body motions expands both the workspace and the camera's sensing range. The teleoperation system is responsive, stable, and user-friendly: it interpolates commands during fast movements, and it handles and recovers from self-collisions, singularities, ground collisions, and joint limits.
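As a rough illustration of such a unified teleoperation interface, the sketch below shows one way to represent end-effector, gripper, and camera-body commands in a shared layout and to interpolate a fast operator motion into a sequence of small setpoints. The names (e.g., UnifiedAction, interpolate_actions) and the array layout are illustrative assumptions, not our released implementation.

```python
# Minimal sketch of a unified teleoperation action interface (illustrative only;
# UnifiedAction and interpolate_actions are hypothetical names).
from dataclasses import dataclass
import numpy as np


@dataclass
class UnifiedAction:
    """Shared action layout for the human and LocoMan embodiments."""
    ee_poses: np.ndarray      # (n_arms, 7): xyz position + wxyz quaternion per end-effector
    gripper_cmds: np.ndarray  # (n_arms,): gripper open/close command in [0, 1]
    body_pose: np.ndarray     # (7,): pose of the camera-bearing body (torso or head)


def interpolate_actions(prev: UnifiedAction, target: UnifiedAction, steps: int):
    """Stream a large operator motion as small, executable setpoints.
    Orientations are linearly blended here for brevity; a real controller
    would use quaternion slerp."""
    for i in range(1, steps + 1):
        a = i / steps
        yield UnifiedAction(
            ee_poses=(1 - a) * prev.ee_poses + a * target.ee_poses,
            gripper_cmds=(1 - a) * prev.gripper_cmds + a * target.gripper_cmds,
            body_pose=(1 - a) * prev.body_pose + a * target.body_pose,
        )
```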
We mount a stereo camera on the VR headset to collect first-person-view human videos and to capture the human's hand and head motions in a coordinate frame unified with that of the robot embodiment. Through the simple interface we designed, a human operator without access to a robot can easily collect dozens of trajectories on their own within minutes.
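The sketch below illustrates, under assumed data formats, how hand poses could be expressed relative to the camera-bearing body (the human head or robot torso) so that both embodiments share one observation convention; to_unified_frame and record_step are hypothetical helpers, not our actual pipeline code.

```python
# Illustrative sketch of logging human demonstrations in a body-relative frame
# shared with the robot (function names and dict keys are assumptions).
import numpy as np


def to_unified_frame(pose_in_world: np.ndarray, body_in_world: np.ndarray) -> np.ndarray:
    """Express a 4x4 homogeneous hand pose relative to the camera-bearing body
    (human head or robot torso), so both embodiments use the same convention."""
    return np.linalg.inv(body_in_world) @ pose_in_world


def record_step(buffer: list, stereo_image, head_in_world, hands_in_world):
    """Append one synchronized frame: first-person stereo image plus hand poses
    expressed in the unified (body-relative) frame."""
    buffer.append({
        "image": stereo_image,
        "hand_poses": [to_unified_frame(h, head_in_world) for h in hands_in_world],
        "body_pose": head_in_world,
    })
```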
Our learning framework is designed to efficiently use data from both human and robot sources. We propose a modularized design, the Modularized Cross-Embodiment Transformer (MXT). MXT consists of three main groups of modules: tokenizers, a transformer trunk, and detokenizers. The tokenizers act as encoders that map embodiment-specific observations to tokens in the latent space, and the detokenizers translate the output tokens from the trunk into actions in the action space of each embodiment. The tokenizers and detokenizers are specific to one embodiment and are reinitialized for each new embodiment, while the trunk is shared across all embodiments and reused when transferring the policy among embodiments.
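To make this module layout concrete, the PyTorch sketch below wires per-embodiment tokenizers and detokenizers around a shared transformer trunk. It is a minimal sketch of the described structure, assuming simple linear tokenizers and illustrative dimensions; it is not the released MXT implementation.

```python
# Minimal PyTorch sketch of the MXT layout: embodiment-specific tokenizers and
# detokenizers around a shared trunk. Dimensions and module choices are assumptions.
import torch
import torch.nn as nn


class MXT(nn.Module):
    def __init__(self, obs_dims: dict, act_dims: dict, d_model: int = 256):
        super().__init__()
        # Embodiment-specific encoders: observations -> latent tokens.
        self.tokenizers = nn.ModuleDict(
            {name: nn.Linear(dim, d_model) for name, dim in obs_dims.items()}
        )
        # Shared trunk, reused (and transferred) across embodiments.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=4)
        # Embodiment-specific decoders: output tokens -> embodiment actions.
        self.detokenizers = nn.ModuleDict(
            {name: nn.Linear(d_model, dim) for name, dim in act_dims.items()}
        )

    def forward(self, embodiment: str, obs: torch.Tensor) -> torch.Tensor:
        z = self.tokenizers[embodiment](obs)      # (B, T, d_model)
        z = self.trunk(z)                         # shared processing
        return self.detokenizers[embodiment](z)   # (B, T, act_dim)


# Example: co-training with human and LocoMan data through the same trunk,
# with only the tokenizer/detokenizer pairs kept embodiment-specific.
model = MXT(obs_dims={"human": 64, "locoman": 80}, act_dims={"human": 20, "locoman": 26})
human_actions = model("human", torch.randn(2, 8, 64))
robot_actions = model("locoman", torch.randn(2, 8, 80))
```

When transferring to a new embodiment, only the corresponding tokenizer and detokenizer would be reinitialized while the trunk weights are retained, mirroring the sharing scheme described above.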