ECHO: An Edge–Cloud Framework for Language-Driven Whole-Body Control of Humanoid Robots
Haozhe Jia (1,2,*) · Jianfei Song (2,*) · Yuan Zhang (3,*) · Honglei Jin (1,3) · Youcheng Fan (1) · Wenshuo Chen (1) · Wei Zhang (2,†) · Yutao Yue (1,4,‡)
(1) HKUST(GZ) · (2) LimX Dynamics · (3) Shandong University · (4) JITRI
*Equal contribution · †Co-corresponding author · ‡Lead corresponding author
🤖 Diffusion Generation · ☁️ Edge–Cloud · 🦾 Unitree G1

Abstract

We present ECHO, an edge–cloud framework for language-driven whole-body control of humanoid robots. A cloud-hosted diffusion-based text-to-motion generator synthesizes motion references from natural language instructions, while an edge-deployed reinforcement-learning tracker executes them in closed loop on the robot. The two modules are bridged by a compact, robot-native 38-dimensional motion representation that encodes joint angles, root planar velocity, root height, and a continuous 6D root orientation per frame, eliminating retargeting from human body models and remaining directly compatible with low-level PD control. The generator adopts a 1D convolutional UNet with cross-attention conditioned on CLIP-encoded text features; at inference, DDIM sampling with 10 denoising steps and classifier-free guidance produces motion sequences in approximately one second on a cloud GPU. The tracker follows a Teacher–Student paradigm: a privileged teacher policy is distilled into a lightweight student equipped with an evidential adaptation module for sim-to-real transfer, further strengthened by morphological symmetry constraints and domain randomization. An autonomous fall recovery mechanism detects falls via onboard IMU readings and retrieves recovery trajectories from a pre-built motion library. We evaluate ECHO on a retargeted HumanML3D benchmark, where it achieves state-of-the-art generation quality (FID 0.029, R-Precision Top-1 0.693) while maintaining high motion safety and trajectory consistency. Real-world experiments on a Unitree G1 humanoid demonstrate stable execution of diverse text commands with zero hardware fine-tuning. Code and models will be released.
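
To make the 38-dimensional per-frame representation concrete, here is a minimal Python sketch of how such a frame could be unpacked. It is an illustration under stated assumptions, not the released implementation. The field order, the names `MotionFrame`, `unpack_frame`, and `rot6d_to_matrix`, and the count of 29 joint angles are all hypothetical. The joint count is only inferred as 38 - 2 - 1 - 6, assuming the planar velocity has two components. The 6D orientation is decoded with the standard Gram-Schmidt construction, treating the six values as two 3-vectors that become the first two columns of the rotation matrix.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

FRAME_DIM = 38   # stated in the abstract
NUM_JOINTS = 29  # assumption: 38 - 2 (planar velocity) - 1 (height) - 6 (orientation)


@dataclass
class MotionFrame:
    joint_angles: List[float]      # NUM_JOINTS joint angles
    root_velocity_xy: List[float]  # root planar velocity (vx, vy)
    root_height: float             # root height
    root_rot6d: List[float]        # continuous 6D root orientation


def unpack_frame(frame: Sequence[float]) -> MotionFrame:
    """Split a flat 38-D vector using the assumed (hypothetical) field order."""
    if len(frame) != FRAME_DIM:
        raise ValueError(f"expected {FRAME_DIM} values, got {len(frame)}")
    j = NUM_JOINTS
    return MotionFrame(
        joint_angles=list(frame[:j]),
        root_velocity_xy=list(frame[j:j + 2]),
        root_height=float(frame[j + 2]),
        root_rot6d=list(frame[j + 3:j + 9]),
    )


def _normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    if norm < 1e-8:
        raise ValueError("degenerate 6D rotation: zero-length basis vector")
    return [x / norm for x in v]


def rot6d_to_matrix(r6: Sequence[float]) -> List[List[float]]:
    """Gram-Schmidt decode of a 6D rotation into a 3x3 matrix (row-major).

    The columns of the result are the orthonormal vectors b1, b2, b3.
    """
    a1, a2 = list(r6[:3]), list(r6[3:6])
    b1 = _normalize(a1)
    # Remove the component of a2 along b1, then normalize.
    d = sum(x * y for x, y in zip(b1, a2))
    b2 = _normalize([a2[i] - d * b1[i] for i in range(3)])
    # The third axis is the cross product b1 x b2.
    b3 = [
        b1[1] * b2[2] - b1[2] * b2[1],
        b1[2] * b2[0] - b1[0] * b2[2],
        b1[0] * b2[1] - b1[1] * b2[0],
    ]
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


if __name__ == "__main__":
    # A neutral frame: zero joint angles, slow forward motion, identity orientation.
    flat = [0.0] * NUM_JOINTS + [0.1, 0.0, 0.75] + [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
    frame = unpack_frame(flat)
    print(frame.root_height, rot6d_to_matrix(frame.root_rot6d))
```

Because the 6D form always decodes to a valid rotation after Gram-Schmidt, a generator can output unconstrained values for those six channels, and the joint-angle channels can be passed to a PD controller as targets without an intermediate human body model.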

Teaser figure

Motion Gallery

21 text-driven motions — Simulation (top) vs. Real Unitree G1 (bottom).

Detailed Results

Simulation (left) vs. Real-World Unitree G1 (right) — driven by the same text prompt.