A Humanoid Robot Learns to Stand by Itself

▶ Side-by-Side

Iteration 1 → 4 → 6 in One Take

The full learning arc, side by side at 3× playback speed. Left: iteration 1 collapses immediately. Middle: iteration 4 stands but drifts. Right: iteration 6 stands upright and stable with minimal sway. All three clips are 1000-step episodes (~10 seconds of simulation time).

1 Starting Point

A Robot That Collapses Without Control

Load the Unitree H1 into MuJoCo — a physics engine that simulates gravity, mass and joints accurately — and release it without a controller. It falls immediately. That is the expected baseline. Standing is active work.
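The claim that standing is active work can be illustrated without MuJoCo. The sketch below is a toy planar inverted pendulum, not the H1 model: the mass, length, controller gains and fall threshold are all illustrative assumptions. With zero torque a small initial lean grows until the pendulum tips over, while a simple PD torque holds it upright for a full 1000-step episode.

```python
import math


def simulate_pendulum(kp=0.0, kd=0.0, theta0=0.05, steps=1000, dt=0.01):
    """Toy planar inverted pendulum (NOT the H1 model).

    Returns the step at which the lean angle exceeds 45 degrees,
    or `steps` if the pendulum stays up for the whole episode.
    All physical constants below are illustrative assumptions.
    """
    g, length, mass = 9.81, 1.0, 50.0
    theta, omega = theta0, 0.0  # lean angle from vertical [rad], angular velocity [rad/s]
    for step in range(steps):
        # PD torque pulling back toward upright; zero gains mean "no controller".
        torque = -kp * theta - kd * omega
        alpha = (g / length) * math.sin(theta) + torque / (mass * length ** 2)
        # Semi-implicit Euler: update the velocity first, then the angle.
        omega += alpha * dt
        theta += omega * dt
        if abs(theta) > math.pi / 4:
            return step
    return steps


passive_steps = simulate_pendulum()                       # no controller
controlled_steps = simulate_pendulum(kp=2000.0, kd=200.0)  # hypothetical PD gains
print(passive_steps, controlled_steps)
```

The humanoid case is far harder than this one-joint toy, which is why the controller is learned rather than hand-tuned.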

🤖

19 Motors

Hips, knees, ankles, torso and arms — each motor must deliver the right torque at every timestep.

🎶

25 Degrees of Freedom

25 moving axes that must stay balanced simultaneously — in milliseconds, continuously.

🧠

Not Solvable by Hand

No human can manually coordinate 19 motors fast enough to keep a humanoid upright. The controller must be learned.

2 Making the Goal Verifiable

"Learn to Stand" Isn't Testable. This Is.

A vague objective cannot be verified. The first step was translating the intent into a concrete success criterion.

🎯

Verifiable criterion

Hold the torso at approximately 0.98 m height for 1000 simulation steps (~10 seconds), upright, without falling.

0.98 m is the torso height in the H1's default upright pose. Reaching this criterion is measurable, not asserted.
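A criterion like this could be checked per episode and then aggregated into a success rate and a mean stand time. The following is a minimal sketch, not the project's evaluation code: the 0.98 m target and the 1000-step horizon come from the criterion above, while the height tolerance, the uprightness threshold and the function names are assumptions made for illustration.

```python
from statistics import mean

TARGET_HEIGHT_M = 0.98  # from the criterion above
EPISODE_STEPS = 1000    # from the criterion above
HEIGHT_TOL_M = 0.10     # assumption: how far "approximately" may deviate
MIN_UP_Z = 0.8          # assumption: minimum z-component of the torso's up-axis


def stand_time(heights, up_z):
    """Count the steps from episode start until the criterion is first violated."""
    for step, (height, z) in enumerate(zip(heights, up_z)):
        if abs(height - TARGET_HEIGHT_M) > HEIGHT_TOL_M or z < MIN_UP_Z:
            return step
    return min(len(heights), len(up_z))


def summarize(stand_times):
    """Aggregate per-episode stand times into success rate and mean stand time."""
    successes = sum(t >= EPISODE_STEPS for t in stand_times)
    return {
        "success_rate": successes / len(stand_times),
        "mean_stand_time": mean(stand_times),
    }


# Two synthetic episodes: one holds the pose, one sinks after 150 steps.
held = stand_time([0.98] * 1000, [1.0] * 1000)
fell = stand_time([0.98] * 150 + [0.40] * 850, [1.0] * 1000)
print(summarize([held, fell]))
```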

3 Approach

Three Options Weighed — One Chosen

Before writing code, three realistic paths were evaluated.

PPO with a custom MuJoCo environment (chosen)

A custom training environment built around the actual H1 MJCF model, trained with the PPO algorithm (Stable-Baselines3). Full control over observations, rewards and termination logic — and the exact robot in question.

MuJoCo Playground / MJX (discarded)

MJX massively accelerates simulation — but only with a GPU. On a CPU-only machine it adds complexity without benefit.

Off-the-shelf humanoid environment (discarded)

Pre-built environments use a generic humanoid, not the Unitree H1. Results would not transfer to the real hardware.

Component	Version	Role
PyTorch	2.x (CPU)	Neural network & training
MuJoCo	3.8	Physics simulation
Gymnasium	1.2	Environment interface
Stable-Baselines3	2.8	PPO implementation

4 The Six Iterations

From 0 % to 96.7 % — Step by Step

Each iteration changed one thing, measured the outcome, and traced the improvement or failure back to a specific design decision.

Iteration	Training steps	Success rate	Mean stand time (steps)	Key change
1 — Reward v1	3 M	0 %	159	Baseline: alive-bonus + height + upright. Robot exploits reward, flails.
2 — Pose reference	5 M	5 %	456	Human-in-the-loop: explicit stand-pose reference added as dominant reward term.
3 — Resume	15 M	25 %	594	Continued from Iter 2. Upward trend but training instability (clip_fraction 0.76).
4 — Stabilised	25 M	83.3 %	930	Learning rate 3e-4 → 1e-4. Clean convergence; robot stands upright.
5 — Stillness	35 M	90 %	949	Velocity penalty over all joints. XY-drift cut from 3.4 m to 0.53 m.
6 — Lower/Upper + Foot-Lock	45 M	96.7 %	990	Anatomic split: legs penalised harder than arms. Foot-position lock added.
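The clip_fraction reported for iteration 3 is a PPO diagnostic: the share of samples whose new-to-old policy probability ratio landed outside the clipping interval [1 - ε, 1 + ε]. The plain-Python sketch below shows that calculation on made-up ratios. The clip range of 0.2 is the Stable-Baselines3 default; whether the project kept that default is not stated in the source.

```python
def clip_fraction(ratios, clip_range=0.2):
    """Share of samples whose probability ratio lies outside [1 - clip_range, 1 + clip_range]."""
    clipped = [abs(ratio - 1.0) > clip_range for ratio in ratios]
    return sum(clipped) / len(clipped)


# Made-up ratios for illustration only, not values from the training logs.
calm_update = [0.95, 1.05, 1.10, 0.90, 1.02]
noisy_update = [0.50, 1.60, 1.05, 2.10, 0.70]

print(clip_fraction(calm_update))   # 0.0
print(clip_fraction(noisy_update))  # 0.8
```

A value as high as 0.76 means most samples hit the clip boundary, so each update moves the policy a long way. Reducing the learning rate, as iteration 4 did, is one common way to bring that down.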

All 6 iterations, top to bottom: Iter 1 (flailing) → Iter 2 (hunched) → 15 M (holds slightly bent) → 25 M (upright) → Iter 5 (still) → Iter 6 (peak)

Iteration 1: flails and falls. 3 M steps · 0 % success · arms fully extended, collapses fast

Failed

Iteration 2: hunched, more stable. 5 M steps · 5 % success · pose-reference reward takes effect

Breakthrough

15 M steps: holds, slightly bent. Iter 3 · 25 % success · longer stand time, training still noisy

Progress

25 M steps: stands upright. Iter 4 · 83.3 % success · full 1000-step episode, stable and calm

Goal reached

Iteration 5: stillness, uniform penalty. Velocity penalty across all joints · 90 % success · drift drops to 0.53 m

Progress

Iteration 6: lower/upper split and foot lock. Anatomically differentiated penalties · 96.7 % success · maximum stability and uprightness

Peak

5 Key Insight

★ Human in the Loop

The Insight the Algorithm Could Not Find on Its Own

After iteration 1, more compute would not have fixed the problem. The reward structure itself was wrong. The diagnosis: the agent was maximising the reward signal — not the intended behaviour. The solution required a domain decision, not more training.

"Tell the robot explicitly: this is the standing pose — and pull every deviation back toward it."

That is the central design decision of the project. Not "stay vaguely upright" but a concrete target posture as a reference, with a reward that actively guides the policy there. Subsequent reward refinements (Lower/Upper anatomic split, foot-position lock) followed the same pattern: domain knowledge translated into reward structure before retraining.
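The later refinements can be sketched in the same spirit. The code below is a hypothetical illustration of an anatomically split velocity penalty (legs weighted harder than arms) and a foot-position lock. The function names, index layout and weights are assumptions, not values taken from the repository.

```python
def stillness_penalty(joint_vel, leg_idx, arm_idx, w_leg=0.10, w_arm=0.02):
    """Velocity penalty with a lower/upper split: leg motion costs more than arm motion.

    The weights are illustrative assumptions; the return value is always <= 0.
    """
    leg_term = sum(joint_vel[i] ** 2 for i in leg_idx)
    arm_term = sum(joint_vel[i] ** 2 for i in arm_idx)
    return -(w_leg * leg_term + w_arm * arm_term)


def foot_lock_penalty(foot_xy, foot_xy_ref, weight=1.0):
    """Penalise squared horizontal displacement of a foot from its reference position."""
    dist_sq = sum((p - r) ** 2 for p, r in zip(foot_xy, foot_xy_ref))
    return -weight * dist_sq


# Toy layout with one leg joint (index 0) and one arm joint (index 1).
leg_moving = stillness_penalty([1.0, 0.0], leg_idx=[0], arm_idx=[1])
arm_moving = stillness_penalty([0.0, 1.0], leg_idx=[0], arm_idx=[1])
print(leg_moving, arm_moving)  # the same speed costs more on the leg joint
print(foot_lock_penalty((0.10, 0.00), (0.00, 0.00)))
```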

The reward engineering workflow

Each iteration is traceable to one domain insight. No iteration was started without a hypothesis about what was limiting the previous one.

Define verifiable goal → Diagnose failure mode → Translate domain knowledge to reward term → Train → Measure outcome → Repeat

Iteration 2: dominant pose-reference term
pose_reward = 2.0 * exp(-6 * mean squared joint deviation from stand pose)

The term peaks at 2.0 in the exact standing posture and decays exponentially with any deviation. It is combined with a mild anti-jitter term. Mean stand time nearly tripled versus iteration 1 (159 → 456 steps).
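Written out in Python, the pose-reference term might look like the sketch below. The constants 2.0 and 6 come from the formula above. The function name, the argument names and the choice to average over every joint passed in are assumptions.

```python
import math


def pose_reward(joint_pos, stand_pose, scale=2.0, sharpness=6.0):
    """scale * exp(-sharpness * mean squared joint deviation from the stand pose)."""
    if len(joint_pos) != len(stand_pose):
        raise ValueError("joint_pos and stand_pose must have the same length")
    mse = sum((q - q_ref) ** 2 for q, q_ref in zip(joint_pos, stand_pose)) / len(stand_pose)
    return scale * math.exp(-sharpness * mse)


stand_pose = [0.0] * 19          # placeholder reference pose, one entry per motor
at_pose = pose_reward(stand_pose, stand_pose)
slightly_off = pose_reward([0.1] * 19, stand_pose)
far_off = pose_reward([0.5] * 19, stand_pose)
print(at_pose)  # 2.0
print(slightly_off, far_off)
```

Because the reward is highest only at the reference pose, the policy cannot collect it by flailing, which was the failure mode of iteration 1.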

6 Honest Constraint

The Ankle Limit — a Physical Bottleneck, Not an Algorithm Bug

Iteration 6 reaches 96.7 % success but statue-level stillness is not yet achieved. The reason is identified and measured.

⚠

Ankle actuator load: 82 %

MuJoCo inverse dynamics analysis shows ankle actuators running at 82 % of their torque limit just to maintain position. The remaining 18 % capacity leaves little room for active balance correction. This is a physical constraint of the H1 model, not a reward design problem.
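The utilisation figure is a ratio of required holding torque to the actuator's torque limit. The sketch below shows that arithmetic. The torque values are hypothetical numbers chosen only to reproduce the 82 % figure, not measurements from the H1 model.

```python
def torque_utilisation(required_torque, torque_limit):
    """Fraction of the actuator's torque limit used by the required torque."""
    if torque_limit <= 0:
        raise ValueError("torque_limit must be positive")
    return abs(required_torque) / torque_limit


def torque_headroom(required_torque, torque_limit):
    """Fraction of the torque limit left over for balance corrections."""
    return max(0.0, 1.0 - torque_utilisation(required_torque, torque_limit))


# Hypothetical values: 32.8 N·m needed to hold position against a 40 N·m limit.
utilisation = torque_utilisation(32.8, 40.0)
headroom = torque_headroom(32.8, 40.0)
print(f"utilisation {utilisation:.0%}, headroom {headroom:.0%}")
```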

The planned next step (Phase 5: Balance Assist) addresses this via ground-reaction-force feedback and a curriculum assist that reduces the torque load required from the ankle actuators.
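The source does not describe how the curriculum assist is scheduled. One common pattern, shown here purely as an assumption, is to start training with full external assistance and anneal it linearly to zero so the policy gradually takes over the load.

```python
def assist_scale(step, total_steps, start=1.0, end=0.0):
    """Linearly anneal an assist coefficient from `start` to `end` over training.

    Hypothetical schedule: the project's actual curriculum is not specified.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return start + (end - start) * progress


for step in (0, 2_500_000, 5_000_000, 10_000_000):
    print(step, assist_scale(step, total_steps=10_000_000))
```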

7 What's Next

Roadmap — Honest Status

Phase 1 (standing) is complete. What remains is stated plainly.

Phase	Status	Core mechanism
Phase 5 — Balance Assist	Designed, not trained	Ground-reaction-force feedback + curriculum assist to offload ankle actuators. Target: statue-level stillness.
Phase 2 — Get Up	Code complete, not trained	8-stage choreography (sit-up → squat → stand). Requires Phase 5 base model + larger compute budget.
ROS2 Integration	Not in this project	ROS2/Jazzy work runs in a parallel project (Gazebo + Blender simulation). Bridge to this RL policy not yet built.
Sim2Real Transfer	Not performed	Domain randomisation, motor latency, friction variance on physical H1 hardware — next step before any hardware test.
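Domain randomisation, named in the Sim2Real row, usually means resampling physical parameters at every episode reset so the policy cannot overfit to one simulator configuration. The sketch below samples friction and motor latency from uniform ranges. The ranges and the `SimParams` container are hypothetical, and applying the values to a MuJoCo model is left out.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SimParams:
    friction: float         # ground friction coefficient
    motor_latency_s: float  # delay between command and applied torque


# Hypothetical ranges for illustration; real ranges would come from hardware measurements.
RANGES = {
    "friction": (0.6, 1.2),
    "motor_latency_s": (0.0, 0.02),
}


def sample_params(rng):
    """Draw one randomised parameter set, e.g. at every episode reset."""
    return SimParams(
        friction=rng.uniform(*RANGES["friction"]),
        motor_latency_s=rng.uniform(*RANGES["motor_latency_s"]),
    )


rng = random.Random(0)  # seeded so runs are reproducible
for _ in range(3):
    print(sample_params(rng))
```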

🔗

Full code, reward formulas, training logs — Apache 2.0

github.com/OlafStolle/h1-stand-rl

8 About the Maker

Olaf Stolle — Ai-Crafters

20+ years bridging hardware and software. Robotics is the natural next step in that arc.

Background covers construction 3D modelling, custom GIS tooling (ArcGIS Add-ons, spatial algorithms, Leaflet/PostGIS), embedded telemetry (off-grid inverter API, water meter OCR on Raspberry Pi), and three years of AI/automation consulting as a freelancer via Ai-Crafters — covering n8n, Claude, OpenAI, Voice-AI and RAG pipelines.

Formal qualification: Maurermeister — DQR/EQF Level 6 (Bachelor Professional in the European Qualifications Framework). Domain expertise from structural and civil engineering work feeds directly into the physics intuition applied to reward design here.

This project connects all three strands: physical domain knowledge, engineering practice from hardware work, and AI tooling from automation consulting.

✉ info@ai-crafters.io 🌐 ai-crafters.io