Robotics · Reinforcement Learning · Portfolio

A Humanoid Robot Learns to Stand by Itself

Reinforcement learning on the Unitree H1 in MuJoCo: 6 iterations, from 0 % to 96.7 % success.

45 M training steps
6 reward iterations
96.7 % success rate
19 actuators
Side-by-Side

Iteration 1 → 4 → 6 in One Take

The full learning arc at 3× speed. Left: iteration 1 collapses immediately. Middle: iteration 4 stands but drifts. Right: iteration 6 — upright, stable, minimal sway.

Side-by-side comparison, 3× playback speed. All three clips are 1000-step episodes (~10 seconds of simulation time).

1 Starting Point

A Robot That Collapses Without Control

Load the Unitree H1 into MuJoCo — a physics engine that simulates gravity, mass and joints accurately — and release it without a controller. It falls immediately. That is the expected baseline. Standing is active work.

🤖

19 Motors

Hips, knees, ankles, torso and arms — each motor must deliver the right torque at every timestep.

🎶

25 Degrees of Freedom

The 19 actuated joints plus the 6 degrees of freedom of the free-floating base give 25 axes that must stay balanced simultaneously, corrected every few milliseconds.

🧠

Not Solvable by Hand

No human can manually coordinate 19 motors fast enough to keep a humanoid upright. The controller must be learned.

2 Making the Goal Verifiable

"Learn to Stand" Isn't Testable. This Is.

A vague objective cannot be verified. The first step was translating the intent into a concrete success criterion.

🎯
Verifiable criterion

Hold the torso at approximately 0.98 m height for 1000 simulation steps (~10 seconds), upright, without falling.

0.98 m is the torso height in the H1's default upright pose. Reaching this criterion is measurable, not asserted.
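One way to make that criterion executable is sketched below. The 0.98 m target and the 1000-step horizon come from the criterion above; the height tolerance, the uprightness threshold, and every function name are illustrative assumptions, not values taken from the project.

```python
from dataclasses import dataclass
from typing import Sequence

TARGET_HEIGHT_M = 0.98     # from the criterion above
EPISODE_STEPS = 1000       # from the criterion above
HEIGHT_TOLERANCE_M = 0.10  # assumption: allowed deviation from the target height
MIN_UPRIGHT = 0.9          # assumption: minimum z-component of the torso's up axis


@dataclass
class StepRecord:
    torso_height_m: float  # torso height above the ground
    upright: float         # z-component of the torso's up axis (1.0 = vertical)


def step_is_standing(step: StepRecord) -> bool:
    """A single step counts as standing if height and uprightness are in range."""
    height_ok = abs(step.torso_height_m - TARGET_HEIGHT_M) <= HEIGHT_TOLERANCE_M
    return height_ok and step.upright >= MIN_UPRIGHT


def stand_time(episode: Sequence[StepRecord]) -> int:
    """Number of consecutive standing steps, counted from the episode start."""
    count = 0
    for step in episode[:EPISODE_STEPS]:
        if not step_is_standing(step):
            break
        count += 1
    return count


def episode_succeeded(episode: Sequence[StepRecord]) -> bool:
    """Success means standing for the full horizon without interruption."""
    return stand_time(episode) >= EPISODE_STEPS


def summarise(episodes: Sequence[Sequence[StepRecord]]) -> tuple:
    """Return (success rate, mean stand time in steps) over evaluation episodes."""
    times = [stand_time(ep) for ep in episodes]
    success_rate = sum(t >= EPISODE_STEPS for t in times) / len(times)
    return success_rate, sum(times) / len(times)
```

With a check like this, "the robot stands" becomes a number that can be compared across training runs rather than a visual impression.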

3 Approach

Three Options Weighed — One Chosen

Before writing code, three realistic paths were evaluated.

PPO with a custom MuJoCo environment (chosen)

A custom training environment built around the actual H1 MJCF model, trained with the PPO algorithm (Stable-Baselines3). Full control over observations, rewards and termination logic — and the exact robot in question.

MuJoCo Playground / MJX (discarded)

MJX accelerates simulation massively, but the speedup depends on a GPU or similar accelerator. On a CPU-only machine it adds complexity without benefit.

Off-the-shelf humanoid environment (discarded)

Pre-built environments use a generic humanoid, not the Unitree H1. Results would not transfer to the real hardware.

Component | Version | Role
PyTorch | 2.x (CPU) | Neural network & training
MuJoCo | 3.8 | Physics simulation
Gymnasium | 1.2 | Environment interface
Stable-Baselines3 | 2.8 | PPO implementation
4 The Six Iterations

From 0 % to 96.7 % — Step by Step

Each iteration changed one thing, measured the outcome, and traced the improvement or failure back to a specific design decision.

Iteration | Training steps | Success | Mean stand time (steps) | Key change
1: Reward v1 | 3 M | 0 % | 159 | Baseline: alive-bonus + height + upright. Robot exploits reward, flails.
2: Pose reference | 5 M | 5 % | 456 | Human-in-the-loop: explicit stand-pose reference added as dominant reward term.
3: Resume | 15 M | 25 % | 594 | Continued from Iter 2. Upward trend but training instability (clip_fraction 0.76).
4: Stabilised | 25 M | 83.3 % | 930 | Learning rate 3e-4 → 1e-4. Clean convergence; robot stands upright.
5: Stillness | 35 M | 90 % | 949 | Velocity penalty over all joints. XY-drift cut from 3.4 m to 0.53 m.
6: Lower/Upper + Foot-Lock | 45 M | 96.7 % | 990 | Anatomic split: legs penalised harder than arms. Foot-position lock added.
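The clip_fraction value quoted for iteration 3 is a diagnostic that Stable-Baselines3 logs during PPO updates: the share of samples whose probability ratio left the clipping interval. A minimal, dependency-free sketch of that calculation follows. The clip range of 0.2 is the Stable-Baselines3 default and is assumed here, since the project's own setting is not stated, and the log-probabilities are made-up numbers.

```python
import math


def clip_fraction(old_log_probs, new_log_probs, clip_range=0.2):
    """Share of samples whose ratio pi_new / pi_old lies outside 1 +/- clip_range."""
    ratios = [math.exp(new - old) for old, new in zip(old_log_probs, new_log_probs)]
    clipped = [abs(ratio - 1.0) > clip_range for ratio in ratios]
    return sum(clipped) / len(clipped)


# Made-up log-probabilities of the same four actions before and after an update.
old = [-1.0, -1.0, -1.0, -1.0]
small_update = [-1.05, -0.95, -1.10, -0.90]  # policy barely moved
large_update = [-1.60, -0.40, -1.05, -1.90]  # policy moved a long way

print(clip_fraction(old, small_update))  # 0.0
print(clip_fraction(old, large_update))  # 0.75
```

A value near 0.76 means roughly three quarters of the samples hit the clip during an update, which suggests the policy is changing a lot per step. Lowering the learning rate, as iteration 4 did, is one common response.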
Figure (master frame strip): all 6 iterations, top to bottom: Iter 1 (flailing) → Iter 2 (hunched) → 15 M (holds slightly bent) → 25 M (upright) → Iter 5 (still) → Iter 6 (peak)
Figure (frame strip, Failed): Iteration 1: flails & falls · 3 M steps · 0 % success · arms fully extended, collapses fast
Figure (frame strip, Breakthrough): Iteration 2: hunched, more stable · 5 M steps · 5 % success · pose-reference reward takes effect
Figure (frame strip, Progress): 15 M steps: holds, slightly bent · Iter 3 · 25 % success · longer stand time, training still noisy
Figure (frame strip, Goal reached): 25 M steps: stands upright · Iter 4 · 83 % success · full 1000-step episode, stable & calm
Figure (frame strip, Progress): Iteration 5: stillness, uniform penalty · velocity penalty across all joints · 90 % success · drift drops to 0.53 m
Figure (frame strip, Peak): Iteration 6: Lower/Upper split + Foot-Lock · anatomically differentiated · 96.7 % success · maximum stability & uprightness
5 Key Insight
★ Human in the Loop

The Insight the Algorithm Could Not Find on Its Own

After iteration 1, more compute would not have fixed the problem. The reward structure itself was wrong. The diagnosis: the agent was maximising the reward signal — not the intended behaviour. The solution required a domain decision, not more training.
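To illustrate that diagnosis, here is a hedged sketch of a reward built only from the three iteration-1 terms (alive bonus, height, upright). The weights and the exact shape of each term are assumptions for illustration; only the choice of terms comes from the iteration table. Because none of the terms looks at the joints, a flailing robot and a still one score the same as long as the torso is in the same place.

```python
import math

TARGET_HEIGHT_M = 0.98  # torso height of the default upright pose (from the text)


def reward_v1(torso_height_m, upright, joint_velocities):
    """Iteration-1 style reward: alive bonus + height term + upright term.

    All weights and term shapes are illustrative assumptions.
    `joint_velocities` is accepted but deliberately unused: that omission
    is the point of the example.
    """
    alive_bonus = 1.0
    height_term = math.exp(-10.0 * (torso_height_m - TARGET_HEIGHT_M) ** 2)
    upright_term = max(0.0, upright)  # z-component of the torso's up axis
    return alive_bonus + height_term + upright_term


still = reward_v1(0.98, 1.0, joint_velocities=[0.0] * 19)
flailing = reward_v1(0.98, 1.0, joint_velocities=[8.0, -6.0] + [0.0] * 17)
print(still == flailing)  # True: the reward cannot tell the two apart
```

More training on such a signal only makes the policy better at collecting it, which is consistent with the flailing seen in iteration 1.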

"Tell the robot explicitly: this is the standing pose — and pull every deviation back toward it."

That is the central design decision of the project. Not "stay vaguely upright" but a concrete target posture as a reference, with a reward that actively guides the policy there. Subsequent reward refinements (Lower/Upper anatomic split, foot-position lock) followed the same pattern: domain knowledge translated into reward structure before retraining.
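The two later refinements named above can be sketched in the same spirit. Everything numeric here is an assumption: the weights, the ordering of the velocity vector (leg joints first), and the use of horizontal foot displacement as the lock signal. The sketch only shows the structure, a heavier velocity penalty on the legs than on the upper body plus a cost for feet that move away from where they started.

```python
import math

N_LEG_JOINTS = 10       # assumption: the leg joints come first in the vector
LEG_WEIGHT = 0.05       # assumption: heavier penalty on leg motion
UPPER_WEIGHT = 0.01     # assumption: lighter penalty on torso and arm motion
FOOT_LOCK_WEIGHT = 5.0  # assumption


def velocity_penalty(joint_velocities):
    """Squared-velocity cost, weighted differently for lower and upper body."""
    legs = joint_velocities[:N_LEG_JOINTS]
    upper = joint_velocities[N_LEG_JOINTS:]
    leg_cost = LEG_WEIGHT * sum(v * v for v in legs)
    upper_cost = UPPER_WEIGHT * sum(v * v for v in upper)
    return leg_cost + upper_cost


def foot_lock_penalty(foot_xy, foot_xy_at_reset):
    """Cost proportional to each foot's horizontal distance from its reset position."""
    total = 0.0
    for (x, y), (x0, y0) in zip(foot_xy, foot_xy_at_reset):
        total += math.hypot(x - x0, y - y0)
    return FOOT_LOCK_WEIGHT * total


def stillness_cost(joint_velocities, foot_xy, foot_xy_at_reset):
    """Total cost that would be subtracted from the reward at each step."""
    return velocity_penalty(joint_velocities) + foot_lock_penalty(foot_xy, foot_xy_at_reset)


# The same unit of speed costs more in a leg joint than in an arm joint.
leg_motion = [1.0] + [0.0] * 18
arm_motion = [0.0] * 18 + [1.0]
print(velocity_penalty(leg_motion) > velocity_penalty(arm_motion))  # True
```

The asymmetry encodes a piece of domain knowledge: small arm movements matter little for balance, while leg movement shifts the support base.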

The reward engineering workflow

Each iteration is traceable to one domain insight. No iteration was started without a hypothesis about what was limiting the previous one.

Define verifiable goal → Diagnose failure mode → Translate domain knowledge to reward term → Train → Measure outcome → Repeat
Iteration 2: dominant pose-reference term
pose_reward = 2.0 * exp(-6 * mean squared joint deviation from stand pose)

The term peaks at 2.0 in the exact standing posture and decays exponentially with any deviation. It was combined with a mild anti-jitter term. Stand time nearly tripled versus iteration 1 (159 → 456).
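The pose-reference formula translates almost directly into code. The sketch below implements only that one term; the all-zero reference pose is a placeholder rather than the H1's real standing pose, joint angles are assumed to be in radians, and the anti-jitter term is left out because its form is not given here.

```python
import math


def pose_reward(joint_angles, stand_pose, scale=2.0, sharpness=6.0):
    """pose_reward = scale * exp(-sharpness * mean squared joint deviation)."""
    if len(joint_angles) != len(stand_pose):
        raise ValueError("joint_angles and stand_pose must have the same length")
    squared = [(q - q_ref) ** 2 for q, q_ref in zip(joint_angles, stand_pose)]
    mean_sq_dev = sum(squared) / len(squared)
    return scale * math.exp(-sharpness * mean_sq_dev)


stand_pose = [0.0] * 19  # placeholder reference, one entry per actuated joint

exact = pose_reward(stand_pose, stand_pose)         # 2.0 at the reference pose
slightly_off = pose_reward([0.1] * 19, stand_pose)  # a little below 2.0
far_off = pose_reward([0.7] * 19, stand_pose)       # close to zero

print(exact, round(slightly_off, 3), round(far_off, 3))
```

Because the reward is smooth and strictly decreasing in the deviation, every step toward the reference pose is rewarded, which gives the policy a gradient to follow instead of a sparse pass/fail signal.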

6 Honest Constraint

The Ankle Limit — a Physical Bottleneck, Not an Algorithm Bug

Iteration 6 reaches 96.7 % success but statue-level stillness is not yet achieved. The reason is identified and measured.

Ankle actuator load: 82 %

MuJoCo inverse dynamics analysis shows ankle actuators running at 82 % of their torque limit just to maintain position. The remaining 18 % capacity leaves little room for active balance correction. This is a physical constraint of the H1 model, not a reward design problem.
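The utilisation figure itself is simple arithmetic once the holding torque is known. The sketch below uses hypothetical torque values chosen to reproduce the 82 % figure; they are not measurements from the project. In MuJoCo, the required joint forces could be obtained with mujoco.mj_inverse, which writes them to data.qfrc_inverse; that step is omitted to keep the example dependency-free.

```python
def torque_utilisation(required_torque_nm, torque_limit_nm):
    """Fraction of an actuator's torque limit needed to hold the current pose."""
    if torque_limit_nm <= 0:
        raise ValueError("torque_limit_nm must be positive")
    return abs(required_torque_nm) / torque_limit_nm


def torque_headroom(required_torque_nm, torque_limit_nm):
    """Fraction of the limit still available for balance corrections."""
    return max(0.0, 1.0 - torque_utilisation(required_torque_nm, torque_limit_nm))


ankle_limit_nm = 40.0     # hypothetical actuator limit
ankle_holding_nm = -32.8  # hypothetical holding torque (the sign is ignored)

utilisation = torque_utilisation(ankle_holding_nm, ankle_limit_nm)
headroom = torque_headroom(ankle_holding_nm, ankle_limit_nm)
print(f"{utilisation:.0%} used, {headroom:.0%} left")  # 82% used, 18% left
```

Any disturbance that needs more than the remaining headroom saturates the actuator, so a policy cannot compensate for it through the ankle alone, however well it is trained.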

The planned next step (Phase 5: Balance Assist) addresses this via ground-reaction-force feedback and a curriculum assist that reduces the torque load required from the ankle actuators.

7 What's Next

Roadmap — Honest Status

Phase 1 (standing) is complete. What remains is stated plainly.

Phase | Status | Core mechanism
Phase 5: Balance Assist | Designed, not trained | Ground-reaction-force feedback + curriculum assist to offload ankle actuators. Target: statue-level stillness.
Phase 2: Get Up | Code complete, not trained | 8-stage choreography (sit-up → squat → stand). Requires Phase 5 base model + larger compute budget.
ROS2 Integration | Not in this project | ROS2/Jazzy work runs in a parallel project (Gazebo + Blender simulation). Bridge to this RL policy not yet built.
Sim2Real Transfer | Not performed | Domain randomisation, motor latency, friction variance on physical H1 hardware: next step before any hardware test.
🔗
Full code, reward formulas, training logs — Apache 2.0

github.com/OlafStolle/h1-stand-rl

8 About the Maker

Olaf Stolle — Ai-Crafters

20+ years bridging hardware and software. Robotics is the natural next step in that arc.

Background covers construction 3D modelling, custom GIS tooling (ArcGIS Add-ons, spatial algorithms, Leaflet/PostGIS), embedded telemetry (off-grid inverter API, water meter OCR on Raspberry Pi), and three years of AI/automation consulting as a freelancer via Ai-Crafters — covering n8n, Claude, OpenAI, Voice-AI and RAG pipelines.

Formal qualification: Maurermeister — DQR/EQF Level 6 (Bachelor Professional in the European Qualifications Framework). Domain expertise from structural and civil engineering work feeds directly into the physics intuition applied to reward design here.

This project connects all three strands: physical domain knowledge, engineering practice from hardware work, and AI tooling from automation consulting.

Contact: info@ai-crafters.io · ai-crafters.io