Reinforcement learning on the Unitree H1 in MuJoCo: 6 iterations, success rate from 0 % to 97 %.
The full learning arc at 3× speed. Left: iteration 1 collapses immediately. Middle: iteration 4 stands but drifts. Right: iteration 6 — upright, stable, minimal sway.
Side-by-side comparison, 3× playback speed. All three clips are 1000-step episodes (~10 seconds of simulation time).
Load the Unitree H1 into MuJoCo — a physics engine that simulates gravity, mass and joints accurately — and release it without a controller. It falls immediately. That is the expected baseline. Standing is active work.
Hips, knees, ankles, torso and arms — each motor must deliver the right torque at every timestep.
25 degrees of freedom (the 19 actuated joints plus the 6 of the free-floating base) that must stay balanced simultaneously, continuously, on a millisecond timescale.
No human can manually coordinate 19 motors fast enough to keep a humanoid upright. The controller must be learned.
A vague objective cannot be verified. The first step was translating the intent into a concrete success criterion.
Hold the torso at approximately 0.98 m height for 1000 simulation steps (~10 seconds), upright, without falling.
0.98 m is the torso height in the H1's default upright pose. Reaching this criterion is measurable, not asserted.
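A minimal sketch of how such a criterion could be checked per episode. The target height and episode length come from the text above; the height tolerance, the uprightness threshold and the function name are assumptions made for illustration, not the project's actual values.

```python
from typing import Sequence

TARGET_HEIGHT_M = 0.98  # from the text: torso height in the default upright pose
EPISODE_STEPS = 1000    # from the text: roughly 10 s of simulation time
HEIGHT_TOL_M = 0.10     # assumption: allowed deviation from the target height
MIN_UPRIGHT = 0.9       # assumption: minimum z-component of the torso's up axis


def episode_succeeded(heights: Sequence[float], uprights: Sequence[float]) -> bool:
    """Return True if the torso stayed near the target height and upright
    for the full episode. Inputs hold one logged value per simulation step."""
    if len(heights) < EPISODE_STEPS or len(uprights) < EPISODE_STEPS:
        return False  # the episode ended early, i.e. the robot fell
    return all(
        abs(h - TARGET_HEIGHT_M) <= HEIGHT_TOL_M and u >= MIN_UPRIGHT
        for h, u in zip(heights[:EPISODE_STEPS], uprights[:EPISODE_STEPS])
    )
```

Expressed this way, "approximately 0.98 m" becomes an explicit tolerance that can be tightened or loosened without touching the training code.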
Before writing code, three realistic paths were evaluated.
A custom training environment built around the actual H1 MJCF model, trained with the PPO algorithm (Stable-Baselines3). Full control over observations, rewards and termination logic — and the exact robot in question.
MJX massively accelerates simulation — but only with a GPU. On a CPU-only machine it adds complexity without benefit.
Pre-built environments use a generic humanoid, not the Unitree H1. Results would not transfer to the real hardware.
| Component | Version | Role |
|---|---|---|
| PyTorch | 2.x (CPU) | Neural network & training |
| MuJoCo | 3.8 | Physics simulation |
| Gymnasium | 1.2 | Environment interface |
| Stable-Baselines3 | 2.8 | PPO implementation |
Each iteration changed one thing, measured the outcome, and traced the improvement or failure back to a specific design decision.
| Iteration | Training steps | Success rate | Mean stand time (steps) | Key change |
|---|---|---|---|---|
| 1 — Reward v1 | 3 M | 0 % | 159 | Baseline: alive-bonus + height + upright. Robot exploits reward, flails. |
| 2 — Pose reference | 5 M | 5 % | 456 | Human-in-the-loop: explicit stand-pose reference added as dominant reward term. |
| 3 — Resume | 15 M | 25 % | 594 | Continued from Iter 2. Upward trend but training instability (clip_fraction 0.76). |
| 4 — Stabilised | 25 M | 83.3 % | 930 | Learning rate 3e-4 → 1e-4. Clean convergence; robot stands upright. |
| 5 — Stillness | 35 M | 90 % | 949 | Velocity penalty over all joints. XY-drift cut from 3.4 m to 0.53 m. |
| 6 — Lower/Upper + Foot-Lock | 45 M | 96.7 % | 990 | Anatomic split: legs penalised harder than arms. Foot-position lock added. |
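The two metrics in the table can be derived from the lengths of the evaluation episodes. The sketch below assumes an episode counts as a success when it runs the full 1000 steps. That is a simplification, since the actual criterion also constrains height and uprightness, and the episode lengths shown are invented for illustration, not measured data.

```python
from statistics import mean
from typing import Sequence

EPISODE_STEPS = 1000  # from the text: full episode length


def summarise(episode_lengths: Sequence[int]) -> tuple[float, float]:
    """Return (success rate in percent, mean stand time in steps)."""
    successes = sum(1 for n in episode_lengths if n >= EPISODE_STEPS)
    return 100.0 * successes / len(episode_lengths), mean(episode_lengths)


# Hypothetical evaluation run: five episodes, one fall at step 500.
lengths = [1000, 1000, 1000, 1000, 500]
rate, stand_time = summarise(lengths)
print(f"success {rate:.1f} %, mean stand time {stand_time:.0f} steps")
```

Reporting both numbers matters: a policy that falls at step 990 in every episode would score 0 % success but a mean stand time close to the maximum.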
After iteration 1, more compute would not have fixed the problem. The reward structure itself was wrong. The diagnosis: the agent was maximising the reward signal — not the intended behaviour. The solution required a domain decision, not more training.
That is the central design decision of the project. Not "stay vaguely upright" but a concrete target posture as a reference, with a reward that actively guides the policy there. Subsequent reward refinements (Lower/Upper anatomic split, foot-position lock) followed the same pattern: domain knowledge translated into reward structure before retraining.
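The two later refinements named above can be sketched as penalty terms. All weights, index sets and function names here are illustrative assumptions. The source states only that legs are penalised harder than arms and that a foot-position lock was added, not how either was implemented.

```python
from typing import Sequence

# Illustrative weights (assumptions, not the project's values).
LEG_WEIGHT = 0.10   # legs are penalised harder ...
ARM_WEIGHT = 0.01   # ... than arms
FOOT_WEIGHT = 5.0   # penalty per squared metre of foot displacement


def stillness_penalty(
    joint_vel: Sequence[float],
    leg_idx: Sequence[int],
    arm_idx: Sequence[int],
) -> float:
    """Squared joint-velocity penalty with separate weights for legs and arms."""
    leg = sum(joint_vel[i] ** 2 for i in leg_idx)
    arm = sum(joint_vel[i] ** 2 for i in arm_idx)
    return LEG_WEIGHT * leg + ARM_WEIGHT * arm


def foot_lock_penalty(
    foot_xy: Sequence[float],
    foot_xy_ref: Sequence[float],
) -> float:
    """Penalise horizontal displacement of a foot from its initial position."""
    return FOOT_WEIGHT * sum((p - r) ** 2 for p, r in zip(foot_xy, foot_xy_ref))
```

Both terms would be subtracted from the per-step reward. Keeping the arm weight small leaves the policy free to use the arms for balance while discouraging leg motion.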
Each iteration is traceable to one domain insight. No iteration was started without a hypothesis about what was limiting the previous one.
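The pose-reference reward is at its maximum in the exact standing posture and decays with any deviation from it. It is combined with a mild anti-jitter term. With this change, mean stand time nearly tripled versus iteration 1 (from 159 to 456 steps).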
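One common way to build a reward with this shape is an exponential of the squared pose error. The sketch below is a generic illustration under that assumption. The exponential kernel, the scale, the action-rate form of the anti-jitter term and all names are hypothetical and not taken from the project's code.

```python
import math
from typing import Sequence

POSE_SCALE = 2.0      # assumption: how quickly the reward decays with pose error
JITTER_WEIGHT = 0.01  # assumption: weight of the anti-jitter term


def pose_reward(q: Sequence[float], q_ref: Sequence[float]) -> float:
    """1.0 at the exact reference posture, decaying towards 0 with deviation."""
    err = sum((a - b) ** 2 for a, b in zip(q, q_ref))
    return math.exp(-POSE_SCALE * err)


def jitter_penalty(action: Sequence[float], prev_action: Sequence[float]) -> float:
    """Penalise rapid changes between consecutive actions."""
    return JITTER_WEIGHT * sum((a - b) ** 2 for a, b in zip(action, prev_action))


def step_reward(
    q: Sequence[float],
    q_ref: Sequence[float],
    action: Sequence[float],
    prev_action: Sequence[float],
) -> float:
    """Pose term minus the anti-jitter term for a single simulation step."""
    return pose_reward(q, q_ref) - jitter_penalty(action, prev_action)
```

Unlike an alive bonus, this term gives the policy a gradient towards one specific posture, so flailing that happens to keep the torso high no longer scores well.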
Iteration 6 reaches 96.7 % success but statue-level stillness is not yet achieved. The reason is identified and measured.
MuJoCo inverse dynamics analysis shows ankle actuators running at 82 % of their torque limit just to maintain position. The remaining 18 % capacity leaves little room for active balance correction. This is a physical constraint of the H1 model, not a reward design problem.
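The utilisation figure is a ratio of required torque to torque limit. In MuJoCo, `mj_inverse` fills `data.qfrc_inverse` with the generalised forces needed for a given state and acceleration; how those were mapped to actuators in this project is not stated. The sketch below therefore takes the required torques and limits as plain inputs, and the numbers are hypothetical values chosen only to reproduce the 82 % figure.

```python
from typing import Sequence


def torque_utilisation(
    required: Sequence[float],
    limits: Sequence[float],
) -> list[float]:
    """Fraction of each actuator's torque limit needed to hold the pose."""
    return [abs(t) / lim for t, lim in zip(required, limits)]


# Hypothetical values in N·m, not measurements from the H1 model.
ankle_required_nm = [32.8]
ankle_limit_nm = [40.0]

util = torque_utilisation(ankle_required_nm, ankle_limit_nm)[0]
print(f"utilisation {util:.0%}, headroom {1 - util:.0%}")
```

The headroom is what remains for corrective torque when the robot is disturbed, which is why a high static utilisation limits how still the policy can hold the pose.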
The planned next step (Phase 5: Balance Assist) addresses this via ground-reaction-force feedback and a curriculum assist that reduces the torque load required from the ankle actuators.
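Since Phase 5 is designed but not trained, the following is only a generic illustration of one way a curriculum assist can be scheduled: an external support whose strength is annealed linearly as training progresses. The linear schedule, the parameter names and the default values are all assumptions.

```python
def assist_scale(
    step: int,
    anneal_steps: int,
    start: float = 1.0,
    end: float = 0.0,
) -> float:
    """Linearly anneal the assist strength from `start` to `end`.

    `step` is the current training step; after `anneal_steps` the value
    stays at `end`.
    """
    frac = min(max(step / anneal_steps, 0.0), 1.0)
    return start + (end - start) * frac
```

At each training step the returned factor would scale whatever assisting force or torque the environment applies, so the policy starts with support and gradually has to carry the full load itself.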
Phase 1 (standing) is complete. What remains is stated plainly.
| Phase | Status | Core mechanism |
|---|---|---|
| Phase 5 — Balance Assist | Designed, not trained | Ground-reaction-force feedback + curriculum assist to offload ankle actuators. Target: statue-level stillness. |
| Phase 2 — Get Up | Code complete, not trained | 8-stage choreography (sit-up → squat → stand). Requires Phase 5 base model + larger compute budget. |
| ROS2 Integration | Not in this project | ROS2/Jazzy work runs in a parallel project (Gazebo + Blender simulation). Bridge to this RL policy not yet built. |
| Sim2Real Transfer | Not performed | Domain randomisation, motor latency, friction variance on physical H1 hardware — next step before any hardware test. |
20+ years bridging hardware and software. Robotics is the natural next step in that arc.
Background covers construction 3D modelling, custom GIS tooling (ArcGIS Add-ons, spatial algorithms, Leaflet/PostGIS), embedded telemetry (off-grid inverter API, water meter OCR on Raspberry Pi), and three years of AI/automation consulting as a freelancer via Ai-Crafters — covering n8n, Claude, OpenAI, Voice-AI and RAG pipelines.
Formal qualification: Maurermeister (master mason), DQR/EQF Level 6 (Bachelor Professional in the European Qualifications Framework). Domain expertise from structural and civil engineering work feeds directly into the physics intuition applied to reward design here.
This project connects all three strands: physical domain knowledge, engineering practice from hardware work, and AI tooling from automation consulting.