How neural network bots learn to fight through billions of simulated rounds—from a critical bug fix to ship-specific abilities, with surprises at every stage.
Two machines work in parallel. A Mac runs game development, evaluation, and pushes training instructions. A Windows machine with an RTX 3090 trains models on GPU. They coordinate through git—the simplest protocol that works.
Training happens in a custom JAX physics simulation that mirrors the C++ game engine. 4,096 arenas run in parallel on a single GPU, each simulating complete combat with Newtonian physics, projectile collision, energy management, and up to 17 combat abilities (14 base + 3 ship-specific).
The simulation must match the game exactly or the trained model won't transfer. 31 cross-check tests verify parity: thrust values, wall bouncing, bullet damage, bomb splash radius, mine proximity triggers, ship-specific stats, ability mechanics. When a value changes in C++, the same change must be made in Python. 189 behavior tests and 86 visual tests verify the UE5 side.
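Throughput comes from batching: one tick of one arena is written as a pure function, then vectorized across all arenas and compiled once. A minimal sketch of that pattern, assuming the batched-arena layout described above; the function and field names are illustrative, not the project's actual API:

```python
import jax
import jax.numpy as jnp

DT = 1.0 / 30.0  # illustrative tick length

# One arena: integrate ship positions/velocities for a single tick.
# The real step also handles energy, cooldowns, projectiles, mines, abilities.
def step_arena(state, thrust):
    pos, vel = state
    vel = vel + thrust * DT            # Newtonian integration
    pos = pos + vel * DT
    pos = jnp.clip(pos, -1.0, 1.0)     # stand-in for wall handling
    return pos, vel

# vmap batches the single-arena step across 4,096 independent arenas;
# jit fuses the whole tick into a handful of GPU kernels.
step_all = jax.jit(jax.vmap(step_arena))

arenas, ships = 4096, 4
state = (jnp.zeros((arenas, ships, 2)), jnp.zeros((arenas, ships, 2)))
thrust = jnp.zeros((arenas, ships, 2))
state = step_all(state, thrust)
```

The real step function is far richer, but it has the same shape: one device call advances every arena at once, which is what makes 570K steps/sec reachable.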
The neural network sees the world as 96 numbers, matching what a human player could perceive:
| Feature | Size | What it encodes |
|---|---|---|
| Self | 22 | Velocity, heading, energy, cooldowns, ability charges, ship type |
| Enemies (3 nearest) | 27 | Relative position, velocity, heading, energy, alive status, ship type |
| Zone | 3 | KOTH zone direction, inside/outside |
| Tactical | 4 | Round timer, speed, alive enemies, energy advantage |
| Walls | 12 | 4 boundary distances + 8 interior wall raycasts |
| Mines | 12 | 4 nearest mines (position, friendly/enemy) |
| Projectiles | 10 | 5 nearest incoming bullets |
| Pickups | 6 | 2 nearest item pickups (position, type) |
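The table amounts to a fixed concatenation of feature blocks. The snippet below is just the bookkeeping implied by the sizes above; the block ordering and helper name are illustrative:

```python
# Observation layout from the table above; sizes must sum to the 96-float input.
OBS_LAYOUT = {
    "self": 22,         # velocity, heading, energy, cooldowns, charges, ship type
    "enemies": 27,      # 3 nearest enemies
    "zone": 3,          # KOTH zone direction, inside/outside
    "tactical": 4,      # round timer, speed, alive enemies, energy advantage
    "walls": 12,        # 4 boundary distances + 8 interior raycasts
    "mines": 12,        # 4 nearest mines
    "projectiles": 10,  # 5 nearest incoming bullets
    "pickups": 6,       # 2 nearest item pickups
}
assert sum(OBS_LAYOUT.values()) == 96

def slice_obs(obs):
    """Split a flat 96-float observation back into named blocks."""
    out, start = {}, 0
    for name, size in OBS_LAYOUT.items():
        out[name] = obs[start:start + size]
        start += size
    return out
```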
The base "team" preset has 14 actions (40 logits). Ship-specific presets add 1–3 extra binary actions for unique abilities:
| Action | Bins | What it controls |
|---|---|---|
| Rotation | 11 | Turn rate from hard-left to hard-right |
| Thrust | 5 | Reverse, brake, coast, half, full |
| Bullet | 2 | Primary weapon. Hydra/Tempest fire 3 linked. Lurker weakest (80 dmg) |
| Bomb | 2 | Slow explosive, area damage. Tempest bomb bounces off 1 wall before detonating |
| Mine | 2 | Proximity trap. Comet/Tempest/Bastion have none |
| Repulsor | 2 | Deflects nearby enemy projectiles (from green pickups, max 2 charges) |
| MIRV | 2 | Cluster missile, splits 1→8. Choir ships: Spore Bomb instead (from green pickups) |
| Ripper | 2 | Piercing beam. Choir/Tempest: 360° Burst instead (from green pickups) |
| Overdrive | 2 | Speed boost, drains energy. All ships have this |
| Stealth | 2 | Invisibility, drains energy (Specter built-in, others don't use) |
| Burst | 2 | 360° bullet burst, 24 bullets (from green pickups, max 3 charges) |
| Rocket | 2 | 6s speed boost for hit-and-run (Comet built-in, 20s cooldown) |
| Warp | 2 | Short-range teleport forward |
| Portal | 2 | Place entry/exit teleport pair (Bastion) |
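The 40 logits are a factored action space: one categorical head per action. A hedged sketch of the decode, assuming the head layout implied by the table (11 rotation bins, 5 thrust bins, 12 binary abilities); the head ordering is illustrative:

```python
import numpy as np

# One head per action: 11 rotation bins, 5 thrust bins,
# and 12 two-way (on/off) ability heads = 40 logits total.
HEAD_SIZES = [11, 5] + [2] * 12
assert sum(HEAD_SIZES) == 40

def decode_actions(logits):
    """Greedy decode: argmax within each head's slice of the logit vector."""
    actions, start = [], 0
    for size in HEAD_SIZES:
        actions.append(int(np.argmax(logits[start:start + size])))
        start += size
    return actions  # [rotation_bin, thrust_bin, bullet, bomb, mine, ...]
```

Ship-specific presets append 1–3 more binary heads, giving 42–46 logits, but the same decode loop applies.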
The simulation makes deliberate simplifications to hit 570K steps/sec. Each trade-off has a cost and a way we verify it doesn't break transfer to the real game.
| Trade-off | Cost | How we account for it |
|---|---|---|
| 11 rotation bins (~16° resolution) | Can't aim precisely between bins. Fine tracking limited. | Tested 21 bins in Phase 20—didn't improve win rate, but that test was confounded (started from scratch, not fine-tuned). Not conclusive. Would revisit if a strong model shows high engagement but low hit rate. |
| Fast MIRV (8 missiles instantly, no 1→2→4→8 cascade) | Bot won't learn split timing or wall-bounce-then-split tactics. | Phase 38 hit 95% WR with fast MIRV. If timing matters for higher-skill play, we remove the flag and retrain. |
| Fixed observation window (5 enemies, 5 bullets, 2 pickups) | Blind to threats outside the nearest 5. Would miss flankers in large battles. | FFA has 4 ships, so 5 enemy slots exceeds actual count. Needs revisiting for 8+ player modes. |
| Simplified physics (2D, no visual effects) | Wall bounce angles, projectile inheritance, energy drain could diverge from UE5. | 31 cross-check tests verify parity on every core mechanic. When C++ changes, Python must match. |
| Single frozen opponent per training phase | Overspecialization. Phases 14, 36, 39 all regressed from equal-strength opponents. | Self-play works once against a clearly weaker opponent (Phase 35 recipe). Multi-opponent rollout code exists but only uses frozen_pool[0]—fix pending. |
The reward function has 9 tunable coefficients, each exposed as a CLI flag. The defaults were calibrated through iteration—the first 10 phases established which signals matter, and the eval pipeline tells us when to adjust.
| Signal | Default | What it encourages |
|---|---|---|
| Damage dealt | 1/500 | Base combat reward |
| Bullet bonus | 0.8 | Primary weapon accuracy (prevents bomb-only play) |
| Kill | 1.0 | Finishing kills |
| Death penalty | 1.0 | Staying alive |
| Engage distance | 1.0 | Approaching enemies (prevents passive kiting) |
| Mine/ability bonus | 0.3/0.2 | Using all weapons, not just bullets |
| Item pickup | 0.05 | Collecting green pickups for ability charges |
| Item proximity | 0.1 | Moving toward item spawns |
| Time pressure | 0.002 | Decisive play (penalizes stalling) |
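Since each coefficient is a CLI flag, the whole table reduces to a small config plus a weighted sum of per-step event signals. A minimal sketch under that assumption; the field names, event keys, and exact combination are illustrative (only the defaults come from the table):

```python
from dataclasses import dataclass

@dataclass
class RewardConfig:
    # Defaults from the table above; each is exposed as a CLI flag.
    damage_scale: float = 1.0 / 500.0        # damage dealt / 500
    bullet_bonus: float = 0.8
    kill: float = 1.0
    death_penalty: float = 1.0
    engage: float = 1.0
    mine_ability_bonus: tuple = (0.3, 0.2)   # mines / other abilities
    pickup: float = 0.05
    item_proximity: float = 0.1
    time_pressure: float = 0.002

def step_reward(cfg, ev):
    """Hedged sketch: weighted sum of per-step event signals in `ev`."""
    return (
        cfg.damage_scale * ev["damage_dealt"]
        + cfg.bullet_bonus * ev["bullet_hits"]
        + cfg.kill * ev["kills"]
        - cfg.death_penalty * ev["died"]
        + cfg.engage * ev["closed_distance"]
        + cfg.mine_ability_bonus[0] * ev["mine_hits"]
        + cfg.mine_ability_bonus[1] * ev["ability_uses"]
        + cfg.pickup * ev["items_collected"]
        + cfg.item_proximity * ev["moved_toward_item"]
        - cfg.time_pressure  # constant per-step penalty (illustrative)
    )
```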
Key insight: The best model (Phase 38) uses none of the shaping rewards. Pure kill/death + zone reward (0.3/step for KOTH) outperformed every shaped variant. The 9 coefficients above were useful for early exploration but the LR bug fix (Phase 32) eliminated the need for them—a properly trained optimizer finds aggressive play on its own.
After each training phase, an automated evaluation runs the model against random opponents and frozen previous-generation models. A diagnosis script compares metrics against behavioral thresholds and recommends reward flag adjustments.
Win rate alone can't distinguish a good fighter from a passive one that wins by not dying. The eval tracks 40+ metrics per bot:
| Category | Metrics |
|---|---|
| Combat | Win rate, decisiveness, kills/min, avg round time |
| Accuracy | Hit rate per weapon type, damage dealt/taken ratio |
| Behavior | Ability usage (7 types), item pickups, engagement distance |
| Efficiency | Energy per kill, min energy reached, overdrive frames |
| Strategy | Bomb+bullet combos, repulsor+MIRV combos, mine kills, avg TTK |
```
$ python3 eval_diagnose.py
ISSUES FOUND:
- Low hit rate: 2.6% (threshold: 5%)
- Zero item pickups: 0
- High engage distance: 3575u (threshold: 3000u)

RECOMMENDED REWARD FLAGS:
--reward-bullet-bonus 0.2 (reduce spam incentive)
--reward-engage 1.0 (force closer engagement)
--reward-pickup 0.15 (reward item collection)

BEFORE ACTING ON THESE RECOMMENDATIONS:
1. Compare against previous phase
2. Pick 1-2 flags max
3. Check if root cause is structural
4. Form a hypothesis: "Changed X because Y, expect Z"
```
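The logic behind those recommendations is plain thresholding. A minimal sketch of the pattern, with thresholds and flag strings taken from the output above; the function name and metric keys are assumptions:

```python
# Thresholds and flag suggestions mirroring the example output above.
CHECKS = [
    ("hit_rate",        lambda m: m < 0.05, "--reward-bullet-bonus 0.2 (reduce spam incentive)"),
    ("item_pickups",    lambda m: m == 0,   "--reward-pickup 0.15 (reward item collection)"),
    ("engage_distance", lambda m: m > 3000, "--reward-engage 1.0 (force closer engagement)"),
]

def diagnose(metrics):
    """Return the reward-flag recommendations triggered by a metrics dict."""
    return [flag for name, is_bad, flag in CHECKS if is_bad(metrics[name])]
```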
The script recommends but doesn't decide. The human reviews, picks 1-2 changes, forms a hypothesis, and pushes instructions. Changing many variables at once makes results unattributable.
Each training phase follows a strict methodology: form a hypothesis ("increasing engage reward will fix passive play"), change one variable, train 2B steps, evaluate against multiple baselines (random, previous generations, best model), and diagnose using 40+ metrics—not just win rate. If the hypothesis was wrong, the metrics tell you why: was it passive play (high timeouts, low damage), overspecialization (beats one opponent, loses to random), or a reward imbalance (one ability dominates)?
44 phases, each building on the previous. The table below is an experiment log, not a changelog.
| Phase | WR | Change | Result |
|---|---|---|---|
| 11 | 42% | Engage shaping 0.5→1.0 | Baseline aggression established |
| 13 | 60% | Self-play vs Phase 12 (50% mix) | Duel breakthrough. All 7 abilities used. |
| 14 | 31% | Self-play vs Phase 13 (50% mix) | Regression. Self-play past one round overspecializes. |
| 20 | 43% | 21 rotation bins (5B steps) | Beat Phase 13 head-to-head (53%) but lost to random. Precision ≠ generalization. |
| 23 | 73% | FFA deathmatch (4 ships, team preset) | FFA breakthrough. Paradigm shift from duel to 4-player free-for-all. |
| 25 | 79% | Updated sim, 18 abilities, --fast-mirv | New best. But plays passively (31% timeouts), regressed vs Phase 13. |
| 26 | 79% | Engage reward 1.5, self-play vs Phase 13 | Null result. Every metric identical to Phase 25. |
| 27 | 79% | Ablation: 9 coefficients removed, 500M steps each | All identical. Converged policy can't be shifted by fine-tuning. |
| 28–31 | — | Various from-scratch experiments | All invalidated. LR schedule bug discovered—see below. |
| LR BUG FIX — ALL PHASES BELOW USE CORRECT LR SCHEDULE | |||
| 32 | 94% | Fresh start, pure kill/death reward, fixed LR | First real training. More improvement in 7h than 31 prior phases. |
| 33 | 94% | Warm-start P32 +2B steps | Broken arena curriculum acted as accidental regularizer—beat P32 62% H2H. |
| 35 | 96% | Self-play vs P32 at 30% mix | New best FFA. 76.8% vs Phase 13. Self-play recipe validated. |
| 36 | 96% | Self-play vs P33 (equal strength) | Regression. Equal-strength opponent causes overspecialization. |
| 38 | 95% | KOTH zone reward (0.3/step), 6 ships | Best overall. 99.6% 6-ship. Zone reward = engagement regularizer. |
| 41 | 86% | Viper + FocusFire (fresh, new action space) | Ship-specific ability learned (344 uses/match). Fresh start gap vs P38. |
| 43 | 82% | Lurker + MineDash (fresh, new action space) | 99.4% in 6-ship despite 4.3M training kills. MineDash used strategically. |
| 44 | — | Tempest + ShrapnelBurst (fresh) | Training. |
Phase 38 best model (post-LR fix): KOTH training with zone reward (0.3/step) warm-started from Phase 35 produced 95.2% WR in 4-ship FFA, 99.6% in 6-ship, and 73.4% vs Phase 35 head-to-head. The zone reward acted as an engagement regularizer—pulling ships toward center forced more combat, producing a stronger fighter than direct FFA training. Ship-specific models (Viper, Lurker) achieve 97–99% in 6-ship with their unique abilities.
Self-play doesn't scale (Phases 14–21). One round of self-play works (Phase 12→13). Iterating beyond that causes overspecialization at every mix level tested (50%, 30%, 20%). WR vs random is the canary—if it drops while WR vs the frozen opponent rises, the model is narrowing. The FFA pivot (Phase 23) solved this by using natural opponent diversity instead of artificial self-play.
More compute doesn't fix structural problems (Phases 19–22). Finer aim resolution (21 bins vs 11) plateaued at 43% regardless of step count. Different seeds with identical config produced 48% vs 37%. The breakthroughs (Phase 13, Phase 23) came from structural changes—self-play, FFA—not from more steps or finer control.
Converged policies are stuck (Phases 26–27). Increasing engagement reward, adding self-play opponents, and ablating all 9 reward coefficients individually—none of it shifted the Phase 25 model. Every metric stayed identical. Fine-tuning a converged checkpoint can't escape a local minimum.
Phase 28 ran from scratch with minimal reward. Results looked promising. Then a code review revealed the learning rate schedule was counting minibatch steps instead of PPO updates—the LR decayed to zero after less than one real update. Every model from Phase 1 through Phase 31 had trained with LR≈0 for 99%+ of compute. Phase 25's "79% WR" came from less than half a gradient step of actual learning.
The fix was one line in the optimizer setup. Phase 32 REDO—the first properly trained model—hit 94.4% vs random in a single 2B-step run. More improvement in 7 hours than the previous 31 phases combined.
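The failure mode is easy to reproduce with any step-counting schedule. A hedged reconstruction, assuming an optax-style linear decay; the names and numbers are illustrative, not the project's actual code:

```python
import optax

NUM_UPDATES = 1000      # intended decay horizon, in PPO updates
NUM_EPOCHS = 4          # PPO epochs per update (illustrative)
NUM_MINIBATCHES = 256   # minibatches per epoch (illustrative)

# Bug: optax advances the schedule once per *gradient step* (minibatch),
# so a horizon written in "PPO updates" is exhausted after NUM_UPDATES
# minibatch steps, i.e. within the first real update.
buggy_lr = optax.linear_schedule(3e-4, 0.0, transition_steps=NUM_UPDATES)

# Fix (one line): express the horizon in gradient steps.
fixed_lr = optax.linear_schedule(
    3e-4, 0.0, transition_steps=NUM_UPDATES * NUM_EPOCHS * NUM_MINIBATCHES)

optimizer = optax.adam(learning_rate=fixed_lr)
```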
Phase 35 warm-started from Phase 33 with Phase 32 as a frozen opponent at 30% mix. Result: 96.2% vs random, 76.8% vs Phase 13—new best across all benchmarks. But the recipe has strict constraints:
Opponent must be clearly weaker. Phase 36 used Phase 33 (roughly equal strength) as the opponent and regressed universally; every benchmark dropped. Phase 37 repeated Phase 35's recipe starting from Phase 35 itself and plateaued with identical results. Self-play against an equal-strength opponent causes overspecialization. Against a clearly weaker one (Phase 35 beat Phase 32 66% head-to-head), it works exactly once.
KOTH training added a per-step zone reward (0.3, vs kill reward of 1.0) pulling ships toward the map center. The intended effect was zone-seeking behavior. The actual effect was stronger: the KOTH model beat the best FFA model 73.4% head-to-head in pure FFA combat. Zone reward forced more engagements (ships near center fight more often), producing richer gradient signal. The best combat model came from training for an objective other than combat.
Each ship archetype gets a unique ability added to the sim and action space. The model trains from scratch with the expanded preset and learns when to use the ability alongside the 14 base combat actions.
| Ship | Ability | Mechanic | 6-ship WR | Usage/match |
|---|---|---|---|---|
| Viper | FocusFire | +25% bullet dmg, +50% speed for 4s | 97.2% | 344 |
| Lurker | MineDash | Dash 3x speed + drop 2 mines | 99.4% | 24 |
| Tempest | ShrapnelBurst | 8 bullets in 60° forward cone | training | |
Usage patterns reveal strategic learning: FocusFire activates 344 times per match (pre-engagement buff), while MineDash fires only 24 times (escape/engage tool, not spam). The models discover ability timing from reward signal alone—no explicit "use ability before fighting" shaping.
Phase 13 was the first model to use all 7 combat abilities meaningfully; prior models relied almost entirely on bullets and bombs.
What changed was the recipe: a strong base model (Phase 12, trained with engage + bullet shaping) plus one round of self-play. The base model learned what to do; self-play taught it when.
Trained models export from JAX (Flax parameters) to ONNX format, which UE5's Neural Network Engine (NNE) runs on CPU at inference time. The bot controller builds the same 96-float observation vector as the training sim and decodes the logit output into ship controls. Ship-specific models use different action presets (42–46 logits) but share the same observation space, so one inference pipeline handles all ship types.
Model size is ~130–140KB (two 128-unit hidden layers). Inference runs in <0.1ms per bot per frame. The game loads the correct model per ship archetype via the ability loadout system—Viper bots use the FocusFire model, Lurker bots use the MineDash model, and generic ships fall back to the base KOTH model.
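The quoted size is consistent with simple parameter arithmetic, assuming fp32 weights and a plain 96 → 128 → 128 → logits MLP (the hidden sizes and logit counts come from the text; everything else here is back-of-the-envelope):

```python
def mlp_bytes(inputs=96, hidden=128, outputs=40, bytes_per_param=4):
    """Parameter count (weights + biases) for a 2-hidden-layer MLP, in bytes."""
    params = (inputs * hidden + hidden) \
           + (hidden * hidden + hidden) \
           + (hidden * outputs + outputs)
    return params * bytes_per_param

print(mlp_bytes(outputs=40) / 1024)   # base preset: ~133 KB
print(mlp_bytes(outputs=46) / 1024)   # largest ship preset: ~136 KB
```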
This pipeline follows good experimental hygiene—single-variable changes, multiple baselines, behavioral metrics beyond win rate—but falls short of academic rigor in several ways worth being explicit about.
Single seed per experiment. Each phase runs once with one random seed. Phase 22 demonstrated seed sensitivity: identical hyperparameters produced 48% vs 37% depending on seed. The academic standard is 20+ seeds with confidence intervals. At 13 hours per run on a single RTX 3090, that's 10+ days per experiment—impractical here. Results should be read as "this seed produced X" rather than "this configuration reliably produces X."
Confidence intervals added late. Phases 11-25 were evaluated without confidence intervals. CIs (Wilson score, 95%) are now computed in the eval script. With 500 matches, the 95% CI is roughly ±3-4%, which means Phase 23 (73%) vs Phase 25 (79%) is likely significant, but Phase 24 (74%) vs Phase 23 (73%) probably isn't. Future phases report CIs; historical phases are point estimates only.
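For reference, the Wilson score interval has a closed form and reproduces the ±3-4% figure at 500 matches. A straightforward implementation (the function name is illustrative, not the eval script's):

```python
from math import sqrt

def wilson_ci(wins, n, z=1.96):
    """95% Wilson score interval for a win rate of wins/n."""
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# 500 matches at a 75% win rate: roughly (0.71, 0.79), i.e. about +/-4%.
print(wilson_ci(375, 500))
```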
Ablation requires from-scratch training. Phase 27 attempted to ablate 9 reward coefficients by fine-tuning from a converged checkpoint (500M steps each). All 9 produced identical behavior—the policy was locked. Post-LR-fix, Phase 32 REDO trained from scratch with pure kill/death reward (all shaping zeroed) and achieved 94.4% immediately—confirming the problem was the LR bug, not reward design.
Learning curves underutilized. Training logs capture entropy, policy loss, and value loss per update, but these aren't systematically analyzed for plateau detection or collapse warnings. Plotting learning curves across phases would give earlier signal on whether a run is worth continuing.
More ship abilities. Three Lattice-faction item abilities are designed (Hex Burst, Cascade Bomb, Stun Mine) and planned for Bastion and Titan. The action space is forward-compatible—reserved slots for faction-specific items avoid retraining existing models.
Fix multi-opponent rollout. The --opponent flag accepts multiple paths but only uses the first (frozen_pool[0]). Fixing this would enable opponent diversity without manual rotation—training against a pool of 3–5 prior generations simultaneously.
Larger models for ship-specific training. Ship-specific models start from scratch and need 4B+ steps to match the generic model's accumulated lineage (6B+ across P32→P38). A 256×256 network or longer training budget would close this gap.