How neural network bots learn to fight through billions of simulated rounds—from a critical bug fix to ship-specific abilities, with surprises at every stage.
Two machines work in parallel. A Mac runs game development, evaluation, and pushes training instructions. A Windows machine with an RTX 3090 trains models on GPU. They coordinate through git—the simplest protocol that works.
Training happens in a custom JAX physics simulation that mirrors the C++ game engine. 4,096 arenas run in parallel on a single GPU, each simulating complete combat with Newtonian physics, projectile collision, energy management, and up to 17 combat abilities (14 base + 3 ship-specific).
The simulation must match the game exactly or the trained model won't transfer. 31 cross-check tests verify parity: thrust values, wall bouncing, bullet damage, bomb splash radius, mine proximity triggers, ship-specific stats, ability mechanics. When a value changes in C++, the same change must be made in Python. 189 behavior tests and 86 visual tests verify the UE5 side.
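Throughput comes from batching: one tick of one arena is written as a pure function, then vectorized across all arenas and compiled once. A minimal sketch of that pattern, assuming the batched-arena layout described above; the function and field names are illustrative, not the project's actual API:

```python
import jax
import jax.numpy as jnp

DT = 1.0 / 30.0  # illustrative tick length

# One arena: integrate ship positions/velocities for a single tick.
# The real step also handles energy, cooldowns, projectiles, mines, abilities.
def step_arena(state, thrust):
    pos, vel = state
    vel = vel + thrust * DT            # Newtonian integration
    pos = pos + vel * DT
    pos = jnp.clip(pos, -1.0, 1.0)     # stand-in for wall handling
    return pos, vel

# vmap batches the single-arena step across 4,096 independent arenas;
# jit fuses the whole tick into a handful of GPU kernels.
step_all = jax.jit(jax.vmap(step_arena))

arenas, ships = 4096, 4
state = (jnp.zeros((arenas, ships, 2)), jnp.zeros((arenas, ships, 2)))
thrust = jnp.zeros((arenas, ships, 2))
state = step_all(state, thrust)
```

The real step function is far richer, but it has the same shape: one device call advances every arena at once, which is what makes 570K steps/sec reachable.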
The neural network sees the world as 96 numbers, matching what a human player could perceive:
| Feature | Size | What it encodes |
|---|---|---|
| Self | 22 | Velocity, heading, energy, cooldowns, ability charges, ship type |
| Enemies (3 nearest) | 27 | Relative position, velocity, heading, energy, alive status, ship type |
| Zone | 3 | KOTH zone direction, inside/outside |
| Tactical | 4 | Round timer, speed, alive enemies, energy advantage |
| Walls | 12 | 4 boundary distances + 8 interior wall raycasts |
| Mines | 12 | 4 nearest mines (position, friendly/enemy) |
| Projectiles | 10 | 5 nearest incoming bullets |
| Pickups | 6 | 2 nearest item pickups (position, type) |
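The table amounts to a fixed concatenation of feature blocks. The snippet below is just the bookkeeping implied by the sizes above; the block ordering and helper name are illustrative:

```python
# Observation layout from the table above; sizes must sum to the 96-float input.
OBS_LAYOUT = {
    "self": 22,         # velocity, heading, energy, cooldowns, charges, ship type
    "enemies": 27,      # 3 nearest enemies
    "zone": 3,          # KOTH zone direction, inside/outside
    "tactical": 4,      # round timer, speed, alive enemies, energy advantage
    "walls": 12,        # 4 boundary distances + 8 interior raycasts
    "mines": 12,        # 4 nearest mines
    "projectiles": 10,  # 5 nearest incoming bullets
    "pickups": 6,       # 2 nearest item pickups
}
assert sum(OBS_LAYOUT.values()) == 96

def slice_obs(obs):
    """Split a flat 96-float observation back into named blocks."""
    out, start = {}, 0
    for name, size in OBS_LAYOUT.items():
        out[name] = obs[start:start + size]
        start += size
    return out
```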
The base "team" preset has 14 actions (40 logits). Ship-specific presets add 1–3 extra binary actions for unique abilities:
| Action | Bins | What it controls |
|---|---|---|
| Rotation | 11 | Turn rate from hard-left to hard-right |
| Thrust | 5 | Reverse, brake, coast, half, full |
| Bullet | 2 | Primary weapon. Hydra/Tempest fire 3 linked. Lurker weakest (80 dmg) |
| Bomb | 2 | Slow explosive, area damage. Tempest bomb bounces off 1 wall before detonating |
| Mine | 2 | Proximity trap. Comet/Tempest/Bastion have none |
| Repulsor | 2 | Deflects nearby enemy projectiles (from green pickups, max 2 charges) |
| MIRV | 2 | Cluster missile, splits 1→8. Choir ships: Spore Bomb instead (from green pickups) |
| Ripper | 2 | Piercing beam. Choir/Tempest: 360° Burst instead (from green pickups) |
| Overdrive | 2 | Speed boost, drains energy. All ships have this |
| Stealth | 2 | Invisibility, drains energy (Specter built-in, others don't use) |
| Burst | 2 | 360° bullet burst, 24 bullets (from green pickups, max 3 charges) |
| Rocket | 2 | 6s speed boost for hit-and-run (Comet built-in, 20s cooldown) |
| Warp | 2 | Short-range teleport forward |
| Portal | 2 | Place entry/exit teleport pair (Bastion) |
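The 40 logits are a factored action space: one categorical head per action. A hedged sketch of the decode, assuming the head layout implied by the table (11 rotation bins, 5 thrust bins, 12 binary abilities); the head ordering is illustrative:

```python
import numpy as np

# One head per action: 11 rotation bins, 5 thrust bins,
# and 12 two-way (on/off) ability heads = 40 logits total.
HEAD_SIZES = [11, 5] + [2] * 12
assert sum(HEAD_SIZES) == 40

def decode_actions(logits):
    """Greedy decode: argmax within each head's slice of the logit vector."""
    actions, start = [], 0
    for size in HEAD_SIZES:
        actions.append(int(np.argmax(logits[start:start + size])))
        start += size
    return actions  # [rotation_bin, thrust_bin, bullet, bomb, mine, ...]
```

Ship-specific presets append 1–3 more binary heads, giving 42–46 logits, but the same decode loop applies.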
The simulation makes deliberate simplifications to hit 570K steps/sec. Each trade-off has a cost and a way we verify it doesn't break transfer to the real game.
| Trade-off | Cost | How we account for it |
|---|---|---|
| 11 rotation bins (~16° resolution) | Can't aim precisely between bins. Fine tracking limited. | Tested 21 bins in Phase 20—didn't improve win rate, but that test was confounded (started from scratch, not fine-tuned). Not conclusive. Would revisit if a strong model shows high engagement but low hit rate. |
| Fast MIRV (8 missiles instantly, no 1→2→4→8 cascade) | Bot won't learn split timing or wall-bounce-then-split tactics. | Phase 38 hit 95% WR with fast MIRV. If timing matters for higher-skill play, we remove the flag and retrain. |
| Fixed observation window (5 enemies, 5 bullets, 2 pickups) | Blind to threats outside the nearest 5. Would miss flankers in large battles. | FFA has 4 ships, so 5 enemy slots exceeds actual count. Needs revisiting for 8+ player modes. |
| Simplified physics (2D, no visual effects) | Wall bounce angles, projectile inheritance, energy drain could diverge from UE5. | 31 cross-check tests verify parity on every core mechanic. When C++ changes, Python must match. |
| Single frozen opponent per training phase | Overspecialization. Phases 14, 36, 39 all regressed from equal-strength opponents. | Self-play works once against a clearly weaker opponent (Phase 35 recipe). Multi-opponent rollout code exists but only uses frozen_pool[0]—fix pending. |
The reward function has 9 tunable coefficients, each exposed as a CLI flag. The defaults were calibrated through iteration—the first 10 phases established which signals matter, and the eval pipeline tells us when to adjust.
| Signal | Default | What it encourages |
|---|---|---|
| Damage dealt | 1/500 | Base combat reward |
| Bullet bonus | 0.8 | Primary weapon accuracy (prevents bomb-only play) |
| Kill | 1.0 | Finishing kills |
| Death penalty | 1.0 | Staying alive |
| Engage distance | 1.0 | Approaching enemies (prevents passive kiting) |
| Mine/ability bonus | 0.3/0.2 | Using all weapons, not just bullets |
| Item pickup | 0.05 | Collecting green pickups for ability charges |
| Item proximity | 0.1 | Moving toward item spawns |
| Time pressure | 0.002 | Decisive play (penalizes stalling) |
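Since each coefficient is a CLI flag, the whole table reduces to a small config plus a weighted sum of per-step event signals. A minimal sketch under that assumption; the field names, event keys, and exact combination are illustrative (only the defaults come from the table):

```python
from dataclasses import dataclass

@dataclass
class RewardConfig:
    # Defaults from the table above; each is exposed as a CLI flag.
    damage_scale: float = 1.0 / 500.0        # damage dealt / 500
    bullet_bonus: float = 0.8
    kill: float = 1.0
    death_penalty: float = 1.0
    engage: float = 1.0
    mine_ability_bonus: tuple = (0.3, 0.2)   # mines / other abilities
    pickup: float = 0.05
    item_proximity: float = 0.1
    time_pressure: float = 0.002

def step_reward(cfg, ev):
    """Hedged sketch: weighted sum of per-step event signals in `ev`."""
    return (
        cfg.damage_scale * ev["damage_dealt"]
        + cfg.bullet_bonus * ev["bullet_hits"]
        + cfg.kill * ev["kills"]
        - cfg.death_penalty * ev["died"]
        + cfg.engage * ev["closed_distance"]
        + cfg.mine_ability_bonus[0] * ev["mine_hits"]
        + cfg.mine_ability_bonus[1] * ev["ability_uses"]
        + cfg.pickup * ev["items_collected"]
        + cfg.item_proximity * ev["moved_toward_item"]
        - cfg.time_pressure  # constant per-step penalty (illustrative)
    )
```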
Key insight: The best model (Phase 38) uses none of the shaping rewards. Pure kill/death + zone reward (0.3/step for KOTH) outperformed every shaped variant. The 9 coefficients above were useful for early exploration but the LR bug fix (Phase 32) eliminated the need for them—a properly trained optimizer finds aggressive play on its own.
After each training phase, an automated evaluation runs the model against random opponents and frozen previous-generation models. A diagnosis script compares metrics against behavioral thresholds and recommends reward flag adjustments.
Win rate alone can't distinguish a good fighter from a passive one that wins by not dying. The eval tracks 40+ metrics per bot:
| Category | Metrics |
|---|---|
| Combat | Win rate, decisiveness, kills/min, avg round time |
| Accuracy | Hit rate per weapon type, damage dealt/taken ratio |
| Behavior | Ability usage (7 types), item pickups, engagement distance |
| Efficiency | Energy per kill, min energy reached, overdrive frames |
| Strategy | Bomb+bullet combos, repulsor+MIRV combos, mine kills, avg TTK |
```
$ python3 eval_diagnose.py
ISSUES FOUND:
- Low hit rate: 2.6% (threshold: 5%)
- Zero item pickups: 0
- High engage distance: 3575u (threshold: 3000u)

RECOMMENDED REWARD FLAGS:
--reward-bullet-bonus 0.2 (reduce spam incentive)
--reward-engage 1.0 (force closer engagement)
--reward-pickup 0.15 (reward item collection)

BEFORE ACTING ON THESE RECOMMENDATIONS:
1. Compare against previous phase
2. Pick 1-2 flags max
3. Check if root cause is structural
4. Form a hypothesis: "Changed X because Y, expect Z"
```
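The logic behind those recommendations is plain thresholding. A minimal sketch of the pattern, with thresholds and flag strings taken from the output above; the function name and metric keys are assumptions:

```python
# Thresholds and flag suggestions mirroring the example output above.
CHECKS = [
    ("hit_rate",        lambda m: m < 0.05, "--reward-bullet-bonus 0.2 (reduce spam incentive)"),
    ("item_pickups",    lambda m: m == 0,   "--reward-pickup 0.15 (reward item collection)"),
    ("engage_distance", lambda m: m > 3000, "--reward-engage 1.0 (force closer engagement)"),
]

def diagnose(metrics):
    """Return the reward-flag recommendations triggered by a metrics dict."""
    return [flag for name, is_bad, flag in CHECKS if is_bad(metrics[name])]
```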
The script recommends but doesn't decide. The human reviews, picks 1-2 changes, forms a hypothesis, and pushes instructions. Changing many variables at once makes results unattributable.
Each training phase follows a strict methodology: form a hypothesis ("increasing engage reward will fix passive play"), change one variable, train 2B steps, evaluate against multiple baselines (random, previous generations, best model), and diagnose using 40+ metrics—not just win rate. If the hypothesis was wrong, the metrics tell you why: was it passive play (high timeouts, low damage), overspecialization (beats one opponent, loses to random), or a reward imbalance (one ability dominates)?
44 phases, each building on the previous. The table below is an experiment log, not a changelog.
| Phase | WR | Change | Result |
|---|---|---|---|
| 11 | 42% | Engage shaping 0.5→1.0 | Baseline aggression established |
| 13 | 60% | Self-play vs Phase 12 (50% mix) | Duel breakthrough. All 7 abilities used. |
| 14 | 31% | Self-play vs Phase 13 (50% mix) | Regression. Self-play past one round overspecializes. |
| 20 | 43% | 21 rotation bins (5B steps) | Beat Phase 13 head-to-head (53%) but lost to random. Precision ≠ generalization. |
| 23 | 73% | FFA deathmatch (4 ships, team preset) | FFA breakthrough. Paradigm shift from duel to 4-player free-for-all. |
| 25 | 79% | Updated sim, 18 abilities, --fast-mirv | New best. But plays passively (31% timeouts), regressed vs Phase 13. |
| 26 | 79% | Engage reward 1.5, self-play vs Phase 13 | Null result. Every metric identical to Phase 25. |
| 27 | 79% | Ablation: 9 coefficients removed, 500M steps each | All identical. Converged policy can't be shifted by fine-tuning. |
| 28–31 | — | Various from-scratch experiments | All invalidated. LR schedule bug discovered—see below. |
| LR BUG FIX — ALL PHASES BELOW USE CORRECT LR SCHEDULE | |||
| 32 | 94% | Fresh start, pure kill/death reward, fixed LR | First real training. More improvement in 7h than 31 prior phases. |
| 33 | 94% | Warm-start P32 +2B steps | Broken arena curriculum acted as accidental regularizer—beat P32 62% H2H. |
| 35 | 96% | Self-play vs P32 at 30% mix | New best FFA. 76.8% vs Phase 13. Self-play recipe validated. |
| 36 | 96% | Self-play vs P33 (equal strength) | Regression. Equal-strength opponent causes overspecialization. |
| 38 | 95% | KOTH zone reward (0.3/step), 6 ships | Best overall. 99.6% 6-ship. Zone reward = engagement regularizer. |
| 41 | 86% | Viper + FocusFire (fresh, new action space) | Ship-specific ability learned (344 uses/match). Fresh start gap vs P38. |
| 43 | 82% | Lurker + MineDash (fresh, new action space) | 99.4% in 6-ship despite 4.3M training kills. MineDash used strategically. |
| 44 | — | Tempest + ShrapnelBurst (fresh) | Training. |
Phase 38 best model (post-LR fix): KOTH training with zone reward (0.3/step) warm-started from Phase 35 produced 95.2% WR in 4-ship FFA, 99.6% in 6-ship, and 73.4% vs Phase 35 head-to-head. The zone reward acted as an engagement regularizer—pulling ships toward center forced more combat, producing a stronger fighter than direct FFA training. Ship-specific models (Viper, Lurker) achieve 97–99% in 6-ship with their unique abilities.
Self-play doesn't scale (Phases 14–21). One round of self-play works (Phase 12→13). Iterating beyond that causes overspecialization at every mix level tested (50%, 30%, 20%). WR vs random is the canary—if it drops while WR vs the frozen opponent rises, the model is narrowing. The FFA pivot (Phase 23) solved this by using natural opponent diversity instead of artificial self-play.
More compute doesn't fix structural problems (Phases 19–22). Finer aim resolution (21 bins vs 11) plateaued at 43% regardless of step count. Different seeds with identical config produced 48% vs 37%. The breakthroughs (Phase 13, Phase 23) came from structural changes—self-play, FFA—not from more steps or finer control.
Converged policies are stuck (Phases 26–27). Increasing engagement reward, adding self-play opponents, and ablating all 9 reward coefficients individually—none of it shifted the Phase 25 model. Every metric stayed identical. Fine-tuning a converged checkpoint can't escape a local minimum.
Phase 28 ran from scratch with minimal reward. Results looked promising. Then a code review revealed the learning rate schedule was counting minibatch steps instead of PPO updates—the LR decayed to zero after less than one real update. Every model from Phase 1 through Phase 31 had trained with LR≈0 for 99%+ of compute. Phase 25's "79% WR" came from less than half a gradient step of actual learning.
The fix was one line in the optimizer setup. Phase 32 REDO—the first properly trained model—hit 94.4% vs random in a single 2B-step run. More improvement in 7 hours than the previous 31 phases combined.
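The failure mode is easy to reproduce with any step-counting schedule. A hedged reconstruction, assuming an optax-style linear decay; the names and numbers are illustrative, not the project's actual code:

```python
import optax

NUM_UPDATES = 1000      # intended decay horizon, in PPO updates
NUM_EPOCHS = 4          # PPO epochs per update (illustrative)
NUM_MINIBATCHES = 256   # minibatches per epoch (illustrative)

# Bug: optax advances the schedule once per *gradient step* (minibatch),
# so a horizon written in "PPO updates" is exhausted after NUM_UPDATES
# minibatch steps, i.e. within the first real update.
buggy_lr = optax.linear_schedule(3e-4, 0.0, transition_steps=NUM_UPDATES)

# Fix (one line): express the horizon in gradient steps.
fixed_lr = optax.linear_schedule(
    3e-4, 0.0, transition_steps=NUM_UPDATES * NUM_EPOCHS * NUM_MINIBATCHES)

optimizer = optax.adam(learning_rate=fixed_lr)
```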
Phase 35 warm-started from Phase 33 with Phase 32 as a frozen opponent at 30% mix. Result: 96.2% vs random, 76.8% vs Phase 13—new best across all benchmarks. But the recipe has strict constraints:
Opponent must be clearly weaker. Phase 36 used Phase 33 (roughly equal strength) as the opponent and regressed universally; every benchmark dropped. Phase 37 repeated Phase 35's recipe starting from Phase 35 itself and plateaued with identical results. Self-play against an equal-strength opponent causes overspecialization. Against a clearly weaker one (Phase 35 beat Phase 32 66% head-to-head), it works exactly once.
KOTH training added a per-step zone reward (0.3, vs kill reward of 1.0) pulling ships toward the map center. The intended effect was zone-seeking behavior. The actual effect was stronger: the KOTH model beat the best FFA model 73.4% head-to-head in pure FFA combat. Zone reward forced more engagements (ships near center fight more often), producing richer gradient signal. The best combat model came from training for an objective other than combat.
Each ship archetype gets a unique ability added to the sim and action space. The model trains from scratch with the expanded preset and learns when to use the ability alongside the 14 base combat actions.
| Ship | Ability | Mechanic | 6-ship WR | Usage/match |
|---|---|---|---|---|
| Viper | FocusFire | +25% bullet dmg, +50% speed for 4s | 97.2% | 344 |
| Lurker | MineDash | Dash 3x speed + drop 2 mines | 99.4% | 24 |
| Tempest | ShrapnelBurst | 8 bullets in 60° forward cone | training | |
Usage patterns reveal strategic learning: FocusFire activates 344 times per match (pre-engagement buff), while MineDash fires only 24 times (escape/engage tool, not spam). The models discover ability timing from reward signal alone—no explicit "use ability before fighting" shaping.
Phase 13 was the first model to use all 7 combat abilities meaningfully; prior models relied almost entirely on bullets and bombs.
What changed was the recipe: a strong base model (Phase 12, trained with engage + bullet shaping) plus one round of self-play. The base model learned what to do; self-play taught it when.
Trained models export from JAX (Flax parameters) to ONNX format, which UE5's Neural Network Engine (NNE) runs on CPU at inference time. The bot controller builds the same 96-float observation vector as the training sim and decodes the logit output into ship controls. Ship-specific models use different action presets (42–46 logits) but share the same observation space, so one inference pipeline handles all ship types.
Model size is ~130–140KB (two 128-unit hidden layers). Inference runs in <0.1ms per bot per frame. The game loads the correct model per ship archetype via the ability loadout system—Viper bots use the FocusFire model, Lurker bots use the MineDash model, and generic ships fall back to the base KOTH model.
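The quoted size is consistent with simple parameter arithmetic, assuming fp32 weights and a plain 96 → 128 → 128 → logits MLP (the hidden sizes and logit counts come from the text; everything else here is back-of-the-envelope):

```python
def mlp_bytes(inputs=96, hidden=128, outputs=40, bytes_per_param=4):
    """Parameter count (weights + biases) for a 2-hidden-layer MLP, in bytes."""
    params = (inputs * hidden + hidden) \
           + (hidden * hidden + hidden) \
           + (hidden * outputs + outputs)
    return params * bytes_per_param

print(mlp_bytes(outputs=40) / 1024)   # base preset: ~133 KB
print(mlp_bytes(outputs=46) / 1024)   # largest ship preset: ~136 KB
```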
This pipeline follows good experimental hygiene—single-variable changes, multiple baselines, behavioral metrics beyond win rate—but falls short of academic rigor in several ways worth being explicit about.
Single seed per experiment. Each phase runs once with one random seed. Phase 22 demonstrated seed sensitivity: identical hyperparameters produced 48% vs 37% depending on seed. The academic standard is 20+ seeds with confidence intervals. At 13 hours per run on a single RTX 3090, that's 10+ days per experiment—impractical here. Results should be read as "this seed produced X" rather than "this configuration reliably produces X."
Confidence intervals added late. Phases 11-25 were evaluated without confidence intervals. CIs (Wilson score, 95%) are now computed in the eval script. With 500 matches, the 95% CI is roughly ±3-4%, which means Phase 23 (73%) vs Phase 25 (79%) is likely significant, but Phase 24 (74%) vs Phase 23 (73%) probably isn't. Future phases report CIs; historical phases are point estimates only.
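For reference, the Wilson score interval has a closed form and reproduces the ±3-4% figure at 500 matches. A straightforward implementation (the function name is illustrative, not the eval script's):

```python
from math import sqrt

def wilson_ci(wins, n, z=1.96):
    """95% Wilson score interval for a win rate of wins/n."""
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# 500 matches at a 75% win rate: roughly (0.71, 0.79), i.e. about +/-4%.
print(wilson_ci(375, 500))
```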
Ablation requires from-scratch training. Phase 27 attempted to ablate 9 reward coefficients by fine-tuning from a converged checkpoint (500M steps each). All 9 produced identical behavior—the policy was locked. Post-LR-fix, Phase 32 REDO trained from scratch with pure kill/death reward (all shaping zeroed) and achieved 94.4% immediately—confirming the problem was the LR bug, not reward design.
Learning curves underutilized. Training logs capture entropy, policy loss, and value loss per update, but these aren't systematically analyzed for plateau detection or collapse warnings. Plotting learning curves across phases would give earlier signal on whether a run is worth continuing.
More ship abilities. Three Lattice-faction item abilities are designed (Hex Burst, Cascade Bomb, Stun Mine) and planned for Bastion and Titan. The action space is forward-compatible—reserved slots for faction-specific items avoid retraining existing models.
Fix multi-opponent rollout. The --opponent flag accepts multiple paths but only uses the first (frozen_pool[0]). Fixing this would enable opponent diversity without manual rotation—training against a pool of 3–5 prior generations simultaneously.
Larger models for ship-specific training. Ship-specific models start from scratch and need 4B+ steps to match the generic model's accumulated lineage (6B+ across P32→P38). A 256×256 network or longer training budget would close this gap.