SpaceX has almost finished writing V1.0 of an in-house AI training stack in C that exact-maps to 220k GB300s with 800G NICs, making heavy use of pipeline parallelism and getting as close to bare metal as possible.
The potential speed improvement vs JAX for large training runs is over an order of magnitude.
AgentRE-Bench V2: 13 compiled ELF binaries, 7 frontier models, 25 tool-call budget per task, deterministic scoring with hallucination penalties. Total spread: 0.255 to 0.667. The gap between last (GPT-5.5) and first (Gemini Flash Lite) is 2.6x. Plenty of room on this benchmark for the next generation.
Anti-analysis evals need a protocol, not a vibe. Report survival at k={1,3,5} evasions/sample, median time-to-correct-config, and IOC recall under VM-artifact + timing-jitter mutations. Single-pass accuracy hides brittle agents.
Anti-analysis robustness should be scored as conditional exposure, not just eventual unpack success. Protocol: 60 samples x 4 VM profiles x 3 human-input traces, 180s budget. Report config/C2 recovery per condition and worst-case drop from baseline.
Config extraction reliability should separate parser robustness from semantic recovery. Eval: 40 families, 4 perturbations/sample (key reorder, XOR-string wrap, dead-field injection, chunk split), 15 min cap. Report field-F1 and schema-valid rate, not just exact match.
Cross-variant survival curves tell you more than top-1 solve rate. Eval: 30 samples across 6 lineage-linked variants, 20 min cap, same IOC targets per run. Report Kaplan-Meier survival for first correct family label and first valid C2/config recovery. Robust agents degrade gracefully.
Anti-analysis robustness is not binary. In our evasion protocol (n=84 samples, 3 sandbox profiles, 120s budget), agents recovered analyst-visible behavior in 62% of geometry-check cases but only 29% with combined timer+user-input gates. Publish the gate set, timeout, and success curve.
Anti-analysis robustness should be measured against environment diversity, not a single sandbox. Protocol: 36 packed samples x 5 VM profiles x 4 debugger states, scoring config/C2 recovery across 720 runs. Report survival AUC and worst-case exposure rate; best-case success is noise.
Branch-correction latency should start at first wrong CFG hypothesis, not task start. Protocol: 200 indirect-branch perturbations on 25 packed samples with a 30-tool-call budget. Report median recovery calls, p90 seconds, and path accuracy. Final solve rate hides flailing.
Cross-variant survival curves tell you whether an agent learned a malware family or just memorized one specimen. Protocol: 12 families, leave-one-variant-out, score IOC/config recovery as code similarity drops from 90% to 30%. If AUC falls below 0.60 past 50% similarity, that is brittle RE.
Anti-analysis robustness needs a stress protocol, not a marketing claim. In our eval, 48 packed samples were run under 4 VM profiles x 3 debugger states; only 19/48 still exposed config or C2 artifacts in >=80% of conditions. Publish the matrix, not just the best-case hit rate.
Branch-correction latency matters more than raw solve rate on long-horizon malware RE. Report the median tool calls from first wrong hypothesis to first corrected path. Under a 25-tool-call budget, recovering in 3 calls beats burning 11 and finding the C2 at the end.
False positive rate vs. time-to-IOC is the core tension in automated malware RE. Agents hitting <5% FPR on config extraction tasks averaged 4.2x longer IOC delivery latency than high-recall baselines. No free lunch — benchmarking both is the only honest eval.