We’re teaming up with @Palmeiras, the first football club to meaningfully build upon TacticAI: our AI system that can help simulate field scenarios and predict open play dynamics up to 8 seconds in advance. ⚽
this model is the opposite of mythos.
It's small, cost-effective, Apache 2.0, and locally deployable. This is the way LLMs should go.
small, open source, transparent and sovereign
vs
large, expensive, proprietary and hegemonic
In case you're curious about why dynamic workflows are so powerful and the future, read the RLM paper! Opus 4.8 + dynamic workflows in Claude Code is perhaps the first instance of a frontier model seriously trained to be an RLM.
I suspect within a year they'll just become the standard for nearly all coding agent interactions.
It’s funny… every AI startup deck claims a data moat. 5% actually have one. Would your data be impossible to replicate even if a competitor raised $500M tomorrow? If yes, cool, you have a business.
We've integrated the Luminal compiler on Positron AI chips.
Our first major non-GPU compiler target is Positron Atlas, a bandwidth-focused inference accelerator.
A fun 48-hour run of letting an RLM iteratively build the interface for an RLM to play Pokemon Red (sneak peek of some fun things cooking at @PrimeIntellect😄). The interface-generating RLM was just tasked with getting the RLM (same scaffold) to beat the game in under 5 hours wall-clock time.
I originally expected the RLM to design some components used in Gemini Plays Pokemon like an extra map, an interface to parse the screen, etc., design low-level policies that would run fast on the emulator, and also design a good prompt and strategy around the RLM to use sub-agents to explore game state with checkpointing, use RNG manipulation in its favor, etc.
Instead the RLM eventually just decided to give the RLM a `write_memory` tool, which the RLM player decided to use to 1) warp the player immediately to the Elite 4; 2) give itself a level 100 Mewtwo (which it mistakes for a Ponyta due to weird Pokedex ID vs. internal ID); 3) give itself $999999; 4) give itself all 8 badges by setting the right flag. It then went ahead and destroyed the Elite 4 and Blue and beat the game in record time :p
You'll also notice in the video there's weird backtracking and frame-skipping. This happens because it did also incorporate the strategy of launching sub-agents to explore action trajectories, but had a strange way of saving and recording the frames (so you see the result of several sub-agent explorations).
We'll have some more funny and cool RLM demos soon, but it's cool to see RLMs work as general-purpose agents (both the coding agent that designs the interface and the game-playing agent itself)!
Install ntn, the Notion CLI.
It brings the entire Notion API to your terminal, plus everything you need to build and deploy Workers. Built for humans and coding agents alike.
Install with: curl -fsSL https://t.co/2dJqE3YHvw | bash
In a regular setting, every agent recomputes the same prefix and holds a GPU slot while waiting on tool calls.
BatchAgent fixes this: warm the prefix once, coalesce duplicate tool calls, release GPU slots during tool waits.
works with SGLang, vLLM, and Dynamo. 2/2
github: https://t.co/J8YEqc4FXK
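A rough sketch of the "coalesce duplicate tool calls" idea, for the curious. This is not BatchAgent's API: `ToolCallCoalescer`, `fake_tool`, and the `read_file` call are hypothetical names made up for illustration, and it assumes tool arguments are hashable. Identical calls that arrive while one is already in flight share that one result.

```python
import asyncio


class ToolCallCoalescer:
    """Share one in-flight result among identical concurrent tool calls.

    Hypothetical helper for illustration only, not part of BatchAgent.
    """

    def __init__(self, tool):
        self._tool = tool      # async callable: (name, args) -> result
        self._inflight = {}    # (name, sorted args) -> asyncio.Task
        self.executions = 0    # how many times the real tool actually ran

    async def call(self, name, args):
        # Assumes argument values are hashable (strings, numbers, tuples).
        key = (name, tuple(sorted(args.items())))
        task = self._inflight.get(key)
        if task is None:
            # First caller for this key starts the real tool call.
            self.executions += 1
            task = asyncio.create_task(self._run(key, name, args))
            self._inflight[key] = task
        # Every caller, first or duplicate, waits on the same task.
        return await task

    async def _run(self, key, name, args):
        try:
            return await self._tool(name, args)
        finally:
            # Drop the entry so later calls trigger a fresh execution.
            self._inflight.pop(key, None)


async def fake_tool(name, args):
    # Stand-in for a slow tool (file read, web fetch, shell command).
    await asyncio.sleep(0.01)
    return f"{name}:{sorted(args.items())}"


async def demo():
    coalescer = ToolCallCoalescer(fake_tool)
    calls = [coalescer.call("read_file", {"path": "README.md"}) for _ in range(100)]
    results = await asyncio.gather(*calls)
    return results, coalescer.executions


results, executions = asyncio.run(demo())
print(f"{len(results)} agents served by {executions} tool execution(s)")
```

Here 100 simulated agents ask for the same file at once and the underlying tool runs a single time. A real system would also need to decide which tools are safe to deduplicate (reads, not writes).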
built BatchAgent: a Python SDK for running many agents against one shared inference backend.
100 parallel OpenCode sessions on H100 + SGLang: 573s → 191s wall-clock, 1.28M → 50K prefill tokens, 96% less compute.
1/2
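Quick sanity check on the numbers quoted above, assuming the "96% less compute" figure refers to the drop in prefill tokens (the post does not say so explicitly). Variable names are just for illustration.

```python
# Figures quoted in the post: 100 parallel OpenCode sessions on H100 + SGLang.
wall_before_s, wall_after_s = 573, 191
prefill_before, prefill_after = 1_280_000, 50_000

# Wall-clock speedup and the fraction of wall-clock time saved.
speedup = wall_before_s / wall_after_s
wall_reduction = 1 - wall_after_s / wall_before_s

# Fraction of prefill tokens no longer computed.
prefill_reduction = 1 - prefill_after / prefill_before

print(f"wall-clock speedup: {speedup:.2f}x")
print(f"wall-clock time saved: {wall_reduction:.1%}")
print(f"prefill tokens saved: {prefill_reduction:.1%}")
```

So the wall-clock figure works out to a 3x speedup, and the prefill token reduction is about 96%, which lines up with the quoted compute saving under that assumption.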
(4/5) One thing we’ve built is a “kittens” virtual machine that takes over the whole GPU and allows new kinds of co-optimization. We can go past the traditional sequential kernel model – for example, fusing entire training runs into a single kernel and even weirder stuff.
We’ve partnered with @AMD, @Broadcom, @Intel, @Microsoft, and @NVIDIA to release Multipath Reliable Connection (MRC), a new open networking protocol that helps large AI training clusters run faster and more reliably, with less wasted GPU time.
https://t.co/AiV952AJXs
I had a good time visiting CMU a couple of weeks back, but I think the highlight was lecturing about the stuff we're doing with OxCaml at Hype for Types, a student-organized PL class at CMU.
Really excellent work by the inference team to serve this model so efficiently!
To a significant degree, we have to become an AI inference company now.
The two results from this are speed and quality. Since INT4 weights load 4x fewer bytes from HBM than fp16 weights, ExQ is 20-27% faster than SGLang's default fp16 serving at production batch sizes.
As for quality: by keeping hot experts at higher precision, ExQ recovers more than half the quality you lose going to INT4, at the same memory cost as uniform INT4.
3/4
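Where the "4x fewer bytes" comes from, as a back-of-the-envelope sketch. The 1B parameter count is a made-up example, and this ignores quantization scales and zero-points, which add a little overhead in practice. It says nothing about how ExQ itself picks precisions.

```python
def weight_bytes(n_params: int, bits_per_weight: int) -> float:
    """Bytes needed to store n_params weights at the given bit width."""
    return n_params * bits_per_weight / 8


n_params = 1_000_000_000  # hypothetical model size, for illustration only

fp16_bytes = weight_bytes(n_params, 16)  # 2 bytes per weight
int4_bytes = weight_bytes(n_params, 4)   # 0.5 bytes per weight
ratio = fp16_bytes / int4_bytes

print(f"fp16: {fp16_bytes / 1e9:.1f} GB, int4: {int4_bytes / 1e9:.1f} GB, ratio: {ratio:.0f}x")
```

The end-to-end gain quoted above is well below 4x; one plausible reading is that weight loading is only part of each decoding step, so shrinking it speeds up the whole step by less.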