Been fun working on this over the past few weeks with the team.
We wake up every day looking for people who are still unknown to the world but won’t be forever.
The founders we back today will one day have biographies of their own.
@novaholdings
"With that, we reframed multimodal generation as structured text/code generation"
Text is ambiguous but code is not. Would love to see more results in having LLM natively think like its coding.
1/ Our new @reve image model is now #2 on the @arena text-to-image leaderboard — behind only GPT Image 2, ahead of Nano Banana Pro, Microsoft, xAI and everyone else.
And it's a 125 point jump over Reve 1.5 from just 3 months ago.
The research story behind it 🧵👇
Glad to see this -- renderers are a foundational component of the LLM stack. Renderers map between tokens and messages, which are invariant to tokenizer and formatting details. Most APIs, datasets, and RL environments are defined in terms of messages.
Getting the details wrong leads to train-test mismatches, caching inefficiencies, and prompt injection vulnerabilities. We included a renderers module in Tinker Cookbook, but it makes sense as a standalone library.
For the last few months I've been working on a from-scratch implementation of AlphaGo, a 2016 AI breakthrough that inspired me to get into deep learning. My casual understanding of AlphaGo was "search-augmented deep neural networks trained with self-play", but I wanted to go deeper and understand it by creating it.
Frontier deep learning research has always been expensive, but any given capability gets cheaper very quickly. In 2026, you no longer need DeepMind's resources to train a strong Go AI - you can vibe code all of it yourself for just a few thousand dollars of rented compute.
It was a huge honor to be invited to teach this with @dwarkesh_sp on @dwarkeshpodcast
I am an AlphaGo & Go apprentice, not a master, so all factual errors in the podcast are mine.
Web version of tutorial: https://t.co/Xkf9VsgtuT
Code: https://t.co/rWKOwclPDg
Play the go bot here: https://t.co/aVglJXldVX
1. (System design) - The Interaction Models see your screen and collaborates with you live. Here we're building a scalable system architecture together — no copy-pasting, no switching tabs, just thinking out loud and drawing on the screen together.
Codex grew programmatic policies with no neural nets: max score on Breakout, and SOTA-level scores on MuJoCo.
Maybe heuristics were not too weak. Maybe they were just too expensive to maintain. Maybe it's the next paradigm.
https://t.co/1ZaIneleuW
Today, we are thrilled to officially launch RadixArk with $100M in Seed funding at a $400M valuation. The round was led by @Accel and co-led by @sparkcapital.
RadixArk exists to make frontier AI infrastructure open and accessible to everyone. Today, the systems behind the most capable AI models are concentrated in a small number of companies. As a result, most AI teams are forced to rebuild training and inference stacks from scratch, duplicating the same infrastructure work instead of focusing on new models, products, and ideas.
RadixArk was founded to change that. We are building an AI platform that makes it easier for teams to train and serve the best models at scale.
RadixArk comes from the open-source community. We started with SGLang, where many of us are core developers and maintainers, and expanded our work to Miles for large-scale RL and post-training. We will continue contributing to both projects and working with the community to make them the strongest open-source infrastructure foundations for frontier AI.
We would like to thank our long-term partners, contributors, and the broader SGLang community for believing in this mission. We're also grateful to @Accel and @sparkcapital, NVentures (Venture capital arm of @nvidia), Salience Capital, A&E Investment, @HOFCapital, @walden_catalyst, @AMD, LDVP, WTT Fubon Family, @MediaTek, Vocal Ventures, @Sky9Capital and our angel investors @ibab, @LipBuTan1, Hock Tan, @johnschulman2, @soumithchintala, @lilianweng, @oliveur, @Thom_Wolf, @LiamFedus, @robertnishihara, @ericzelikman, @OfficialLoganK, and @multiply_matrix among others.
Thanks for the exclusive interview with @MeghanBobrowsky at @WSJ about our vision.
widespread problem of lack of co-design principles. its labs’ responsibility to close the feedback loop of hardware, model, inference, and harness
meanwhile people who co-design their personal harness and infra will extract disproportionate value
We're Not Wasting Tokens — We're Wasting the Design Margin of the Entire Inference Stack
A few days ago I read a post by Fuli Luo on Twitter, discussing Anthropic's decision to cut off third-party harnesses (OpenClaw) from using Claude subscriptions, and the design thinking behind MiMo's Token Plan pricing. Her core argument: global compute capacity is seriously falling behind the token demand created by agents. The way forward isn't selling tokens cheaper in a race to the bottom — it's the co-evolution of "more efficient agent harnesses" and "more powerful, efficient models."
I read it several times over. People who build inference engines have long been frustrated by how wastefully agent frameworks burn through tokens. She articulated something the industry has tacitly acknowledged but rarely stated plainly — and she did it with precision and restraint: the compute allocation crisis we face today is not fundamentally about insufficient compute. It's about tokens being spent in the wrong places.
I want to push this one layer deeper, from my own perspective.
I'm a heavy user of Claude Code — I make no attempt to hide that. You can check that all the latest code in SGLang Omni was built with Claude Code powering my workflow. Its commercial success is beyond question; it genuinely gave many people (myself included) their first real experience of "coding with an agent." But I'm also an inference engine developer — my day job is figuring out how to push prefix cache hit rates higher, how to make KV cache memory layouts more efficient, how to drive down the cost of every single inference request. So when I plugged Claude Code into a local inference engine and started observing the actual request patterns it generates, my reaction was — how to put it — like a water engineer who spent months designing a conservation system, only to watch someone water their garden with a fire hose.
I measured Claude Code's cache hit rate on my local serving engine over the course of a day. The numbers were painful. This isn't a case of "decent but room to improve." It's a case of "the prefix cache mechanisms we carefully engineered at the inference layer are being almost entirely defeated." Fuli Luo mentioned that OpenClaw's context management is poor — firing off multiple rounds of low-value tool calls within a single user query, each carrying over 100K tokens of context window. Frankly, Claude Code's own context management is nowhere near making proper use of prefix cache or any of the other optimizations we've built into inference engines. Many people have already noticed — for example, the resume feature has a bug that causes KV cache misses entirely, which is borderline absurd. I'll say it plainly: the way sessions construct their context was never seriously designed with cache reuse in mind from the start.
Perhaps Anthropic has internal trade-offs we can't see — after all, they control both ends of the stack, model and inference, and can theoretically do optimizations at the API layer that are invisible to us. But from the external behavior I can observe, enormous volumes of tokens are being spent on: re-transmitting already-processed context, re-parsing already-confirmed tool call results, and maintaining an ever-inflating conversation history with extremely low information density. If this is merely to earn more on inference token charges, I find it genuinely regrettable. But many Claude Code users are on subscriptions — burning more tokens is fundamentally a cost burden for Anthropic, not revenue. I honestly don't understand what purpose such inefficient context management serves for Claude Code.
Here's a bold hypothesis: for those long sessions that consume 700K+ tokens, there is certainly a way to restructure the session's context so it accomplishes the exact same task with 10% of the tokens. Not by sacrificing quality, but through smarter context compression, more rational prefix reuse strategies, and more precise tool call scheduling. This isn't theoretical speculation — anyone who has worked on inference engine optimization, upon seeing current agent framework request patterns, would arrive at a similar conclusion.
Fuli Luo is right: global compute capacity can't keep up with the token demand agents are creating. But I'd add that a significant portion of that gap is an illusion of prosperity — artificial demand manufactured by the crude design of agent frameworks.
Here's an analogy I keep coming back to. I've always liked bringing up RAM bloat — in 1969, 64KB of memory sent Apollo to the moon. In 2026, I open a single webpage and 500MB of memory usage is nothing unusual. Every generation of hardware engineers pushes memory capacity higher, and every generation of software engineers lavishly fills it to the brim. People have gotten used to this cycle, even come to see it as the normal cost of progress.
But LLM inference is different. The cost of RAM bloat is your computer running a bit slower, spending a couple hundred bucks on a memory upgrade — users barely notice. The cost of token bloat is real money — GPU cluster electricity bills, user subscription fees, the industry's entire compute budget. And this cost scales exponentially as agent usage grows. If we don't establish the engineering discipline that "tokens should be used efficiently" in the early days of the agent era, the cost of catching up later, once scale kicks in, will be beyond imagination.
Fuli Luo notes that Anthropic cutting off third-party harness subscription access is objectively forcing these frameworks to improve their context management. I agree with that assessment, but my gut feeling is that this shouldn't stop at "third-party frameworks need to be more frugal with tokens." It should trigger a more fundamental reflection: what kind of agent-inference co-design do we actually need?
Right now, agent frameworks and inference engines are essentially fully decoupled — agent frameworks treat the inference engine as a stateless API, sending the full context with every request. Meanwhile, the inference engine does its best with prefix matching, caching whatever it can. This architecture is simple and general-purpose, but brutally inefficient for long sessions. If agent frameworks could be aware of the inference engine's cache state and proactively construct cache-friendly requests — if inference engines could understand the session semantics of agents and make smarter cache eviction decisions — once that information channel between the two opens up, the potential gains in token efficiency are enormous.
Of course, maybe I'm overthinking this. Maybe the market's ultimate answer is: compute gets cheap enough, waste is fine. Just like the RAM story — in the end, everyone chose "memory is big enough, no need to optimize." But I don't think the token economy will follow the same path, at least not in the near term — because the supply elasticity of GPU compute is far lower than that of DRAM. Under compute constraints, token efficiency isn't a "nice to have" optimization — it's the core competitive advantage that determines who survives.
Most people love hearing "we made the model bigger," "we stretched the context window to a million tokens," "we stacked HBM to new heights" — these narratives are sexy, shareable, fundable. But I seriously believe that "finding ways to reduce the reckless waste of tokens" is a profoundly underestimated direction. This isn't a defensive optimization. It's an offensive capability — whoever first achieves an order-of-magnitude reduction in token consumption at equivalent quality can serve ten times the users on the same compute budget, or deliver ten times the agent depth to a single user.
The agent era doesn't belong to whoever burns the most compute. It belongs to whoever uses it most wisely. This line from Fuli Luo resonates deeply with me. But I want to press further: who gets to define "wisely"? The people building models? The people building inference engines? The people building agent frameworks? I think the answer is — all three must come to the table together. And right now, we're nowhere close.