Kimbo @kimbochen - Twitter Profile

about 22 hours ago

What’s with this MAI model? Everyone and their grandmas are making 200-tweet threads on MAI. My whole feed is just snippets of the MAI paper lmao

4

37

0

1

4K

kimbochen retweeted

snow

@snowclipsed

1 day ago

what an awesome paper. time to read, will be sharing thoughts below.

1

58

5

31

7K

kimbochen retweeted

Vishal

@KyrieBlunders

1 day ago

Spent a month writing CuteDSL kernels for an RL training loop. The fused decode-attention kernel I wrote benchmarked 2.2x faster than the SDPA path it replaces. Dropped it into HF generate. The decode step got 3x slower. [Metrics and Explanation below]

7

82

7

49

7K

Kimbo

@kimbochen

1 day ago

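Me singing along to @tuki_music_ Bansanka chorus
Me: 何十回 (dozens of times)
Tuki.: 何百回 (hundreds of times)
Me: 何百回 (hundreds of times)
Tuki.: 何千回 (thousands of times)
Me: 何千回 (thousands of times)
Tuki.: 何万回 (tens of thousands of times)
Me: Damn it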

0

1

0

129

Kimbo

@kimbochen

2 days ago

@tenderizzation lmao When are you inventing a new RL algo called PeePeePooPO

0

3

0

168

Kimbo

@kimbochen

2 days ago

Tender doing a Karpathy-esque coding project? No way *Sees the pee pee poo poo repo name* Ah, that’s the Tender I know

tender

@tenderizzation

2 days ago

<100% human-generated slop alert>

I don't actually know anything about RL, so inspired by @ericjang11's podcast with Dwarkesh I was curious to see what it would take to crudely implement a version of PPO for connect 4 (a mathematically solved game but still interesting IMO).

The only rule I gave myself was that I was not allowed to vibe-code, only to "inverse" vibe-code. Meaning I only allowed myself to ask LLMs stack overflow type questions to check my intuition and understanding from reading the PPO paper. I would have to write my own code. In this sense the LLM was prompting me to write code, hence "inverse" vibe-coding.

The shape of these questions looked like e.g., "it doesn't make sense to record an autograd graph during rollout right since we don't actually backprop the rollouts" and "this value is frozen after rollout right?"

I used Gemini 3.5 Flash this whole time and it's excellent for questions on well-established research. For example, some parts of the original PPO paper would have been tedious to implement for my purposes, so checking with Gemini that e.g., I could ignore the discounting of advantage estimates saved me a lot of time.

I started with a scaled-down version that plays tic-tac-toe to sanity check my implementation. Once I got a reliable setup that yielded a model that would tie every game, I moved on to connect 4. Currently a convnet with a few hundred thousand parameters trained on 32 million moves can play a casual game of connect 4 well (blocking obvious moves) but is straightforward to defeat for a skilled human.

I was pleasantly surprised by the state of CPU-only "deep" learning. A single-threaded implementation on my 5600X achieves ~1600 moves/s so a small training run is easily doable in half a day or so.

Overall I'm very bullish on this way of learning things as inverse vibe-coding gives me a very fast feedback loop but doesn't make me feel like I'm outsourcing my thinking.

12

112

1

35

7K

1

10

0

1

2K

kimbochen retweeted

SemiAnalysis

@SemiAnalysis_

2 days ago

Your RL training efficiency is only as good as your sandbox infra. Check out what Modal does to keep your rollouts rolling!

1

75

7

26

21K

kimbochen retweeted

Perplexity

@perplexity_ai

2 days ago

Introducing Search as Code, our new search architecture for AI agents.

It writes Python that calls our search stack directly, instead of looping through function calls one at a time.

Available in the Perplexity Agent API, and now default in Computer.

https://t.co/ut6GGWQTVO

147

2K

189

1K

542K

Kimbo

@kimbochen

2 days ago

@fabknowledge @ grok is Fab talking about me Explain to me like a five year old

0

1

0

378

Kimbo

@kimbochen

2 days ago

@LLMenjoyer Where do you find those unsettling videos lol

1

5

0

328

kimbochen retweeted

Luke J. Huang

@whatthelukh

2 days ago

New blog! Is frontier asynchronous RL solved?

The blog covers Async RL theory and infrastructure, surveying 8 open-weight frontier labs for the algorithmic techniques and systems fixes to handle train-inference mismatch. Also answered: why do current methods still fail at high policy lag? Which methods scale with horizon and compute?

16

1K

131

2K

230K

kimbochen retweeted

MiniMax (official)

@MiniMax_AI

3 days ago

Introducing MiniMax M3: The First Open-Weights Model to Combine Three Frontier Capabilities

- Coding & Agentic Frontier: 59.0% SWE-Bench Pro, 66.0% Terminal Bench 2.1, 34.8% SWE-fficiency, 28.8% KernelBench Hard, 74.2% MCP Atlas
- MiniMax Sparse Attention scales context to 1M
- Natively Multimodal from Step Zero

API: https://t.co/fHRdSV7BwZ
Token Plan: https://t.co/BDCycxepZw
🚀New! MiniMax Code: https://t.co/GvB4YiB6Ul

Weights & Tech Report in ~10 Days

528

8K

1K

3K

3M

kimbochen retweeted

Nan Jiang

@nanjiangwill

4 days ago

At @modal, we're working to make sure OSS RL frameworks have all the techniques necessary to train frontier open-weights models. Delta compression is key, but the job's not done. There are still lots of open problems around weight sync, auto-scaling, & cross-cluster training. My DMs are open!

7

234

20

104

54K

kimbochen retweeted

Kyle Kranen

@KranenKyle

4 days ago

About a month ago I posted about ongoing work on datacenter scale inference simulation. People seemed to like it so we wrote more about it! Check out this awesome blog post from the Dynamo team!

3

22

3

2K

kimbochen retweeted

Shuyao Tim Xu

@TimXu222575

15 days ago

This is very intuitive and nicely executed. System reminder/system prompt distillation is very popular in industry, but it is mostly done in an off-policy fashion. In that setup, we write scripts to "rejection sample" out hint leakage. In the end, it is a comparison between on-policy and off-policy with rejection sampling. Which is better?

1

4

2

5

1K

Kimbo

@kimbochen

4 days ago

@nabla_theta Sent!

0

32

kimbochen retweeted

Fuli Luo

@_LuoFuli

5 days ago

Inference Optimizations Behind the MiMo-V2.5 Series API Price Reductions

Read the full technical blog: https://t.co/B5tp4tdnim

The V2.5 model family, including MiMo-V2.5 and MiMo-V2.5-Pro, is built on a Hybrid Sliding Window Attention (Hybrid SWA) architecture, which compresses KVCache storage to roughly 1/7 that of Full Attention. However, architectural advantages rarely translate directly into measurable gains in production serving.

To realize these gains, we redesigned KVCache management, tiered caching, and the prefix-cache tree; addressed key challenges in SWA KVCache handling; and optimized scheduling as well as the Prefill/Decode pipeline.

Validated on real production traffic, these optimizations have increased effective KVCache capacity by nearly 5x, with server-side cache hit rates averaging 93%–95% across mainstream harness frameworks. Together with MoE configuration tuning and multimodal inference optimizations, they enable more efficient long-context inference and form part of what makes the recent API price cuts possible.

51

931

105

406

128K

Kimbo

@kimbochen

6 days ago

@tenderizzation @bilaltwovec

0

19

0

3

802

Kimbo

@kimbochen

6 days ago

@simon_mo_ @charles_irl Was Inferconnect on the list?

1

0

172

kimbochen retweeted

OpenBMB

@OpenBMB

6 days ago

Excited to see some of our OPD research ideas land in the @verl_project ecosystem. 🎉

A few weeks ago, we released our work on On-Policy Distillation (OPD): Rethinking On-Policy Distillation of Large Language Models. In that paper, we introduced several diagnostic signals to understand why OPD succeeds, or mysteriously fails, at the token level. Today, those diagnostics have been merged into the verl training framework.

🔥 The new metrics track:
🔷 top-k overlap between student and teacher token distributions.
🔷 overlap token advantage during distillation.
🔷 alignment dynamics on high-probability teacher tokens.

These signals came directly from our OPD analysis: successful distillation is driven by overlapping high-probability token regions, while non-overlapping regions contribute almost no effective optimization signal.

🔥 What’s especially exciting is that this is no longer just a research observation: the same ideas have already been used in MiniCPM5 training. Research → tooling → real models.

Big thanks to the verl community for the collaboration and merge! 🤝

📄 OPD Paper: https://t.co/KEMpOQ77ev
💻 Code Repo: https://t.co/aVzzlkf8mN
🎯 PR: https://t.co/9OsuFcYxZs

1

24

8

10

2K
