Karan Brar @deepmatmul - Twitter Profile

Karpathy is basically just a fun to watch celebrity but hasn’t done any meaningful work in a long time and hasn’t made valuable predictions about anything

23

268

3

94

74K

1

2

0

412

deepmatmul retweeted

Mike Vernal

@mvernal

2 days ago

https://t.co/n7pVVA3uXu

60

928

62

2K

592K

deepmatmul retweeted

Sepp Hochreiter @HochreiterSepp

3 days ago

Reasoning in Memory (RiM) enables latent reasoning without the need to "think out loud." By reasoning directly within a dedicated latent workspace—working memory—no overhead of generating explicit reasoning tokens. Dramatically faster inference with the same quality of reasoning.

4

346

39

288

44K

Karan Brar

@deepmatmul

3 days ago

@jxnlco @hu_yifei I've converted the MLEs, we're all codex stans now

0

4

0

72

deepmatmul retweeted

Trajectory

@trajectorylabs

5 days ago

🏹5 Days of Trajectory. Day 3 - An Open Source Training Stack for Continual Learning Building the platform for continual learning requires both partnering with pioneering AI companies, as we showed on Day 2 with Harvey, and working toward frontier research, which we are highlighting today. Continual learning means models that improve hourly from real production use. But with the size of frontier models, this becomes quite difficult. A Qwen-397b would need to spin up and tear down repeatedly across six GPU nodes, and that's valuable time gone. Our contribution is Continual LoRA (C-LoRA): many lightweight adapters running at once on one shared base model. Our insight centers on where the parallelism lives: instead of splitting one giant job across nodes, we load-balance many small jobs over a single base. The result: 2.81x experiment throughput over single-tenant training, with no regression on rewards. We built this together, with @anyscalecompute, @NovaSkyAI, and generous support from @GoogleCloud and @GoogleStartups. We've open-sourced on SkyRL as one of the first multi-LoRA, RL training platforms, so that every team can get to continual learning faster. We’re very excited to see what you build, please reach out!

trajectorylabs's tweet photo. 🏹5 Days of Trajectory.

Day 3 - An Open Source Training Stack for Continual Learning

Building the platform for continual learning requires both partnering with pioneering AI companies, as we showed on Day 2 with Harvey, and working toward frontier research, which we are highlighting today.

Continual learning means models that improve hourly from real production use. But with the size of frontier models, this becomes quite difficult. A Qwen-397b would need to spin up and tear down repeatedly across six GPU nodes, and that's valuable time gone.

Our contribution is Continual LoRA (C-LoRA): many lightweight adapters running at once on one shared base model. Our insight centers on where the parallelism lives: instead of splitting one giant job across nodes, we load-balance many small jobs over a single base.

The result: 2.81x experiment throughput over single-tenant training, with no regression on rewards.

We built this together, with @anyscalecompute, @NovaSkyAI, and generous support from @GoogleCloud and @GoogleStartups. We've open-sourced on SkyRL as one of the first multi-LoRA, RL training platforms, so that every team can get to continual learning faster.

We’re very excited to see what you build, please reach out!

11

512

61

394

92K

deepmatmul retweeted

Q

@q_yeon_gyu_kim

6 days ago

so far, here is my review of opus 4.8: gpt 5.5 medium is a good model

36

2K

43

66

175K

deepmatmul retweeted

恒星

@vintcessun

5 days ago

这篇论文终于把为什么AI学东西比人慢的原因讲透了：问题不在数据量，而在学习目标。它从样本复杂度理论出发，证明预测自身的隐表示（latent）比预测原始token在数据效率上有指数级优势——PCFG数据上，token级SSL需要Ω(exp(L))样本，latent预测仅需O(log L)。这首次从理论上解释了data2vec、JEPA等隐空间方法为何高效，也暗示了H-JEPA那种显式多尺度堆叠可能是冗余的。不过理论局限在组合结构数据，对无结构或非层次数据仍需验证。 https://t.co/RgcdRUAstg

20

1K

146

1K

89K

deepmatmul retweeted

λux

@novasarc01

6 days ago

i’m increasingly convinced that the best agent evals will come from mining real agent failure traces. my view is that every failed trace contains a potential eval but not in its raw form. raw traces are messy, long and too specific. the research problem is to distill them into clean reproducible tests. the pipeline i’m interested in is (which i'm currently working on): failure trace → failure attribution → earliest divergence point → minimal reproducible state → targeted eval → regression suite this turns trace data from passive observability into an active improvement loop. like can we extract the exact decision point where the agent should have behaved differently? and can we convert that into an eval that catches the same failure class in the future? i guess this matters because most agent failures are trajectory-level failures and not just output-level failures. personally i think this is much more realistic than relying only on hand-written benchmarks (imo they should look more like failure memory systems). hand-written evals encode what we think agents will fail on. traces encode what agents actually failed on. also once you have the mechanism, you can mutate the trace into variants. that is basically fuzzing for agents.

24

298

23

370

55K

deepmatmul retweeted

John Schulman

@johnschulman2

6 days ago

Glad to see this -- renderers are a foundational component of the LLM stack. Renderers map between tokens and messages, which are invariant to tokenizer and formatting details. Most APIs, datasets, and RL environments are defined in terms of messages. Getting the details wrong leads to train-test mismatches, caching inefficiencies, and prompt injection vulnerabilities. We included a renderers module in Tinker Cookbook, but it makes sense as a standalone library.

14

661

56

380

74K

deepmatmul retweeted

Vikram @msharmavikram

7 days ago

Incredibly proud of the latest work on Dynamo snapshot where we reduced the startup time under 5s. Yes. You heard that right! (1/2)

5

87

6

25

18K

deepmatmul retweeted

Siva Gurumurthy

@sgurumur

7 days ago

Our RL + post training results. Thanks @baseten for collab : - overlapping confidence intervals between frontier model and a open source model was the least expected - With right harness, a 27B model with an iterative SFT can do legit end-to-end legal tasks (in some cases) - specialized post-training can push the cost/latency curve in a very different direction - some implicit learnings: models tend to make fewer grep calls, more full-document reading, more synthesis, more self-correction on legal work. Big takeaway for me is that agent performance is starting to look more like a systems + memory problem, not just a raw model scale problem.

2

105

5

104

15K

deepmatmul retweeted

Atakan Tekparmak

@AtakanTekparmak

8 days ago

Good day for the gpu poor

2

94

5

39

11K

Karan Brar

@deepmatmul

7 days ago

@hu_yifei slay king

0

1

0

374

deepmatmul retweeted

will brown

@willccbb

8 days ago

@billxbf very cool, absolutely the way it should be done! i do wanna flag that we've had support for a similar proxy pattern in prime-rl + verifiers for quite some time to support BYO harnesses, and this citation is a bit mistaken :) https://t.co/eBIhXRkaaw

willccbb's tweet photo. @billxbf very cool, absolutely the way it should be done!

i do wanna flag that we've had support for a similar proxy pattern in prime-rl + verifiers for quite some time to support BYO harnesses, and this citation is a bit mistaken :) https://t.co/eBIhXRkaaw https://t.co/sEudW8huAF

2

15

2

15

4K

deepmatmul retweeted

AVB

@neural_avb

8 days ago

Deep learning bros and sisters, don't sleep on this. You can cluster millions of documents in embedding space, mass-annotate them, visualize them... basically for free and within seconds.

21

2K

136

2K

168K

deepmatmul retweeted

Tiberiu Mușat

@Tiberiu_Musat_

8 days ago

Why does deep learning generalize? What does weight decay really do? Can algorithmic information theory address these questions? In my latest preprint, I give a proof that the minimum neural weight norm matches the minimum program length (aka Kolmogorov Complexity), up to a logarithmic factor. In other words, the neural network with the smallest possible weight norm (that fits the data) must encode the shortest program (that fits the data). The result only holds for fixed-precision neural nets: infinite precision nets can store infinite information with finite (small) weights. https://t.co/eMZIGQDf2f

Tiberiu_Musat_'s tweet photo. Why does deep learning generalize? What does weight decay really do? Can algorithmic information theory address these questions?

In my latest preprint, I give a proof that the minimum neural weight norm matches the minimum program length (aka Kolmogorov Complexity), up to a logarithmic factor. In other words, the neural network with the smallest possible weight norm (that fits the data) must encode the shortest program (that fits the data).

The result only holds for fixed-precision neural nets: infinite precision nets can store infinite information with finite (small) weights.

https://t.co/eMZIGQDf2f

29

1K

152

1K

143K

deepmatmul retweeted

Alok Bishoyi

@alokbishoyi97

8 days ago

https://t.co/domNWDD6GA

21

956

91

2K

100K

Karan Brar

@deepmatmul

7 days ago

Incredibly important problem, especially now that models are so capable. Exposing the right abstractions to make things easy without getting in their way is essential!

will brown

@willccbb

7 days ago

a fun yet challenging part of designing ML tooling for broad audiences is deciding which things people are allowed to change you want max power, min complexity, and an opinionated path to success so you gotta find recipes that generalize across models and domains. fun problem

11

109

4

20

7K

0

7

0

181

Karan Brar

@deepmatmul

Last Seen Users on Sotwe

Trends for you

Most Popular Users