Jakub Piotr Cłapa

@jpclap

I love to learn new things and use them to solve problems. Currently at @hume_ai. Previously built WhisperSpeech, an open-source AI TTS model, at @Collabora.

Lodz, Poland

Joined February 2014

422 Following

241 Followers

1.7K Posts

jpclap retweeted

Christopher Potts

@ChrisGPotts

5 days ago

We take for granted that larger models are better than smaller ones, but why is this so? Our new paper, led by Jing Huang and @EkdeepL, traces this to a data-induced competition for resources (neurons), using formal analysis, idealized tasks, and real pretraining.

https://t.co/vqRUUe6whP

913

136

838

134K

jpclap retweeted

Didier 'Dirac's ghost' Gaulin

@DiracGhost

7 days ago

Check out this absolute masterpiece by Jonathan Gratus titled "A Pictorial Introduction to Differential Geometry, Leading to Maxwell Equations in 3 Pictures", which is available on arXiv.

To quote the author: "When I was young, somewhere around 12, I was given a book on relativity, gravitation and cosmology. Being dyslexic I found reading the text torturous. However I really enjoyed the pictures."

It's a short primer, full of nice figures, perfect for those who love visual examples.

821

101

45K

jpclap retweeted

Alessandro Favero

@alesfav

9 days ago

AI needs vastly more data than we do. One idea might close the gap: don't predict raw signals (tokens), predict your own abstract latent representation (JEPA, data2vec). With @DanKorchinski @MatthieuWyart, on a toy model, we prove how much that helps: the gap is exponential. 🧵

https://t.co/I51Q6Jwiqr

518

466

51K

jpclap retweeted

Aritra

@ariG23498

8 days ago

For more than 6 months (on and off) I have been trying to get up to speed with GPU/TPU kernel development. IMHO, profiling should be the starting point for learning this topic. You profile, you question, you look for answers, and in the process you read and absorb.

I set out on a journey to do just that. I began profiling gemma4 and was quickly humbled by the amount of information at my disposal: the profiler table with huge GEMM names, the profiler trace with too many CPU rows.

To make my life easier, I stepped back and profiled a basic matrix multiplication and addition operation (the weights and bias interaction, as one might see it). The profiler artifacts were simple enough to reason about and think through.

In this blog post, I document my journey and in the process uncover how one should profile and what one should look at! I hope this gives beginners (like me) a starting point for their kernel development and optimization journey.

PS: This is a big blog post; bookmark it and come back to it when you have the time (good weekend read?)

380

411

46K


Jakub Piotr Cłapa

@jpclap

9 days ago

@halvarflake @__paleologo I learned a ton from every book recommendation from @apenwarr. He also made a very useful observation about sticking to reading about ideas from original sources: https://t.co/LCb8VloBvT

jpclap retweeted

hardmaru

@hardmaru

10 days ago

For over a decade, we’ve accepted that end-to-end backprop is the only way to train deep networks. But holding the entire network in memory all at once is why AI training is hitting a resource wall. We found a new way to break the network into blocks and train them independently. The trick? Treating the network’s forward pass like a diffusion model denoising a signal. This reinterpretation slashes the memory needed to train deep models. In our #ICLR2026 paper (https://t.co/PK5h0mqQSo), we matched end-to-end performance across ViTs, DiTs, and LLMs. We did this while training just one isolated block at a time.

153

643

737K

Jakub Piotr Cłapa

@jpclap

13 days ago

@p_mbanugo @davidcrawshaw @hashbreaker I don’t know of a good overview but there is documentation here: https://t.co/LPeLsSEcb9 Example: https://t.co/YA3eqKSJRZ

jpclap retweeted

Yulu Gan

@yule_gan

22 days ago

A fun experiment comparing a random step with one gradient step: With a small CNN on CIFAR-10, a random step is basically a disaster. (A gradient step is a ~185σ event.) That makes sense if you expect a random direction in R^d to be ~sqrt(d) standard deviations worse than the optimal one. So scaling up to a larger model should make things even worse. But with a 7B model (test on GSM8k), random steps have a good chance of outperforming a gradient step. (The gradient norm of one PPO update is 1.94, while the L2 norm of the Gaussian perturbation is 85.6. The figure below rescales the Gaussian perturbation to match the PPO update norm, so the random step and gradient step have the same radius.) We should really rethink the parameter-function map.

137

110

14K
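The rescaling mentioned in the post above (shrinking the Gaussian perturbation so that it has the same L2 norm as the PPO update) can be sketched roughly as follows. This is a minimal illustration and not the author's code: the dimension `dim`, the seed, and the helper names `l2_norm` and `rescale_to_norm` are assumptions made here; only the target norm of 1.94 comes from the post.

```python
import math
import random


def l2_norm(vec):
    """Euclidean (L2) norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def rescale_to_norm(vec, target_norm):
    """Return a vector with the same direction as vec and the given L2 norm."""
    norm = l2_norm(vec)
    if norm == 0.0:
        raise ValueError("cannot rescale a zero vector")
    scale = target_norm / norm
    return [x * scale for x in vec]


# Hypothetical setup: a small parameter space stands in for the real model.
rng = random.Random(0)
dim = 1000
gaussian_step = [rng.gauss(0.0, 1.0) for _ in range(dim)]

# Match the radius of the random step to the reported PPO update norm.
matched_step = rescale_to_norm(gaussian_step, 1.94)
print(l2_norm(gaussian_step), l2_norm(matched_step))
```

After rescaling, the random step and the gradient step lie on the same sphere around the current parameters, so any difference in outcome comes from direction alone.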

jpclap retweeted

David Crawshaw

@davidcrawshaw

29 days ago

For those who like gory details, the slowness of ethernet is not in the physical link layer: you can find reports of specialized systems (specialized NIC, kernel-bypass, no switch, etc.) sending a one-way UDP packet in 1.1μs. But ethernet has kernel syscalls, socket buffers, SKB allocs, routing lookups, driver queues, often a copy, then NIC packet batching, switch store-and-forward paths, then all the same kernel overhead on the other side. That gets you halfway. Then you do it all again for the response!

NVMe has no packet loss. No scheduling. Just a memory-mapped ring buffer queue consumer on the far end. It is a fundamentally simpler system. It does less and goes faster.

Whereas talking to a multi-zone S3 system means routing packets to another computer, with its own kernel and complexities, in a different building through more than one switch. You need all of that complexity built into ethernet systems to make it work.

If you want something in the middle, RDMA systems can do more than NVMe and pay less overhead than traditional TCP over Ethernet. But you are probably not doing RDMA between buildings. Bytes are faster if they can be stored locally.

202

29K

Jakub Piotr Cłapa

@jpclap

30 days ago

@eliguerron @steveruizok Maybe try making it so dragging controls position directly (like holding and dragging the space bar to move the cursor; maybe nonlinear if you move faster?) instead of velocity. Velocity is super sensitive to latency and human reaction time…

521

jpclap retweeted

John Crickett

@johncrickett

about 1 month ago

"Nobody reviews compiler output, why review AI code?" Wrong. We do review compiler output. Godbolt exists. Disassemblers exist. Anyone doing serious performance work reads what the compiler produced. The premise is false. But the analogy itself is flawed. It compares two things that aren't comparable. A compiler takes a formal language as input. Languages with grammars and semantics defined precisely enough that "what does this code mean" has only one answer. An LLM takes natural language as input. Natural languages are ambiguous. "Write me a function that handles user input safely" has a thousand valid interpretations and a thousand more invalid ones. The LLM picks one. You don't know which. Unless you look at the code. Compilers are built from specifications and designed to meet them. The output is the result of a defined translation. When the output violates the spec, it's a bug. LLMs are built from whatever was in their training data. There is no spec. There can't be one, natural languages have no defined semantics that map to code. Compilers are semantically deterministic. The same input produces output with the same behaviour, every time. LLMs are not. Partly by design and partly due to hardware variance, batch size, inference order, and floating point operations (and no setting temperature to zero does not address those). All of which can push the same prompt to produce different code. Compilers complain loudly when the input is nonsensical. LLMs fail silently, producing plausible-looking, but wrong code. We trust compiler output because the trust was earned across decades of use, with millions of engineers using the same tools. Early compilers were reviewed heavily. Hand-written assembly was the default because trust hadn't been earned yet. We're at the hand-written assembly stage with AI. We may never get to the trust-the-output stage for the reasons explained above. If you’re a software developer, you should own what goes to production. 
The compiler analogy is a way of skipping that responsibility.


151

227

51K
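The point above about floating point operations and inference order can be illustrated with a tiny example: floating point addition is not associative, so summing the same numbers in a different order can give a different result. This is only a sketch of the underlying arithmetic effect, not a reproduction of any LLM inference stack; the variable names are chosen here for illustration.

```python
# The same three numbers, added in two different groupings.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right)                   # 0.6000000000000001
print(right_to_left)                   # 0.6
print(left_to_right == right_to_left)  # False
```

A parallel reduction that changes its grouping with batch size or hardware scheduling is exposed to the same effect, and a tiny difference in a logit can be enough to change which token is picked.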

jpclap retweeted

Lawrence Chan

@justanotherlaw

about 1 month ago

A recent viral paper claims to reverse-engineer the parameter counts of frontier models: GPT-5.5 = 9.7T, Opus 4.7 = 4.0T, o1 = 3.5T, etc. @ben_sturgeon and I investigated and found serious issues in the paper; fixing them gives GPT-5.5 as ~1.5T (90% CI: 256B-8.3T).

https://t.co/MbWQyVlmsE

953

416

210K

Jakub Piotr Cłapa

@jpclap

about 1 month ago

@Jonathan_Blow @etscrivner A nice counter-example was camera APIs on Linux and Android. The simple interface is a pointer to contiguous physical memory with some metadata. The bad interface is a set of special opaque API endpoints for passing camera images between each of ISP, CPU, DSP and GPU.

Jakub Piotr Cłapa

@jpclap

about 1 month ago

@Jonathan_Blow @etscrivner The reason it worked quite well is that it was a reusable API with well-understood constraints and invariants. It was not about seek/read/write specifically, although Plan 9 pushed in this direction. The same design made REST work quite well.

256

Jakub Piotr Cłapa

@jpclap

about 1 month ago

@Jonathan_Blow @etscrivner I think you are comparing excellent API design with poor one-off file formats. Most software design is bad so I’d always take a random static data format over a random API. Databases and standard file formats are a nice middle ground where you’re not held hostage by shitty code.

327

Jakub Piotr Cłapa

@jpclap

about 1 month ago

@nicbarkeragain There was a very simple but fun tree-based algorithm that makes this work for rows with arbitrary (but known) heights: https://t.co/gYZ8k9HxcB and https://t.co/JL0ERxyHde
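The linked pages are not reproduced here, so the following is not necessarily the algorithm they describe. As an illustration of one tree-based approach to rows with arbitrary but known heights, and assuming the goal is to map a vertical offset to a row index, a Fenwick (binary indexed) tree over the heights gives logarithmic-time lookups and updates. The class name `RowHeights` and its methods are hypothetical.

```python
class RowHeights:
    """Fenwick tree over row heights (illustrative sketch)."""

    def __init__(self, heights):
        self.n = len(heights)
        self.heights = list(heights)
        self.tree = [0] * (self.n + 1)  # 1-based internal array
        for i, h in enumerate(heights):
            self._add(i, h)

    def _add(self, index, delta):
        i = index + 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def set_height(self, index, height):
        """Change the height of one row in O(log n)."""
        self._add(index, height - self.heights[index])
        self.heights[index] = height

    def offset_of(self, index):
        """Top y of a row: the sum of the heights of all rows before it."""
        total = 0
        i = index
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def row_at(self, y):
        """Index of the row containing offset y (returns n if y is past the end)."""
        pos = 0
        step = 1 << self.n.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] <= y:
                pos = nxt
                y -= self.tree[nxt]
            step >>= 1
        return pos


rows = RowHeights([10, 20, 30, 40])
print(rows.offset_of(3))  # 60
print(rows.row_at(59))    # 2
rows.set_height(1, 5)
print(rows.row_at(14))    # 1
```

The design choice is that both questions a scrolling view asks ("where does row i start?" and "which row is at offset y?") stay O(log n) even when individual heights change.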

jpclap retweeted

Zhengyang Geng

@ZhengyangGeng

about 1 month ago

It is kind of funny to still be defending both E2E learning and representation learning in 2026, as if we were back in 2015 or 2019. lol The wild part is that this also looks like magic. People tried to optimize these kinds of metrics in the GAN era, and very often the results just collapsed. Now it works, not because the idea suddenly became obvious, but because enough pieces finally line up: representations, generators, optimization, scale, and the pipeline. Maybe this is the deep learning bitter lesson again: if it is not mathematically impossible, deep learning might eventually make it happen.

609

jpclap retweeted

Jiawei Yang

@JiaweiYang118

about 1 month ago

Also sharing the perspectives from Zhengyang. We had countless fruitful discussions during this project. At its center is the question of how we should view representation in 2026.

jpclap retweeted

Jiawei Yang

@JiaweiYang118

about 1 month ago

Two months ago, I vaguely posted a number: 0.9 FID, one-step, pixel space. Now it is 0.75, and can be even lower. Many wonder how. I thought it might end as a small FID prank: simple and deliberate. It started with one question: can FID be optimized directly, and what does it reveal? Introducing FD-loss.


926

156

592

215K
