@karpathy We could just have a thread on the best talks/content you've seen there.
https://t.co/u4OD2VFb4c A 1-hour, absolutely fantastic talk (with some follow-ups at the LLVM dev meetings). I would love to see the potential of LLMs with a chemistry DSL!
Nvidia is proposing a beast of a CPU system for Windows PCs.
It has 128 GB of shared memory and comes with up to 6,144 state-of-the-art CUDA cores.
CPU-wise, the chip has 10 performance cores and 10 efficiency cores. The performance cores are based on the Cortex-X925. These cores appear to have six 128-bit SIMD execution units (SVE2): not as good as recent AMD chips, but better than Apple Silicon (on paper).
The game changer is the unified 128 GB memory. That is the path Apple took years ago. Instead of separate memory for the CPU and GPU, everything shares a single pool. It is increasingly popular.
The memory is not as fast as dedicated GPU memory, but it is cheap enough while delivering enough bandwidth to run AI models locally.
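A rough way to see why bandwidth is the number that matters here: when generating one token at a time, a dense model has to stream roughly all of its weights from memory for each token, so bandwidth divided by model size gives an upper bound on tokens per second. The sketch below assumes that simple model. The 250 GB/s and 70B-parameter figures are placeholders for illustration, not specs of this chip.

```python
def model_size_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def max_decode_tokens_per_second(bandwidth_gb_s: float, size_gb: float) -> float:
    """Upper bound on decode speed if every weight is read once per token.

    Ignores KV-cache traffic, compute limits, and batching, so real
    throughput will be lower than this.
    """
    if bandwidth_gb_s <= 0 or size_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    return bandwidth_gb_s / size_gb


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to make the arithmetic concrete.
    size = model_size_gb(params_billion=70, bits_per_weight=4)
    bound = max_decode_tokens_per_second(bandwidth_gb_s=250, size_gb=size)
    print(f"{size:.1f} GB of weights, at most {bound:.1f} tokens/s")
```

Under these assumptions, capacity decides whether the model fits at all and bandwidth decides how fast it talks, which is why a large but slower shared pool can still be usable for local inference.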
I am not sure how many people will run AI models locally. It still seems like a niche application to me. However, these will make decent machines for playing video games.
It will be interesting to see how Intel and AMD respond. I think that the AVX-512 instructions supported by all recent AMD processors are far superior to the SVE2 instructions of the Cortex-X925. They can eat more data and they are more versatile. But Intel has been shy, thus far, in making them available on consumer systems.
@lucasmeijer Token attention capacity is still limited. An increased context window is not truly increased imo: just look at how it's implemented. Grouped-Query Attention, RoPE scaling, YaRN / CoPhy all theoretically compress information, so the result is in essence lossy. Less is more.
@rfleury AI has been the first thing that truly made me understand that there is something so much deeper in our creations besides the outcome. I 100% agree: even with a similar result, there is something unspoken when you can relate to another person's passion. Even in code.
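To make the mechanics behind that claim concrete, here is a minimal sketch of two of the techniques named above. It assumes plain linear position interpolation for RoPE scaling (YaRN is more elaborate) and the usual KV-cache size formula for Grouped-Query Attention. The function names and the example shapes are made up for illustration. Whether the squeezing is "lossy" in practice is the post's argument; the code only shows what gets squeezed.

```python
def rope_angles(position: float, head_dim: int,
                base: float = 10000.0, scale: float = 1.0) -> list[float]:
    """Rotation angles for one position under RoPE with linear interpolation.

    scale > 1 divides the position, so a longer context is mapped into the
    position range the model saw during training.
    """
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    scaled_pos = position / scale
    return [scaled_pos * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Keys plus values for every layer, KV head, and position."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    # Adjacent tokens differ by 1 position; with scale=4 their fastest
    # rotation differs by a quarter of the unscaled angle.
    step_plain = rope_angles(1, head_dim=64)[0]
    step_scaled = rope_angles(1, head_dim=64, scale=4.0)[0]
    print(step_plain, step_scaled)

    # Hypothetical shapes: 32 query heads sharing 8 KV heads (GQA)
    # versus 32 KV heads (standard multi-head attention).
    mha = kv_cache_bytes(seq_len=8192, n_layers=32, n_kv_heads=32, head_dim=128)
    gqa = kv_cache_bytes(seq_len=8192, n_layers=32, n_kv_heads=8, head_dim=128)
    print(mha // gqa)
```

With interpolation, neighbouring positions sit closer together in angle; with GQA, several query heads read the same keys and values. Both buy longer contexts by sharing or shrinking something.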
one of the very first things i worked on after joining kimi was speeding up KDA's kernels with @yzhang_cs and @uniartisan (i got carried :D). it was super fun optimizing those triton kernels... and now comes FlashKDA, a highly efficient KDA in CUTLASS for the open community!
side note: knowing how to write a kernel matters less and less, but knowing how it actually works efficiently matters as much as ever.
although I rarely write kernels anymore, and instead mostly use kimi k2.6 / opus 4.5-7 to write them (far from optimized ones, simply for the sake of testing for signs of life), for me, those days of trying to make algorithms as hardware-aligned as possible turned out to be special and shaped many intuitions for the architectural designs that followed. (arch and infra are really two sides of the same coin).
would highly recommend reading basic flash/linear attention's triton kernels in FLA (https://t.co/Xf0QRIZdgT) for anyone wanting to better understand how efficient kernels work btw
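For a taste of what those kernels are doing before opening the Triton code, here is a plain-Python sketch of the online-softmax trick at the heart of flash-style attention: process keys and values block by block while carrying a running max and a running sum, so the full score row never has to exist at once. This is not FLA's code. It handles a single query vector, and the names and `block_size` are chosen for illustration.

```python
import math


def attention_reference(q, keys, values):
    """Plain softmax attention for one query (materializes every score)."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attention_streaming(q, keys, values, block_size=2):
    """Same result, computed one block at a time.

    Only one block of scores is alive at any moment, which is what lets a
    real kernel keep its working set in fast on-chip memory.
    """
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim  # unnormalized weighted sum of values
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous max.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

The two functions should agree up to floating-point rounding for any `block_size`; the rescaling step is the part worth staring at.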
100% agree it's a win for Zig, but I disagree about the out-of-distribution generalisation: Zig is fine-tuned for performance "writing", and as a result you get an over-represented, high-quality, performance-focused corpus (Bun, TigerBeetle, etc., and all of the Mike Acton-inspired talks!)
Huge W for Zig, used for inference for K2.6. If you want absolute performance with exacting control over what your CPU executes and the way memory is laid out, Zig is the way.
An awesome thread where @AgileJebrim talks about his custom language, compiler and programming model for GPUs. By restricting certain features/instructions, he is able to guarantee deterministic execution time, making it viable for real-time applications.
@mpweiher I like this one, deleters will be more experienced! But now let's make it more interesting: both teams are juniors with the same experience. Which team do you feel becomes competent faster?
10x compression (32 bytes) looks great on paper, but jumping from 0.034 to 0.117 distortion is a total quality cliff. Johnson-Lindenstrauss lemma: cutting QJL from 128 to 64 bits doesn't just "lose precision", it breaks the ds preservation guarantees. Ideas?
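One way to poke at the bit-budget question is a small simulation. The sketch below uses a SimHash-style sign sketch (one bit per random Gaussian projection) as a stand-in. It is not the QJL estimator itself, and it does not reproduce the distortion numbers above. It only measures how the error of the angle estimate between two random vectors changes with the number of bits. The dimensions, trial count, and seed are arbitrary choices.

```python
import math
import random


def sign_sketch(vec, projections):
    """One bit per random projection: 1 if the dot product is non-negative."""
    return [1 if sum(p_i * v_i for p_i, v_i in zip(p, vec)) >= 0 else 0
            for p in projections]


def estimate_angle(bits_a, bits_b):
    """For Gaussian projections, P[bits disagree] = angle / pi."""
    disagree = sum(a != b for a, b in zip(bits_a, bits_b))
    return math.pi * disagree / len(bits_a)


def rms_angle_error(num_bits, dim=32, trials=200, seed=0):
    """Root-mean-square error of the angle estimate over random vector pairs."""
    rng = random.Random(seed)
    sq_err = 0.0
    for _ in range(trials):
        x = [rng.gauss(0, 1) for _ in range(dim)]
        y = [rng.gauss(0, 1) for _ in range(dim)]
        dot = sum(a * b for a, b in zip(x, y))
        norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
        true_angle = math.acos(max(-1.0, min(1.0, dot / norms)))
        projections = [[rng.gauss(0, 1) for _ in range(dim)]
                       for _ in range(num_bits)]
        est = estimate_angle(sign_sketch(x, projections),
                             sign_sketch(y, projections))
        sq_err += (est - true_angle) ** 2
    return math.sqrt(sq_err / trials)


if __name__ == "__main__":
    for bits in (128, 64):
        print(bits, round(rms_angle_error(bits), 3))
```

For this simple estimator the error should shrink roughly like 1/sqrt(bits), so halving the bits costs about a factor of sqrt(2). A jump much larger than that would suggest something beyond plain sampling noise is going on.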
Trtllmgen kernels are now open. Fastest prefill and decode kernels for our target workloads. We wrote these to win InferenceX, MLPerf, other benchmarks. Powering some of today’s top served models. Dive in, learn, use them, or level up your own. Enjoy.
https://t.co/2aQBwcdnZL
Without getting all the way down to performance counters, GPU power from nvidia-smi is a better indicator of true utilization than job scheduling or “gpu busy”. I would love to see animated “heat maps” of the big data centers, with each pixel being an individual GPU’s power draw.
I am confident that inference and frontier training at the big labs are highly efficient, but I wonder how many GPUs would be dark due to scheduling and inefficient research code.
With a little calibration for base load and peak, just the power bill for the datacenter would be a pretty good first-order indicator of utilization.
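The calibration described above can be sketched as a linear map from power draw to a 0..1 utilization estimate. This is a minimal version that assumes utilization scales linearly between an idle draw and a peak draw. The idle and peak wattages below are made-up placeholders that would need measuring per GPU model. The parser expects one reading per line, as produced by `nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits`.

```python
def utilization_from_power(power_w: float, idle_w: float, peak_w: float) -> float:
    """Linear estimate: 0.0 at idle draw, 1.0 at peak draw, clamped to [0, 1]."""
    if peak_w <= idle_w:
        raise ValueError("peak_w must be greater than idle_w")
    frac = (power_w - idle_w) / (peak_w - idle_w)
    return min(1.0, max(0.0, frac))


def parse_power_readings(text: str) -> list[float]:
    """Parse one power value (watts) per line, skipping blanks and non-numeric lines."""
    readings = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            readings.append(float(line))
        except ValueError:
            continue  # e.g. a GPU that does not report power
    return readings


def fleet_utilization(readings: list[float], idle_w: float, peak_w: float) -> float:
    """Mean of the per-GPU estimates; 0.0 for an empty fleet."""
    if not readings:
        return 0.0
    return sum(utilization_from_power(p, idle_w, peak_w) for p in readings) / len(readings)


if __name__ == "__main__":
    sample = "95.20\n688.40\n[N/A]\n402.75\n"  # hypothetical readings
    watts = parse_power_readings(sample)
    # Hypothetical calibration: 100 W idle, 700 W peak.
    print([round(utilization_from_power(w, 100, 700), 2) for w in watts])
    print(round(fleet_utilization(watts, 100, 700), 2))
```

The per-GPU values are exactly what each pixel of the "heat map" would show; the fleet average is the power-bill view.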
Released WinDbg MCP: attach Claude (or any LLM) to a live Windows process and let it poke around. Set breakpoints, read memory, walk the stack, load crash dumps. 55 tools over MCP.
https://t.co/Hw2qqEKw4k