λndres Mariscal

@SerialDev

Wrote anti-cheat ML, do ML/AI at places you know of and probably use && into graphics||compilers||DBs. I like tech, sloths, and chihuahuas.

Helsinki

Joined July 2015

3K Following

336 Followers

3.1K Posts

Pinned Tweet

λndres Mariscal

@SerialDev

over 1 year ago

@karpathy We could just have a thread on the best talks/content you've seen there. https://t.co/u4OD2VFb4c 1 hr, absolutely fantastic talk (some follow-ups in the LLVM dev meetings). I would love the potential of LLMs with a chemistry DSL!

348

SerialDev retweeted

Daniel Lemire

@lemire

1 day ago

Nvidia is proposing a beast of a CPU system for Windows PCs. It has 128 GB of shared memory and comes with up to 6,144 state-of-the-art CUDA cores. CPU wise, the chip has 10 performance cores and 10 efficiency cores. The performance cores are based on the Cortex-X925. These chips appear to support six 128-bit SIMD execution units (SVE2), not as good as recent AMD chips, but better than Apple Silicon (on paper). The game changer is the unified 128 GB memory. That is the path Apple took years ago. Instead of separate memory for the CPU and GPU, everything shares a single pool. It is increasingly popular. The memory is not as fast as dedicated GPU memory, but it is cheap enough while delivering enough bandwidth to run AI models locally. I am not sure how many people will run AI models locally. It still seems like a niche application to me. However, it will make decent machines to play video games. It will be interesting to see how Intel and AMD respond. I think that the AVX-512 instructions supported by all recent AMD processors are far superior to the SVE2 instructions of the Cortex-X925. They can eat more data and they are more versatile. But Intel has been shy, thus far, in making it available on customer systems.

145

15K

λndres Mariscal

@SerialDev

7 days ago

@Jonathan_Blow Arm is involved in the tweet sequence too, so I would expect unified memory and a built-in TPU or something AI-centric

λndres Mariscal

@SerialDev

9 days ago

@stake_jevens @luminal_ai Location agnostic?

113

19 days ago

@lucasmeijer Token attention capacity is still limited; an increased context window is not truly increased imo. Just look at how it's implemented: Grouped-Query Attention, RoPE Scaling, YARN / CoPhy theoretically do compress information, so it is in essence lossy. Less is more

λndres Mariscal

@SerialDev

22 days ago

@rfleury AI has been the first thing that truly made me understand that there is something so much deeper in our creations besides the outcome. I 100% agree: even with a similar result, there is something unspoken when you can relate to another person's passion. Even in code

272

λndres Mariscal

@SerialDev

about 1 month ago

@lucasmeijer @badlogicgames Godbolt is a regular tab for so many of us lol

SerialDev retweeted

nathan chen

@nathancgy4

about 2 months ago

one of the very first things i worked on after joining kimi was speeding up KDA's kernels with @yzhang_cs and @uniartisan (i got carried :D). it was super fun optimizing those triton kernels... and now comes FlashKDA, a highly efficient KDA in CUTLASS for the open community! side note: knowing how to write a kernel matters less and less, but knowing how it actually works efficiently matters as much as ever. although I rarely write kernels anymore, and instead mostly use kimi k2.6 / opus 4.5-7 to write them (far from optimized ones, simply for the sake of testing for signs of life), for me, those days of trying to make algorithms as hardware-aligned as possible turned out to be special and shaped many intuitions for the architectural designs that followed. (arch and infra are really two sides of the same coin). would highly recommend reading basic flash/linear attention's triton kernels in FLA (https://t.co/Xf0QRIZdgT) for anyone wanting to better understand how efficient kernels work btw

376

181

48K

λndres Mariscal

@SerialDev

about 2 months ago

100% agree it's a win for Zig, but disagree with the out-of-distribution generalisation. Zig is fine-tuned for performance "writing", and as a result you get an over-represented, high-quality, performance-focused corpus (Bun, TigerBeetle, etc., and all of the Mike Acton-inspired talks!)

Mitchell Hashimoto

@mitchellh

about 2 months ago

Huge W for Zig, used for inference for K2.6. If you want absolute performance with exacting control over what your CPU executes and the way memory is laid out, Zig is the way.

321

178K

128

λndres Mariscal

@SerialDev

about 2 months ago

@anselmlevskaya SPIR-V and shader langs are underrated too, my time in the gaming industry is wildly underappreciated in ML circles still.

273

SerialDev retweeted

tetsuo.cpp (no slop)

@tetsuo_cpp

about 2 months ago

An awesome thread where @AgileJebrim talks about his custom language, compiler and programming model for GPUs. By restricting certain features/instructions, he is able to guarantee deterministic execution time, making it viable for real-time applications.

λndres Mariscal

@SerialDev

about 2 months ago

@mpweiher I like this one, deleters will be more experienced! But now let's make it more interesting: both teams are juniors with the same experience, which team do you feel becomes more competent faster?

λndres Mariscal

@SerialDev

about 2 months ago

@HSVSphere Quantisation without notice lol, it's actively+measurably better at non-US hours

417

λndres Mariscal

@SerialDev

about 2 months ago

10x compression (32 bytes) looks great on paper, but jumping from 0.034 to 0.117 distortion is a total quality cliff. Johnson-Lindenstrauss lemma: cutting QJL from 128 to 64 bits doesn't just "lose precision", it breaks the ds preservation guarantees. Ideas?
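One way to get a feel for the bit-budget point in the tweet above is a 1-bit sign random projection (SimHash-style). This is a minimal sketch under stated assumptions, not the QJL code the tweet refers to: the function names, the dimension, and the trial count are all hypothetical. Each random plane separates two vectors with probability angle/π, so the angle estimate from m bits has standard deviation π·sqrt(p(1−p)/m), and halving m from 128 to 64 inflates it by about √2.

```python
import math
import random


def make_planes(bits, dim, rng):
    """Draw `bits` random Gaussian directions in `dim` dimensions."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def sign_sketch(vec, planes):
    """Keep only the sign of each random projection: one bit per plane."""
    return [sum(p * v for p, v in zip(plane, vec)) >= 0.0 for plane in planes]


def estimated_angle(bits_a, bits_b):
    """Estimate the angle between two vectors from their sign sketches."""
    mismatches = sum(a != b for a, b in zip(bits_a, bits_b))
    return math.pi * mismatches / len(bits_a)


def true_angle(u, v):
    """Exact angle between u and v, for comparison with the estimate."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def mean_abs_error(bits, dim=32, trials=200, seed=0):
    """Average |estimated - true| angle over random vector pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        planes = make_planes(bits, dim, rng)
        estimate = estimated_angle(sign_sketch(u, planes), sign_sketch(v, planes))
        total += abs(estimate - true_angle(u, v))
    return total / trials


if __name__ == "__main__":
    for bits in (128, 64):
        print(bits, "bits -> mean abs angle error:", round(mean_abs_error(bits), 3))
```

Running it prints the average angular error for both bit budgets, which makes the trade-off concrete without relying on any particular library.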

λndres Mariscal

@SerialDev

about 2 months ago

🙃

https://t.co/WBzqEpw8MV

SerialDev retweeted

Alex Zhurkevich

@cudagdb

2 months ago

Trtllmgen kernels are now open. Fastest prefill and decode kernels for our target workloads. We wrote these to win InferenceX, MLPerf, other benchmarks. Powering some of today’s top served models. Dive in, learn, use them, or level up your own. Enjoy. https://t.co/2aQBwcdnZL

334

260

148K

SerialDev retweeted

Jeremie Pelletier

@HostOfMeta

2 months ago

@ThePrimeagen Strudel REPL but for gamedev; that was proof of concept cljs; going for the moon no re-star'. https://t.co/V3etlQBfwD

151

SerialDev retweeted

Jeet Desai

@Jeet2505

2 months ago

Amazing! 🙌@TigerBeetleDB https://t.co/u1kl2S0Q83

SerialDev retweeted

TigerBeetle

@TigerBeetleDB

2 months ago

IronBeetle⚡️ Ep 105 Zig's comptime is A W E S O M E for CLI argument parsing https://t.co/3WeAjomAtr

SerialDev retweeted

John Carmack

@ID_AA_Carmack

2 months ago

Without getting all the way down to performance counters, GPU power from nvidia-smi is a better indicator of true utilization than job scheduling or “gpu busy”. I would love to see animated “heat maps” of the big data centers, with each pixel being an individual GPU’s power draw. I am confident that inference and frontier training at the big labs is highly efficient, but I wonder how many GPUs would be dark due to scheduling and inefficient research code. With a little calibration for base load and peak, just the power bill for the datacenter would be a pretty good first order indicator of utilization.

177

181K
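The calibration idea in the tweet above can be sketched in a few lines. This is an illustrative sketch only: the function names and the idle/peak wattages are hypothetical, and the readings are assumed to come from something like `nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits`. Each reading is mapped linearly between a calibrated base load and peak, then averaged across GPUs.

```python
def power_utilization(draw_w, idle_w, peak_w):
    """Map one power reading (watts) onto [0, 1] between idle and peak draw."""
    if peak_w <= idle_w:
        raise ValueError("peak_w must exceed idle_w")
    fraction = (draw_w - idle_w) / (peak_w - idle_w)
    # Clamp so sensor noise below idle or above peak stays in range.
    return min(1.0, max(0.0, fraction))


def fleet_utilization(draws_w, idle_w, peak_w):
    """Average the per-GPU estimates: a first-order utilization figure."""
    if not draws_w:
        raise ValueError("need at least one power reading")
    per_gpu = [power_utilization(d, idle_w, peak_w) for d in draws_w]
    return sum(per_gpu) / len(per_gpu)


if __name__ == "__main__":
    # Hypothetical calibration: 60 W base load, 700 W peak per GPU.
    readings = [60.0, 380.0, 700.0, 65.0]
    print(fleet_utilization(readings, idle_w=60.0, peak_w=700.0))
```

The per-GPU values are also what a heat map like the one described would plot, one pixel per GPU.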

SerialDev retweeted

gengstah

@_gengstah

2 months ago

Released WinDbg MCP — attach Claude (or any LLM) to a live Windows process and let it poke around. set breakpoints, read memory, walk the stack, load crash dumps. 55 tools over MCP. https://t.co/Hw2qqEKw4k

271

183

14K
