PersisGod 🍁

@PersisGod

If you think adding a 😂 emoji to your words will cover your inner anger and bitterness, you’re an idiot.

Toronto, Ontario

Joined February 2021

453 Following

89 Followers

2.8K Posts

PersisGod 🍁 @PersisGod

about 8 hours ago

@TrollFootball If it was next level you didn’t need to highlight it

210

PersisGod 🍁 @PersisGod

1 day ago

@theo Have not beaten my record yet, landed 16 PRs in a day

223

PersisGod 🍁 @PersisGod

1 day ago

Asked Claude to fix a CI issue for two days. Two attempts on Codex resolved it.

PersisGod retweeted

Kai Yang

@ChihYang04

3 days ago

We got GitHub for electronics before GTA6

637

419K

PersisGod retweeted

Lior Alexander

@LiorOnAI

3 days ago

Large language models spontaneously develop the same specialized brain regions humans have for language, math, physics, and social reasoning. No one designed this. It just emerged. Two completely different optimization processes (biological evolution vs. gradient descent) independently arrived at the same solution.

290

252

30K

PersisGod retweeted

Jackywine

@Jackywine

27 days ago

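Awesome! While digging into some PDF papers today, I found that minerU, the open-source project I recommended before, now has a client app. I recommend everyone try it; it really is impressive. Toss in a PDF or similar and it comes straight out as cleanly formatted Markdown. What more could you ask for? https://t.co/GMf9pnYpsx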

121

PersisGod retweeted

Andrzej Białecki @Kaszanas

3 days ago · Łomianki

@GoogleResearch @tunguz are we past XGBoost at this point? 😅🤔

PersisGod 🍁 @PersisGod

5 days ago

@RezaeianRamin So proud of you

PersisGod retweeted

Kevin H. Zhao

@ggkhzhao

6 days ago

(1/N) autoresearch 🤝 weather forecasting - a thread 🧵 Can an automatic research loop improve a real weather dynamical core by making physics-informed changes? TBH we weren’t expecting much, but the early results were surprising enough for us to share:

180

163

121K

PersisGod retweeted

Zhaoran Wang

@zhaoran_wang

6 days ago

autoresearch makes ODE/PDE great again without the neural part : ) no backprop, just evolution

293

237

45K

PersisGod retweeted

Avi Chawla

@_avichawla

6 days ago

A tricky LLM interview question: You're serving a reasoning model on vLLM, and it keeps running out of GPU memory on long traces. So you add KV cache compression and evict 90% of the cached tokens. VRAM usage stays as is and GPU still runs out of memory. Why? (answer below)

Evicting 90% of the KV cache can free almost none of the memory it was using. This sounds counterintuitive, but it follows directly from how production servers store the cache today.

The KV cache grows with every token a model generates. Each token appends its key and value vectors across every layer, and nothing is freed while generation continues. This is the dominant memory cost for reasoning models. If a 32K-token CoT caches ~32K tokens of KV vectors, a Qwen3-32B with 4-bit weights will run out-of-memory around 24K tokens on a 24GB GPU.

One obvious solution is to keep the important tokens and drop the rest, since attention is sparse enough to allow it. But this does not solve the memory problem yet.

The reason is paged attention, which is the memory manager behind vLLM and most production servers. Under the hood, it splits GPU memory into fixed physical blocks, each one holds the KV for about 16 tokens. This block returns to the allocator only when every slot inside it is empty. Since the eviction logic selects tokens by importance, and such tokens are scattered across blocks, almost every block is left with at least some survivor tokens despite eviction. For instance, if the logic evicts 14k of 16k tokens across 1,000 blocks, most likely every block will still have a token. This means the allocator frees almost nothing.

Placing the new tokens into those freed slots is not ideal because it breaks the cache's layout. Say token 16,001 arrives, and it's placed in the slot the 40th token used to hold. The cache now reads position 38, then 16,001, then 41, so the cache is no longer in token order. Attention can still compute the right answer from that, but only if every slot now carries a separate note recording which position it actually holds. This introduces another bookkeeping cost that an in-order layout inherently avoids.

So the cache is logically 90% smaller and still physically the same size. Many compression results miss this because they measure on pre-allocated contiguous tensors rather than a paged server.

There's another problem. Eviction methods pick which tokens to keep by looking at the attention scores themselves (as expected). But fast attention kernels used in production, like FlashAttention, never save those scores. They compute attention in small pieces and throw the full score grid away as they go, which is also why they're fast. So the exact signal eviction methods need isn't available in memory. The workaround is to fall back to eager attention and build the full matrix, which gives up the speed FlashAttention was there to provide.

NVIDIA published a method called TriAttention to solve both these problems. It never needs attention scores. Instead, it scores tokens from the geometry of the model's key and query vectors before RoPE is applied, where those vectors sit in stable clusters. For the memory problem, it runs a compaction pass every 128 decoded tokens. The surviving tokens slide forward to close the holes eviction creates, so whole blocks empty out and return to the allocator while the cache stays in token order. On long reasoning traces, the approach matches full-attention accuracy while decoding 2.5x faster and using 10.7x less KV memory.

KV cache compression is a big infrastructure problem. The number that decides whether it works is the count of freed blocks, not the count of evicted tokens.

You can find the NVIDIA write-up here: https://t.co/ZwXv7VezVu

I wrote a first-principles breakdown of how the KV cache works. It walks through why the model stores keys and values at all, why the cache grows with every token, and a comparison of LLM generation speed with and without KV caching. Read it below.

310

262K
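
The block-level argument in the post above can be checked with a small simulation. This is only a sketch under stated assumptions: eviction is modeled as a uniformly random choice of survivors (real importance-based eviction may cluster differently), the block size of 16 comes from the post, and the function names are made up for illustration. None of this is vLLM or TriAttention code.

```python
import random

BLOCK_SIZE = 16  # tokens per physical block, the figure quoted in the post


def blocks_in_use(kept_positions, block_size=BLOCK_SIZE):
    """Count physical blocks that still hold at least one surviving token."""
    return len({pos // block_size for pos in kept_positions})


def evict_scattered(num_tokens, keep_fraction, seed=0):
    """Keep a random subset of positions.

    Assumption: uniform random survivors stand in for importance-based
    eviction, whose survivors are scattered across blocks.
    """
    rng = random.Random(seed)
    keep = int(num_tokens * keep_fraction)
    return sorted(rng.sample(range(num_tokens), keep))


def blocks_after_compaction(num_kept, block_size=BLOCK_SIZE):
    """Blocks needed once survivors slide forward into contiguous slots."""
    return -(-num_kept // block_size)  # ceiling division


num_tokens = 16_000
total_blocks = num_tokens // BLOCK_SIZE
kept = evict_scattered(num_tokens, keep_fraction=0.10)

before = blocks_in_use(kept)                # blocks the allocator cannot reclaim
after = blocks_after_compaction(len(kept))  # blocks needed after compaction

print(f"total blocks:           {total_blocks}")
print(f"tokens kept:            {len(kept)}")
print(f"blocks still occupied:  {before}")
print(f"blocks after compaction: {after}")
```

Under this random model, most blocks keep at least one survivor even though 90% of tokens are gone, while compaction shrinks the footprint to roughly the 10% the eviction promised. The gap between the two block counts is the quantity the post says actually matters.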

PersisGod 🍁 @PersisGod

11 days ago

@elaraxo0 @grok Is breast feeding children count as body count?

PersisGod retweeted

Sakana AI

@SakanaAILabs

11 days ago

Introducing Sakana Fugu: A full multi-agent orchestration system accessible via a single model API. Our ‘Fugu Ultra’ model matches the performance of Fable and Mythos, delivering frontier capability without the risk of export controls. Try it: https://t.co/hhO6qTawgb 🐡

38K

30K

26M

PersisGod 🍁 @PersisGod

13 days ago

Machine learning or machine studying

368

PersisGod retweeted

Akshay 🚀

@akshay_pachaar

14 days ago

Karpathy's prediction about RL is coming true now! He called reward functions unreliable and argued that a single reward number is too low-dimensional to teach an agent what "good" means for complex tasks. To solve this, Agents need a knowledge-guided review as a higher-dimensional feedback channel.

Every major AI lab trains models with RL today (OpenAI, Anthropic, DeepSeek). And their key bottleneck has always been the reward functions. GRPO by DeepSeek worked well for math and code because the environment gave a binary signal. But for real agent tasks, someone still has to hand-code the scoring function. That takes days and breaks every time the pipeline changes.

RULER (implemented in OpenPipe ART, 10k stars) addresses the exact problem Karpathy identified. The reward criteria are defined in plain English, and an LLM evaluates each trajectory against that description to provide feedback for training.

I trained a Qwen3 1.4B agent that plays 2048 using GRPO with this exact workflow. In this case, the agent saw the board, picked a direction, and RULER evaluated the outcome, all from this natural language definition. You can see the full implementation on GitHub and try it yourself.

Here's the ART Repo: https://t.co/XeTppNyX9p (don't forget to star it ⭐ )

Just like RLHF replaced manual rankings and GRPO replaced the critic model, natural language rewards are replacing hand-coded scoring functions. RL reward engineering is now prompt engineering.

I wrote a full walkthrough on OpenPipe's ART, the agent RL trainer built on GRPO, including how RULER replaces manual reward engineering with automatic LLM-graded rewards. The article is quoted below.

173

217K
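
The workflow described in the post above (a judge scores each trajectory in a group, and GRPO turns those scores into training signal) can be sketched in a few lines. This is a minimal illustration, not OpenPipe ART's API: `stub_judge` is a hypothetical stand-in for the LLM judge that scores a 2048 trajectory by its highest tile, and the advantage uses the commonly described group-relative form (reward minus group mean, divided by group standard deviation), assuming population standard deviation.

```python
import math
from statistics import mean, pstdev


def stub_judge(trajectories):
    """Hypothetical stand-in for an LLM judge.

    A real judge would read a plain-English rubric and each trajectory;
    here the score is log2(max tile) / 11, so reaching 2048 scores 1.0.
    """
    return [math.log2(t["max_tile"]) / 11 for t in trajectories]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four made-up rollouts of the same 2048 starting board.
group = [
    {"max_tile": 64},
    {"max_tile": 256},
    {"max_tile": 256},
    {"max_tile": 1024},
]

rewards = stub_judge(group)
advantages = group_relative_advantages(rewards)

for traj, r, a in zip(group, rewards, advantages):
    print(f"max_tile={traj['max_tile']:>5}  reward={r:.3f}  advantage={a:+.3f}")
```

Because the advantages are relative to the group, only the judge's ranking and spacing of trajectories within a group matter, not the absolute scale of its scores.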

PersisGod 🍁 @PersisGod

16 days ago

@i_thiink_so 🤣🤣

PersisGod 🍁 @PersisGod

16 days ago

@NolliesCapital @cohere Same

PersisGod retweeted

Superman

@thesupermanmx

17 days ago

MIT just open-sourced a model that could end the $150/hour CAD industry.

It’s called GenCAD. It converts photos into fully editable CAD programs. Just upload a sketch or photo and it generates the full parametric 3D model.

100% Open Source. https://t.co/mA5PsBfDGO

172

60K

PersisGod 🍁 @PersisGod

19 days ago

@Darky1k NLD: 2 JPN:1

PersisGod retweeted

alphaXiv

@askalphaxiv

24 days ago

“FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention”

Long-context LLMs are bottlenecked by the KV cache, because every old token keeps consuming GPU memory even when most of it is irrelevant. This paper turns long context into retrieval.

It uses a small Memory Indexer to predict which old KV chunks the model will need soon, keeps only those on the GPU, and leaves the rest offloaded.

This yields a 13.5% average KV cache footprint, up to 90% memory reduction at 500K context, with slightly better accuracy than DS V4 Flash.

217

137
