Crazy Monkey

@ethcat12

A Passionate Gamer, Cat lover and Web 3.0 enthusiast.

web 3.0

Joined June 2018

61 Following

18 Followers

980 Posts

ethcat12 retweeted

Grigory Sapunov

@che_shr_cat

about 23 hours ago

1/ A 5M-parameter model just beat frontier LLMs on hard logical puzzles at less than 1/100,000th of the inference cost. How? By scaling test-time compute in continuous latent space rather than discrete token space. Let's unpack how this works. 🧵

https://t.co/fYZIyEFqpI

646

664

26K

ethcat12 retweeted

Amiri Hayes @amirihayes_

1 day ago

What if attention were code? We show that many attention heads in transformer LMs can be replaced by human-readable Python programs. Swap them in and the model barely notices. See our experiments here: Explaining Attention with Program Synthesis [https://t.co/tkFopEYtaV]

799

686

118K

ethcat12 retweeted

Pengrui Han (Barry)

@pengrui_han

about 4 hours ago

The human brain is strikingly modular, with distinct networks for language, formal reasoning, social reasoning, and physical reasoning. Is this a fundamental principle of how intelligent systems are built, or an accident of biological evolution?

In our latest preprint, we find that a similar modular organization emerges in Large Language Models, another class of intelligent system.

Brains and LLMs are shaped by entirely different kinds of optimization (biological evolution vs. gradient descent). That they arrive at the same modular design anyway suggests modularity may be a fundamental property of intelligent systems.

🌐 Web: https://t.co/ZKrnTSSuSf
📄 Paper: https://t.co/ZibBXz3PUy
💻 Code & data: https://t.co/uBo5iOYNjy

Using circuit analyses across 46 tasks spanning four cognitive domains, we find:

1️⃣ Tasks that draw on the same network in humans recruit overlapping units in LLMs, while tasks drawing on different networks recruit distinct units.

2️⃣ These units are causally linked to model behavior. Ablating the units critical for one domain impairs performance in that domain (−26% accuracy) but barely touches the others (−2.5%).

This project has been in the works for a while :) Huge thanks to my advisors @jacobandreas @ev_fedorenko @devarda_a, and to @Nancy_Kanwisher for valuable conceptual input and feedback throughout. #MIT

368

364

24K

ethcat12 retweeted

ellington

@not_ellington

2 days ago

Probably 10x better than any of the eduslopppp bullshit you'll find in the 15 min threads with 2k bookmarks that have been put out in the past year truth be told. The best way to learn will always be to just sit down and read and reread and reread again https://t.co/Rz14zg9OdL

138

148K

ethcat12 retweeted

himanshu @retr0sushi_

13 days ago

looped transformer -> hyper-looped transformer -> looped world model ??

ethcat12 retweeted

deep Manifold

@BetaTomorrow

14 days ago

Sorry, I just saw this, @taskinfatih. Thank you @aimalysheva for asking... even though I might disappoint you.

I may be missing part of the intended argument, but I would be cautious about interpreting this as evidence for a universal language manifold shared by humans and LLMs.

“King” and “queen” are semantic objects observed at the physical-cover level. Their apparent vector relation does not necessarily reveal a pre-existing universal manifold underneath. Their representations are produced by an iterated integral through the network, with the input text acting as the boundary condition.

A reasoning trace is therefore better understood as a computation-generated pathway through stacked representations, rather than simply a curve moving along a fixed low-dimensional semantic surface. The Linear Representation Hypothesis identifies locally consistent relations, such as the direction from “king” to “queen,” but this alone does not establish one universal language manifold.

This may be why the idea is so difficult to understand: everything is dynamic. The input changes the boundary condition, each layer changes the representation, and the reasoning path is created during computation. We keep looking for one fixed manifold, but the geometry itself is moving.

Hope this helps....

please see ** Single Token Geometry 06: Stacked Piecewise Manifold **
https://t.co/tWkscGX2F5

844

ethcat12 retweeted

Sasha Malysheva

@aimalysheva

14 days ago

I'm fairly convinced there's some universal language manifold (= a surface formed by meaning vectors) that both humans and LLMs operate on. But we don't train LLMs to explicitly represent this manifold. We rather train them to approximate it, and to move along it by building curves on it. And those curves are reasoning in geometric terms, like a reasoning trace is a curve on a low-dimensional manifold embedded in a very high-dimensional space. The Linear Representation Hypothesis (https://t.co/2p3HZEGhX0) touches this, but I wonder if there's more recent work that takes the manifold idea further? Would love to see takes from people with serious differential geometry backgrounds on this!

104

562

442

39K

ethcat12 retweeted

Yoonho Lee

@yoonholeee

22 days ago

https://t.co/jCgH0doXCQ

410

601

119K

ethcat12 retweeted

Charlie O'Neill

@oneill_c

20 days ago

10/ There are a few rough edges: it's not lossless, iterative compaction isn't free extrapolation (training horizon matters a lot), and exact needle retrieval is still hard. But we've made many improvements to the architecture and training process since we wrote this paper, and are excited to share these soon. Regardless, the core result holds: amortization makes long-context compaction tractable. Huge shoutout to @alexsandomirsky, @part_harry_, @mudithj, @maxkirkby and the rest of the Baseten research team for their work on this! Paper: https://t.co/akNs8gAIFg

ethcat12 retweeted

Charlie O'Neill

@oneill_c

20 days ago

1/ You can shrink a language model's KV cache by 200×, in a single forward pass, and it still answers correctly. At 256k context that's 36 GiB of cache down to ~360 MiB, with no change to the base model. Here's how we did it 👇

https://t.co/He1ucvxGyf

976

102

957

113K

ethcat12 retweeted

elvis

@omarsar0

21 days ago

// Self-Harness: Harnesses That Improve Themselves //

(bookmark this one)

Most of the agent scaffolds we rely on today are built once and remain frozen or mostly unchanged.

The harness, like the skills, needs to evolve with new models.

What if the scaffold rewrites itself?

This new work treats the harness, the prompts, tools, and control flow around the model as a learnable artifact that improves from its own runs rather than staying a fixed wrapper you hand-maintain.

The scaffolding becomes the part that compounds, run after run. If you run long-horizon agents, a self-modifying harness turns scaffold upkeep from manual work into something the system earns on its own.

Paper: https://t.co/byh1MP99xU

Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

537

101

712

32K

ethcat12 retweeted

Elon Litman

@elon_lit

21 days ago

Gradient descent on neural networks frequently drives the sharpest Hessian eigenvalue to exactly 2/learning_rate. This is the Edge of Stability. For five years, ML theory has failed to explain why this happens globally from any initialization. Until now. 🧵

https://t.co/y2E3FF2DdU

510

562

60K
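A note on where the 2/learning_rate figure in the post above comes from. For a one-dimensional quadratic loss with curvature `sharpness`, each gradient descent step multiplies the iterate by `1 - learning_rate * sharpness`, so the iterates shrink only while the sharpness stays below 2/learning_rate. The sketch below illustrates just that classical threshold on a toy quadratic; it is not the thread's result about neural networks, and the function names are made up for illustration.

```python
def stability_threshold(learning_rate):
    """Largest curvature for which plain gradient descent on a quadratic stays stable."""
    return 2.0 / learning_rate


def gradient_descent_on_quadratic(sharpness, learning_rate, steps=50, x0=1.0):
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2 and return the final x.

    Toy model only: 'sharpness' plays the role of the top Hessian eigenvalue.
    """
    x = x0
    for _ in range(steps):
        grad = sharpness * x          # f'(x) for the quadratic above
        x -= learning_rate * grad     # x <- (1 - learning_rate * sharpness) * x
    return x


if __name__ == "__main__":
    lr = 0.1
    print("threshold:", stability_threshold(lr))
    for sharpness in (19.0, 21.0):    # just below and just above 2 / lr
        final = gradient_descent_on_quadratic(sharpness, lr)
        print(f"sharpness={sharpness}: |x| after 50 steps = {abs(final):.4g}")
```

Just below the threshold the iterate oscillates in sign but decays; just above it the oscillation grows.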

ethcat12 retweeted

elie

@eliebakouch

28 days ago

WOW, Microsoft's new "MAI Thinking 1" model comes with a 109-page tech report that looks REALLY detailed, this is amazing

985

118

675

201K

ethcat12 retweeted

Alberto Alfarano

@albe_alfa

29 days ago

Introducing Lattice Deduction Transformers: An 800k-parameter looped transformer that reasons like a SAT solver achieves 100% on Sudoku-Extreme with only 15 minutes of training. A collaboration between @axiommathai, @AmherstCollege and @BarnardCollege.

https://t.co/s0qpAxCLkW

175

226K

ethcat12 retweeted

机器之心 JIQIZHIXIN

@jiqizhixin

about 1 month ago

What if you could speed up AI image generation by 22x without retraining from scratch?

Researchers from Zhejiang University and the University of Adelaide introduce FlashAR.

They add a lightweight vertical prediction head to existing autoregressive models, enabling parallel two-way next-token prediction. This preserves the original training objective while dynamically combining horizontal and vertical predictions.

Result: up to 22.9x speedup for 512x512 image generation, using just 0.05% of the original training data.

FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation

Project: https://t.co/SX3AzDmHKS
Paper: https://t.co/AAioG5WRic
Code: https://t.co/baNSMUVYYD

Our report: https://t.co/gBRu5ktZ4q

📬 #PapersAccepted by Jiqizhixin

143

104

ethcat12 retweeted

Dan Kornas

@DanKornas

about 1 month ago

Training an LLM from scratch is easier to study when the whole path is in one repo.

Train LLM From Scratch is a PyTorch repository for learning how a transformer language model is built, trained, saved, and used for text generation.

It helps you move from “I understand attention on paper” to a runnable training pipeline by pairing model code with data download, preprocessing, config, training, and generation scripts.

Key features:

• Transformer components from scratch – separate PyTorch modules for MLP, attention, transformer blocks, and the final model
• Pile-based data path – scripts download The Pile files and preprocess JSONL.ZST text into tokenized HDF5 datasets
• Configurable training setup – model size, context length, heads, blocks, batch size, learning rate, and file paths live in https://t.co/zuPqaR3MhP
• Hardware guidance – README compares common GPUs for 13M and 2B-class training runs
• Generation workflow included – generate_text.py loads trained checkpoints and produces sample text outputs

It’s open-source (MIT license).

Link in the reply 👇

201

45K

ethcat12 retweeted

Grant Stenger (hiring)

@GrantStenger

about 1 month ago

Local minima are rare in high dimensions because a strict local minimum has to curve upward in every direction, so all Hessian eigenvalues must be positive. In a D-dimensional toy model where eigenvalue signs are independent, that’s a 2^(-D) event. In GOE-like random matrix models, positive definiteness is even rarer, roughly exp(-cD^2). So as dimension grows, random critical points are much more likely to be saddles than minima. This is one reason high-dimensional optimization is often a saddle-escape problem, not a bad-local-minimum problem. Wrote up some of the math here: https://t.co/vkaVqVD64N

191

305K
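The counting argument in the post above can be checked numerically for small D. The sketch below is a rough illustration, not code from the linked write-up: it computes the 2^(-D) value for the independent-sign toy model, then estimates by Monte Carlo how often a GOE-like random symmetric matrix (off-diagonal variance 1, diagonal variance 2, a common convention assumed here) is positive definite. All function names are hypothetical, and positive definiteness is tested with a hand-written Cholesky attempt, which is exact only up to floating-point error.

```python
import math
import random


def independent_sign_probability(dim):
    """Chance that dim independent, fair-sign eigenvalues are all positive."""
    return 2.0 ** (-dim)


def sample_goe(dim, rng):
    """Sample a symmetric Gaussian matrix with GOE-like variances (assumed convention)."""
    m = [[0.0] * dim for _ in range(dim)]
    for i in range(dim):
        m[i][i] = rng.gauss(0.0, math.sqrt(2.0))
        for j in range(i + 1, dim):
            m[i][j] = m[j][i] = rng.gauss(0.0, 1.0)
    return m


def is_positive_definite(m):
    """Try a Cholesky factorization; a non-positive pivot means m is not positive definite."""
    n = len(m)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False
                chol[i][j] = math.sqrt(s)
            else:
                chol[i][j] = s / chol[j][j]
    return True


def fraction_positive_definite(dim, trials, seed=0):
    """Monte Carlo estimate of how often a sampled matrix is positive definite."""
    rng = random.Random(seed)
    hits = sum(is_positive_definite(sample_goe(dim, rng)) for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    for dim in (1, 2, 3, 4):
        toy = independent_sign_probability(dim)
        goe = fraction_positive_definite(dim, trials=20000)
        print(f"D={dim}: independent signs {toy:.4f}, GOE-like estimate {goe:.4f}")
```

For D of 2 and above, the GOE-like estimate should fall below the independent-sign value, since eigenvalues of such matrices repel each other and are not independent.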

ethcat12 retweeted

DAIR.AI

@dair_ai

about 1 month ago

https://t.co/EU2rcF5M1f

862

120

149K

ethcat12 retweeted

elvis

@omarsar0

about 1 month ago

// Memory as Connectivity //

One of the cleaner reframings of agent memory I have seen this month.

FluxMem treats memory as the continuously evolving topology of a heterogeneous graph.

Three stages run together: initial connection formation, feedback-driven refinement, and long-term consolidation of recurrent successful trajectories into reusable procedural circuits. During execution, it repairs missing links, prunes interference, and aligns abstraction granularity.

SOTA on LoCoMo, Mind2Web, and GAIA across three distinct memory regimes.

Paper: https://t.co/uNrdgGX4jC

Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

443

413

25K

ethcat12 retweeted

Sakana AI

@SakanaAILabs

about 1 month ago

Introducing DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation https://t.co/c9AvsRKybj

What if we didn’t have to hold an entire neural network in memory to train it?

Standard neural net training optimizes all parameters jointly. As a result, the memory required during training grows linearly with the depth of the network.

In our #ICLR2026 paper, we propose DiffusionBlocks, a principled framework to train networks one block at a time, drastically reducing memory requirements while matching end-to-end performance.

With DiffusionBlocks, we split the network into blocks and train them one at a time, so you only need memory for a single block.

How? We explicitly assign each block a role: to move the representation a little closer to the target than the block before it did. That role turns out to be precisely what a diffusion model does, step by step. Each block only needs to optimize its own objective and can be trained independently.

We validated this across five different architectures:

• ViT
• DiT
• Masked diffusion
• Autoregressive transformers
• Recurrent-depth transformers

In each case, performance is competitive with end-to-end training while using a fraction of the memory.

This perspective also extends naturally to recurrent-depth (Looped) transformers, which apply the same network iteratively and normally require expensive backpropagation through time (BPTT). Viewed through DiffusionBlocks, we can replace those multiple iterations with a single forward pass during training.

Read our paper and code to learn more.

Paper: https://t.co/CRj96VGYQn
GitHub: https://t.co/eNW0K9Xh8E 🐟

367

873K
