Steven Normore

21 days ago

Attention isn't all you need - some attention is.

I've been exploring ways to reduce the compute and memory footprint of LLMs. I put it together in a repo: Calibrated Sparse Attention (CSA) is a one-time calibration pass on an existing model that figures out which keys each head in each layer actually uses, and you skip the rest.

At matched LM quality I'm seeing potential for:
• ~10× longer context at the same memory + bandwidth budget
• ~10× less KV-read bandwidth per token (with a sparse kernel)
• ~10× fewer attention FLOPs
• ~10× more requests per GB of KV memory (composed with eviction)

No retraining or labels needed.
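The post does not say how the calibration pass works, so the following is only a rough sketch of one way such a scheme could look, not the repo's method. It assumes a per-head key budget chosen so that the kept keys cover a target share of attention mass on calibration data. The function names, the 0.95 mass threshold, and the scalar values are all illustrative assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def keys_needed(weights, mass=0.95):
    """Smallest number of keys whose weights sum to at least `mass`."""
    covered = 0.0
    for count, w in enumerate(sorted(weights, reverse=True), start=1):
        covered += w
        if covered >= mass:
            return count
    return len(weights)

def calibrate_head(calibration_rows, mass=0.95):
    """Assumed calibration rule: take the worst-case budget over all rows."""
    return max(keys_needed(softmax(row), mass) for row in calibration_rows)

def sparse_attention(scores, values, budget):
    """Attend only to the `budget` highest-scoring keys and skip the rest."""
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:budget]
    weights = softmax([scores[i] for i in keep])
    return sum(w * values[i] for w, i in zip(weights, keep))

# Hypothetical calibration data for one head: two query rows over four keys.
rows = [[10.0, 0.0, 0.0, 0.0], [5.0, 5.0, 0.0, 0.0]]
budget = calibrate_head(rows)
output = sparse_attention(rows[1], [1.0, 3.0, 100.0, 100.0], budget)
```

In this toy case the head ends up reading two of its four keys, which is where a bandwidth saving would come from if real heads behave the same way.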

snormore retweeted

Currently @heymantle, formerly @goclio @eveyevents @Shopify

22 days ago

Been on a quest to make models run with a smaller memory footprint, especially locally on my laptop, so I built a hybrid SSM+attention model with three memory layers:
• minimal attention KV for turns
• SSM state, MB-scale memory footprint for episodic
• LoRA from captures for long-term learning

Long context without the memory-bandwidth tax of GB KV caches, no prompt inflation. I'm curious to see how it scales to larger models. https://t.co/9ysdROTlBB

snormore retweeted

tobi lutke

@tobi

26 days ago

https://t.co/5MJ7u1VHwf

about 1 month ago

Let agents write code and deploy it in sandboxed runtimes, with cron, webhooks, or MCP as primitives: https://t.co/Kri5Jak8iG

about 1 month ago

"Make each program do one thing well" assumed humans were writing them. Slow, deliberate, by hand.

Agents flip this. They can author small, single-purpose programs constantly, for one task, for one user, torn down when no longer useful. The philosophy scales in a way its authors couldn't have imagined.

But durable work needs somewhere to live. Not a conversation context that evaporates. Not a workflow SaaS built for humans clicking through a GUI.

cue is that runtime. Agents push code into it over MCP. Actions they author become callable: by a schedule, a webhook, an app the agent spun up, another agent. Each call runs in a fresh sandboxed VM, so scale doesn't mean blast radius.

Here's how an agent OS starts.

snormore retweeted

DoubleZero @doublezero

about 1 month ago

On @Solana, blocks are broken into smaller packets called shreds. They're the earliest public representation of what's happening onchain. How those shreds travel determines who sees what, and when.

about 2 months ago

@jnormore > locked in by an implementation decision they can barely remember making

snormore retweeted

about 2 months ago

My favourite part is the simulator https://t.co/XYMSpzM8b6

about 2 months ago

The control plane for your LLMs...

about 2 months ago

most teams have no view into what LLM calls they're actually making, or how cost breaks down across their business.

they can't even try a different provider or model version, locked in by an implementation decision they can barely remember making, and the unknown of how each model behaves is too much.

most companies are juggling thousands of API keys and Claude Max subs so employees can use AI. no way to route by use case, no idea which teams are using which models.

i've been working on modelux to fix this. the control plane for your LLMs. https://t.co/IOLyhBbgIF

about 2 months ago

Shreds over multicast on DoubleZero. This was a fun project, and just getting started.

Malbec Labs

@MalbecLabs_xyz

about 2 months ago

https://t.co/RKtmlMgqu3

snormore retweeted

Malbec Labs

@MalbecLabs_xyz

about 2 months ago

Multicast allows the network to handle replication, rather than pushing that responsibility into application-layer systems. Data is transmitted once and delivered across a shared distribution path, reducing divergence in arrival time and improving consistency across receivers.

snormore retweeted

DoubleZero @doublezero

about 2 months ago

Market data over the internet works. Solana shreds over DoubleZero Edge wins. DoubleZero Edge beta is live. ↓

snormore retweeted

about 2 months ago

You don’t always need a bigger LLM, just more diverse ones. So I built an ensemble inference proxy that sends prompts to multiple small models in parallel and combines their responses.

Initial results look great!
• gpt-4.1-mini + haiku + qwen 3b (local): 74% accuracy
• GPT-5 alone: 73%
• Claude Sonnet: 74%

This ensemble config is 13x cheaper and 2.5x faster than GPT-5. And I haven’t even tested other providers yet.

The trick: cross-provider diversity. Same-family ensembles do nothing. But models from different providers make different mistakes, and that's exploitable.

Tested 27 configurations across 6 aggregation strategies. The best ensemble beats GPT-5 on knowledge tasks by 8 percentage points.

Easy to experiment with your own configurations, just a YAML and emerge sweep. https://t.co/gbLun7Wbuq
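The post does not name its six aggregation strategies, so here is a minimal sketch of the general idea using plain majority voting, which is an assumption rather than the proxy's actual logic. The models are hypothetical stand-in callables; a real setup would wrap provider API clients instead.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def ensemble_answer(prompt, models):
    """Send `prompt` to every model in parallel and return the majority answer.

    `models` is a list of callables taking a prompt and returning a string.
    Returns (winning answer, vote count, raw answers in model order).
    """
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        answers = list(pool.map(lambda model: model(prompt), models))
    # Normalize so trivially different spellings count as the same vote.
    normalized = [a.strip().lower() for a in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes, answers

# Hypothetical stand-ins for three models from different providers.
models = [
    lambda prompt: "Paris",
    lambda prompt: "paris ",
    lambda prompt: "Lyon",
]
winner, votes, raw = ensemble_answer("What is the capital of France?", models)
```

Voting only helps when the models' errors are not strongly correlated, which fits the post's observation that same-family ensembles add nothing.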

snormore retweeted

about 2 months ago

Autonomous code/agent optimization: LLM proposes optimization ideas. Genetic algorithms evolve the best combinations. That's cEvolve. Benchmarks show ~60% faster convergence (before parallelization) and more likely to hit top-tier results. Inspired by @karpathy’s autoresearch. https://t.co/US78PwNP6P
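To make "genetic algorithms evolve the best combinations" concrete, here is a small sketch under stated assumptions. Each candidate is a tuple of on/off flags over a fixed list of optimization ideas, and `benchmark` is a hypothetical fitness function standing in for a real benchmark run. The selection, crossover, and mutation choices are generic textbook ones, not taken from cEvolve.

```python
import random

def evolve(num_ideas, fitness, generations=20, population_size=12, seed=0):
    """Search for a good on/off combination of ideas (requires num_ideas >= 2)."""
    rng = random.Random(seed)
    population = [
        tuple(rng.random() < 0.5 for _ in range(num_ideas))
        for _ in range(population_size)
    ]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: population_size // 2]  # keep the fitter half unchanged
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, num_ideas)  # single-point crossover
            child = list(a[:cut] + b[cut:])
            flip = rng.randrange(num_ideas)  # mutate exactly one flag
            child[flip] = not child[flip]
            children.append(tuple(child))
        population = parents + children
    return max(population, key=fitness)

# Hypothetical per-idea gains; ideas 0 and 2 conflict when both are enabled.
GAINS = [3.0, 1.0, 2.0, 0.5, 1.5]

def benchmark(mask):
    score = sum(gain for gain, enabled in zip(GAINS, mask) if enabled)
    if mask[0] and mask[2]:
        score -= 4.0
    return score

best = evolve(len(GAINS), benchmark)
```

Because the fitter half survives each generation untouched, the best score found never gets worse as generations increase.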

2 months ago

@GuiBibeau @eden_ What's DB routing in this context? 😅

snormore retweeted