Alessio Serra @aserra___ - Twitter Profile

Pinned Tweet

about 2 years ago

A dream is what you think about before you fall asleep in your bed. A project is what you think about in the morning when you wake up to plan your actions. Don’t follow your dreams, build projects.

6

81

2

4

6K

Alessio Serra @aserra___

about 7 hours ago

nice moe design approach: start from the serving bottlenecks in latency- and throughput-sensitive regimes, shrink the expert dimension, then reinvest the savings into larger top-k + more experts.

better accuracy per flop.

https://t.co/jwSF8p146I https://t.co/6MsOgxbJPY

0

47
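The trade-off in the tweet above (smaller experts, larger top-k, more experts) can be checked with back-of-envelope arithmetic. This is a minimal sketch, not the linked work's method: the layer sizes are made up, each expert is assumed to be a gated MLP with three weight matrices, and the function names are hypothetical. It only shows that halving the expert dimension while doubling top-k and the expert count keeps active FLOPs and total expert parameters unchanged. It says nothing about accuracy or serving latency.

```python
import math


def active_ffn_flops_per_token(d_model: int, d_expert: int, top_k: int) -> int:
    """Approximate routed-expert FLOPs per token for one MoE layer.

    Assumption: each expert is a gated MLP with three d_model x d_expert
    matrices, and one multiply-accumulate counts as 2 FLOPs.
    """
    return 2 * 3 * d_model * d_expert * top_k


def total_expert_params(d_model: int, d_expert: int, num_experts: int) -> int:
    """Parameters held by all routed experts in one layer (same assumption)."""
    return 3 * d_model * d_expert * num_experts


def routing_combinations(num_experts: int, top_k: int) -> int:
    """Number of distinct expert subsets the router can pick for a token."""
    return math.comb(num_experts, top_k)


# Hypothetical configurations, chosen only for illustration.
baseline = {"d_model": 4096, "d_expert": 2048, "top_k": 8, "num_experts": 64}
fine_grained = {"d_model": 4096, "d_expert": 1024, "top_k": 16, "num_experts": 128}

for name, cfg in (("baseline", baseline), ("fine_grained", fine_grained)):
    flops = active_ffn_flops_per_token(cfg["d_model"], cfg["d_expert"], cfg["top_k"])
    params = total_expert_params(cfg["d_model"], cfg["d_expert"], cfg["num_experts"])
    combos = routing_combinations(cfg["num_experts"], cfg["top_k"])
    print(name, flops, params, combos)
```

Under these assumptions the two configurations cost the same per token and store the same number of expert parameters, while the fine-grained one gives the router far more expert subsets to choose from.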

Alessio Serra @aserra___

1 day ago

the discretization approach used here reminds me of a CPU clock for synchronizing state updates

every 200ms, audio/video/text streams are sliced into one timestep and fed to the model as the next interaction step

i wonder how much you can transfer from computer architecture ideas https://t.co/NIfZKFhfvp

Thinking Machines

@thinkymachines

23 days ago

People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way. We share our approach, early results, and a quick look at our model in action. https://t.co/AFJZ5kH7Ku

461

16K

2K

12K

8M

0

2

0

91
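The clock-style discretization mentioned in the tweet above can be illustrated with a small bucketing routine. This is a sketch under stated assumptions, not the quoted model's pipeline: only the 200 ms step length comes from the tweet, while the event format `(timestamp_ms, modality, payload)` and the function name are hypothetical.

```python
from collections import defaultdict

TIMESTEP_MS = 200  # step length quoted in the tweet


def slice_into_timesteps(events, timestep_ms=TIMESTEP_MS):
    """Group (timestamp_ms, modality, payload) events into fixed-length steps.

    Returns one dict per timestep, from step 0 to the last step seen, mapping
    modality -> list of payloads that arrived during that step. Steps with no
    input are kept as empty dicts so the "clock" never skips a tick.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for timestamp_ms, modality, payload in events:
        step = timestamp_ms // timestep_ms
        buckets[step][modality].append(payload)
    if not buckets:
        return []
    last_step = max(buckets)
    return [dict(buckets[s]) if s in buckets else {} for s in range(last_step + 1)]


# Hypothetical interleaved stream: timestamps in milliseconds.
events = [
    (0, "audio", "a0"),
    (150, "text", "hi"),
    (210, "video", "v1"),
    (650, "audio", "a3"),
]
steps = slice_into_timesteps(events)
for index, step in enumerate(steps):
    print(index, step)
```

Each element of `steps` stands for what would be handed to the model as one interaction step. Empty steps are kept on purpose, on the assumption that a silent tick still advances the model's state.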

Alessio Serra @aserra___

2 days ago

EMO gets semantic-level expert specialization by forcing tokens from the same document to route within a shared expert pool.

cool result: for a domain/use case, you can keep only 25% of experts and get just a 1% absolute performance drop.

https://t.co/T3STIO0Wja https://t.co/YWZO9tC47A

0

1

50

aserra___ retweeted

MiniMax (official) @MiniMax_AI

3 days ago

Introducing MiniMax M3: The First Open-Weights Model to Combine Three Frontier Capabilities

- Coding & Agentic Frontier: 59.0% SWE-Bench Pro, 66.0% Terminal Bench 2.1, 34.8% SWE-fficiency, 28.8% KernelBench Hard, 74.2% MCP Atlas
- MiniMax Sparse Attention scales context to 1M
- Natively Multimodal from Step Zero

API: https://t.co/fHRdSV7BwZ
Token Plan: https://t.co/BDCycxepZw
🚀New! MiniMax Code: https://t.co/GvB4YiB6Ul

Weights & Tech Report in ~10 Days

527

8K

1K

3K

3M

Alessio Serra @aserra___

3 days ago

next level engineering rounding

0

31

Alessio Serra @aserra___

4 days ago

@swpnldubey he is on track to pass the probation period

0

15

aserra___ retweeted

Tilde

@tilderesearch

5 days ago

~1/7~Introducing Parallax → a stronger attention variant that achieves a Pareto improvement over vanilla attention at 0.6B and 1.7B scales.

Parallax has better perplexity, better downstream accuracy, and a decode kernel that matches or beats FlashAttention.

🧵 https://t.co/9MOf9QpTrl

7

508

64

423

89K

Alessio Serra @aserra___

5 days ago

@dwarkesh_sp non-verifiable domains do not exist, they are simply domains more difficult to verify

0

1

0

95

Alessio Serra @aserra___

5 days ago

llama may have made long context extension look easier than it is.

OlmPool holds data + extension fixed and finds llama 3 extends far better than qwen-3 and olmo-3, mainly because it avoids qk norm and sliding window attention while pretraining at 8K.

https://t.co/o4yZMw8BGC https://t.co/yiD0sEpuhr

0

63

Alessio Serra @aserra___

6 days ago

@MoFromYYZ i know an openai engineer who spends $1.3M per month.

Peter Steinberger 🦞

@steipete

19 days ago

The latest CodexBar update renders API costs wayyyy nicer. https://t.co/lJ4dxNHwzG

380

4K

185

2K

3M

1

4

1

0

1K

Alessio Serra @aserra___

6 days ago

@paulg The interesting thing is that colors probably disappeared because they were cheaper and easier to standardize. But now we consider black/grey/white elegant. Something colourful is the exception and sometimes a marketing move, e.g. the iPhone 17 Pro.

0

41

Alessio Serra @aserra___

6 days ago

@elonmusk so now only google and cohere are left using jax?

1

0

42

Alessio Serra @aserra___

6 days ago

@AnthropicAI @AltimeterCap @Greenoaks @sequoia we need a new term for the $1T pre-IPO valuation. the second case is already within rounding distance

0

150

Alessio Serra @aserra___

6 days ago

we need a new term for the $1T pre-IPO valuation. the second case is already within rounding distance

Anthropic

@AnthropicAI

6 days ago

We've raised $65 billion in Series H funding at a $965 billion post-money valuation, led by @AltimeterCap, Dragoneer, @Greenoaks, and @sequoia. This investment will help us advance our research and expand our capacity to meet growing demand for Claude.

1K

22K

2K

8M

0

2

0

82

Alessio Serra @aserra___

6 days ago

wonder how much data from mythos made it into this post-train

opus 4.8’s big deltas are USAMO + GraphWalks, as was the case for mythos

Claude

@claudeai

6 days ago

Introducing Claude Opus 4.8: it builds on Opus 4.7 with sharper judgment, more honesty about its own progress, and the ability to work independently for longer than its predecessors.

Available today at the same price. https://t.co/EufxL7T1kb

4K

67K

9K

8K

15M

0

60

Alessio Serra @aserra___

6 days ago

@himanshustwts margins are already good, there’s no need to increase prices until a new pretraining of a larger model

0

98

Alessio Serra @aserra___

7 days ago

interesting direction for memory consolidation at the cache eviction boundary

before clearing a filled KV cache, the model runs N recurrent passes over accumulated context and updates persistent fast weights inside its SSM blocks

https://t.co/awMF7rde2c https://t.co/mgFXwKaJZq

0

1

0

52

Alessio Serra @aserra___

8 days ago

most impressive result is mimo-v2.5-pro from @Xiaomi: it beats deepseek-v4-pro despite being smaller (1.02T-A42B vs 1.6T-A49B)

Serena Ge (Datacurve)

@serenaa_ge

8 days ago

Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks.

On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work. https://t.co/HCDcjNuTFK

511

6K

753

3K

2M

1

0

200

Alessio Serra @aserra___

8 days ago

source:

Skyler Miao

@SkylerMiao7

8 days ago

Something BIG is coming

206

3K

343

1K

934K

0

52

Alessio Serra @aserra___

8 days ago

new sparse attention variant for 1M context from @MiniMax_AI

it adds a tiny GQA index branch to pick top-k KV blocks, then runs attention on the original KV for those blocks.

closer to CSA-style block routing than DSA, but without doing attention in compressed MLA space https://t.co/ydbv6SHaDx

1

0

125
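The "pick top-k KV blocks, then attend on the original KV" idea in the tweet above can be sketched for a single query and a single head. This is an illustration under assumptions, not MiniMax's implementation: the cheap index branch is stood in for by small `index_q` / `index_keys` vectors, each block is scored by its best-matching index key (an arbitrary choice here), and every name is hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def block_sparse_attention(q, keys, values, index_q, index_keys, block_size, top_k):
    """Index-then-attend for one query vector.

    1. Score every KV block with the small index vectors (assumed scoring rule:
       the maximum index-key match inside the block).
    2. Keep the top_k blocks.
    3. Run ordinary softmax attention over the original keys/values of the
       tokens in those blocks only.
    Returns (output_vector, sorted list of chosen block ids).
    """
    n = len(keys)
    num_blocks = math.ceil(n / block_size)
    block_scores = []
    for b in range(num_blocks):
        start, end = b * block_size, min((b + 1) * block_size, n)
        block_scores.append(max(dot(index_q, index_keys[t]) for t in range(start, end)))

    ranked = sorted(range(num_blocks), key=lambda b: block_scores[b], reverse=True)
    chosen = sorted(ranked[:top_k])
    token_ids = [
        t for b in chosen for t in range(b * block_size, min((b + 1) * block_size, n))
    ]

    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, keys[t]) * scale for t in token_ids])
    dim = len(values[0])
    out = [sum(w * values[t][d] for w, t in zip(weights, token_ids)) for d in range(dim)]
    return out, chosen


# Toy data: 8 tokens, blocks of 2, keep 2 of the 4 blocks.
keys = [[1.0, 0.0] for _ in range(8)]
values = [[float(t)] for t in range(8)]
q = [1.0, 0.0]
index_q = [1.0]
index_keys = [[0.0], [0.1], [0.9], [0.2], [0.0], [0.3], [0.8], [0.1]]

out, chosen = block_sparse_attention(q, keys, values, index_q, index_keys, 2, 2)
print(chosen, out)
```

The cost of the full-precision attention step scales with `top_k * block_size` rather than with the full context length, and the index scores are only used for selection, never for the attention weights themselves.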
