Ahmed Morsi

@eramax

Joined November 2023

2.3K Following

60 Followers

1.5K Posts

eramax retweeted

Rosinality @rosinality

about 15 hours ago

https://t.co/bsAl3FGITv

Why does power-law scaling occur? The loss of individual tokens follows a sigmoidal curve, and aggregating these curves, each with a different time of learning, produces a power-law curve. Quanta hypothesis again? https://t.co/Iw2qn3m6QE
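The aggregation claim in the tweet above can be checked numerically with a toy model. This is only a sketch under stated assumptions, not the linked paper's analysis: token frequencies are assumed Zipf-like (`p_k ∝ k^-(alpha+1)`), each token's loss is assumed to be a sigmoid in log-time, and the learning time `t_learn = 1 / p_k` is a hypothetical choice in the spirit of the quanta hypothesis. Under these assumptions the frequency-weighted mean loss should fall roughly as `t^(-alpha / (alpha + 1))`.

```python
import math

def token_loss(t, t_learn, sharpness=4.0):
    """Sigmoid in log-time: close to 1 before t_learn, close to 0 after."""
    return 1.0 / (1.0 + (t / t_learn) ** sharpness)

def aggregate_loss(t, num_tokens=20000, alpha=1.0):
    """Frequency-weighted mean loss over tokens with Zipf-like frequencies."""
    weights = [k ** -(alpha + 1.0) for k in range(1, num_tokens + 1)]
    total = sum(weights)
    loss = 0.0
    for w in weights:
        p = w / total        # token frequency
        t_learn = 1.0 / p    # assumption: rarer tokens are learned later
        loss += p * token_loss(t, t_learn)
    return loss

def loglog_slope(t_lo=1e3, t_hi=1e5, alpha=1.0):
    """Slope of log(loss) against log(t) between two training times."""
    lo = aggregate_loss(t_lo, alpha=alpha)
    hi = aggregate_loss(t_hi, alpha=alpha)
    return (math.log(hi) - math.log(lo)) / (math.log(t_hi) - math.log(t_lo))

if __name__ == "__main__":
    alpha = 1.0
    print("measured slope:", round(loglog_slope(alpha=alpha), 3))
    print("predicted slope:", -alpha / (alpha + 1.0))
```

No single token follows a power law here; the straight line on a log-log plot only appears once many sigmoids with spread-out learning times are summed.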

eramax retweeted

Xiaomi MiMo

@XiaomiMiMo

about 17 hours ago

Open-weights are taking over — proud to see more developers building on MiMo. 👏 @cline

232

24K

eramax retweeted

Tongyi Lab @Ali_TongyiLab

about 13 hours ago

We're introducing our latest research paper HydraHead, a new attention hybridization architecture that fuses Full Attention and Linear Attention at the head level.

Motivated by insights from mechanistic interpretability, HydraHead treats the attention head—not the layer—as the natural granularity for attention hybridization to build more efficient long-context models.

A short thread 🧵

369

249

189K
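To make "hybridization at the head level" from the HydraHead announcement concrete, here is a minimal pure-Python sketch in which some heads of one layer run causal softmax (full) attention and others run causal linear attention. It is not the HydraHead architecture: the tweet does not say how heads are assigned, which linear-attention variant is used, or how outputs are mixed. The `elu(x) + 1` feature map, the `head_kinds` list, and all function names are illustrative assumptions.

```python
import math

def softmax_head(q, k, v):
    """Causal softmax attention for one head; q, k, v are lists of vectors."""
    out = []
    for i, qi in enumerate(q):
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(len(qi))
                  for j in range(i + 1)]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(w[j] * v[j][d] for j in range(i + 1)) / z
                    for d in range(len(v[0]))])
    return out

def phi(x):
    """Positive feature map elu(x) + 1 (an assumed, commonly used choice)."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def linear_head(q, k, v):
    """Causal linear attention for one head using a fixed-size running state."""
    dk, dv = len(k[0]), len(v[0])
    state = [[0.0] * dv for _ in range(dk)]  # running sum of phi(k) v^T
    norm = [0.0] * dk                        # running sum of phi(k)
    out = []
    for qi, ki, vi in zip(q, k, v):
        fk, fq = phi(ki), phi(qi)
        for a in range(dk):
            norm[a] += fk[a]
            for d in range(dv):
                state[a][d] += fk[a] * vi[d]
        z = sum(fq[a] * norm[a] for a in range(dk))
        out.append([sum(fq[a] * state[a][d] for a in range(dk)) / z
                    for d in range(dv)])
    return out

def hybrid_attention(heads, head_kinds):
    """Run each head with its own attention type and concatenate per token."""
    fns = {"full": softmax_head, "linear": linear_head}
    per_head = [fns[kind](q, k, v) for (q, k, v), kind in zip(heads, head_kinds)]
    seq_len = len(per_head[0])
    return [[x for h in per_head for x in h[t]] for t in range(seq_len)]

if __name__ == "__main__":
    q = [[0.1, 0.2], [0.3, -0.4], [0.5, 0.6]]
    k = [[0.2, 0.1], [-0.3, 0.4], [0.6, 0.5]]
    v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    for row in hybrid_attention([(q, k, v), (q, k, v)], ["full", "linear"]):
        print([round(x, 3) for x in row])
```

The softmax head revisits every earlier key for each new token, while the linear head only updates a `dk x dv` state. That difference is the long-context saving that mixing head types within a layer is meant to capture.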

eramax retweeted

Sebastian Raschka

@rasbt

about 11 hours ago

After 18 months of writing, coding, and experimenting, Build a Reasoning Model (From Scratch) is finally out!

My first copies just arrived! 📚

440 full-color pages. Inference scaling, reinforcement learning, and distillation from scratch. https://t.co/647ksI7sLc

202

369

264K

eramax retweeted

WeZZard

@realWeZZard

about 17 hours ago

I gave GLM 5.2 eyes.

eramax retweeted

ModelScope

@ModelScope2022

about 13 hours ago

Introducing Agents-A1, a 35B MoE agentic model built for long-horizon tasks across search, engineering, scientific research, instruction following, and tool calling.
🤖 https://t.co/Xkm6OJ3Ivf

📚 256K context length + 🧠 Agentic reasoning

🏆 Reaches SOTA results on long-horizon search, scientific research, and instruction-following benchmarks, with competitive results among 35B-class models.

🛠️ Supports function calling and tool integration, enabling interaction with APIs, code interpreters, search engines, and other external tools.

821

716

110K

eramax retweeted

Alex Prompter

@alex_prompter

1 day ago

Harvard, Andrew Ng, and Karpathy will teach you AI engineering for free. Most people just do it in the wrong order:

Almost all of it is free, and the order matters as much as the resources.

1. Start with Python. It's the language the AI field runs on, and Harvard's CS50P teaches it better than most paid bootcamps.

2. Once the basics click, learn how Python is used in AI. Andrew Ng's "AI Python for Beginners" is a free four-part course that bridges writing code and building with models.

3. From there, get a feel for how LLMs work under the hood. 3Blue1Brown's visual explainers make transformers and attention click.

4. When you want to go deeper, build a small model yourself. Andrej Karpathy's "Zero to Hero" series takes you from one neuron to a working model, line by line.

5. Next, learn how AI agents actually work. Anthropic's "Building Effective Agents" is the most grounded guide, and its lesson is to use composable patterns, not heavy frameworks.

6. For hands-on practice, take the CrewAI short course. It teaches you to treat agents like a team of people working together.

7. After that, connect your agents to the real world. That's what MCP does, wiring models to tools, APIs, and databases, and the official docs are the cleanest place to start.

8. Now build real projects. The open-source ai-engineering-hub repo has dozens of working examples across LLMs, RAG, and agents you can adapt into your own work.

9. Finally, read one book instead of ten. Chip Huyen's "AI Engineering" covers what you need to ship real applications.

The throughline is simple. Frameworks come and go, so don't build your skills around them. Master the fundamentals once, and everything on top gets easier, and you'll stay ahead of the people chasing the framework of the week.

883

173

45K

eramax retweeted

Grigory Sapunov

@che_shr_cat

1 day ago

1/
A 5M-parameter model just beat frontier LLMs on hard logical puzzles at less than 1/100,000th of the inference cost.

How? By scaling test-time compute in continuous latent space rather than discrete token space.

Let's unpack how this works. 🧵 https://t.co/fYZIyEFqpI

661

670

27K

eramax retweeted

mohit

@mohitwt_

1 day ago

DeepSeek's DSpark compared with DFlash and EAGLE-3 on MI300X

each method uses a ~3B parameter draft head that proposes 7 tokens at a time, which the 14B target model (Qwen) verifies in parallel. tested on three real world workloads:
- HumanEval (164 code problems)
- MT-Bench (80 multi-turn conversations)
- AIME 2025 (30 competition math problems).

DFlash uses a lightweight block based MLP draft head, it proposes tokens in fixed size blocks and relies on the draft head's speed to keep the target model fed. minimal overhead per proposal, but no mechanism to stop early when the draft is uncertain.

EAGLE-3 uses a TTT (Test Time Training) based draft head. it's more sophisticated in how it models token distributions, but the TTT mechanism adds computational cost per proposal. the tradeoff, potentially better acceptance rates in exchange for heavier per-iteration work.

DSpark also uses a block based draft head like DFlash, but adds a confidence head that predicts which tokens the draft is uncertain about. this enables early stopping: if the confidence head detects low confidence, verification stops early, saving compute on tokens that would likely be rejected anyway.

DSpark dominates every metric on every dataset. it achieves the highest acceptance length (avg 4.65 tokens vs 3.89 for DFlash and 3.50 for EAGLE-3), which translates directly into the fastest throughput (avg 127 tok/s vs 111 for DFlash and 81 for EAGLE-3), and it completes all benchmarks in the least wall time.

the confidence head's early stopping mechanism is the key differentiator, it avoids wasting target model compute on low-confidence tokens, which compounds into significant time savings across hundreds of proposals per benchmark.

the improvement over DFlash is notable:
+20% acceptance length
+14% throughput

DSpark also achieves the highest verify rates (67% on HumanEval, 62% on AIME25, 46% on MT-Bench), confirming that early stopping doesn't sacrifice quality, it just avoids wasting compute on tokens that would be rejected anyway.

for EAGLE-3, the heavier per-proposal computation from the TTT mechanism appears to outweigh any gains in draft quality, especially on a single GPU where compute efficiency matters more than draft accuracy.

on multi-GPU setups with more compute headroom, EAGLE-3 might be closer, but for single GPU inference, DSpark's lightweight + early stopping approach is clearly superior.

16K
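The early-stopping idea described in the DSpark comparison above can be illustrated with one toy propose-and-verify round. This is a sketch under assumptions, not DSpark's implementation: the thread does not specify the confidence head, the threshold, or the acceptance rule, so greedy exact-match verification is used here, and the `threshold` parameter, function names, and token values are all hypothetical.

```python
def propose(draft_tokens, confidences, block_size=7, threshold=None):
    """Return the draft block, cut before the first low-confidence token."""
    block = []
    for tok, conf in list(zip(draft_tokens, confidences))[:block_size]:
        if threshold is not None and conf < threshold:
            break  # early stop: do not send uncertain tokens to the target
        block.append(tok)
    return block

def verify(block, target_tokens):
    """Length of the longest block prefix the target model agrees with."""
    accepted = 0
    for tok, ref in zip(block, target_tokens):
        if tok != ref:
            break
        accepted += 1
    return accepted

def round_stats(draft_tokens, confidences, target_tokens, threshold=None):
    """Verified, accepted, and wasted positions for one speculative round."""
    block = propose(draft_tokens, confidences, threshold=threshold)
    accepted = verify(block, target_tokens)
    return {"verified": len(block), "accepted": accepted,
            "wasted": len(block) - accepted}

if __name__ == "__main__":
    draft = [5, 9, 2, 7, 7, 1, 3]          # made-up token ids
    conf = [0.99, 0.95, 0.90, 0.40, 0.35, 0.30, 0.20]
    target = [5, 9, 2, 8, 4, 1, 3]
    print("fixed block:", round_stats(draft, conf, target))
    print("early stop: ", round_stats(draft, conf, target, threshold=0.5))
```

In this example both variants accept the same three tokens, but the fixed block also sends four positions to the target that end up rejected. That rejected work is the waste the thread attributes to methods with no early-stopping mechanism.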

eramax retweeted

Meituan LongCat

@Meituan_LongCat

about 22 hours ago

Introducing LongCat-2.0 🐱
1.6T parameters · MoE with ~48B active · 1M context
The full model behind Owl Alpha on @OpenRouter — now available.

Built for agentic coding from the ground up:
◆ LongCat Sparse Attention (LSA) — scales efficiently for 1M-context tokens
◆ Zero-Compute Experts — dynamic activation 33B–56B per token, zero wasted compute
◆ MOPD — three specialized expert groups (Agent / Reasoning / Interaction), gate-routed per task

How it stacks up:
→ Terminal-Bench 2.1: 70.8
→ SWE-bench Pro: 59.5 (GPT-5.5: 58.6)
→ SWE-bench Multilingual: 77.3
→ FORTE: 73.2 · RWSearch: 78.8 · BrowseComp: 79.9

📖 Tech Blog: https://t.co/4KrjyKiDBn
Try it across different scenarios 🧵👇

188

372

947K

eramax retweeted

徐伦

@Jolyne_AI

2 days ago

Open-source CUDA tutorial on GitHub: LeetCUDA (from beginner to advanced, all in one place)

200+ progressively harder hands-on CUDA kernel exercises, plus a companion HGEMM library whose performance reaches 98%-100% of cuBLAS. There are also 100+ high-performance computing blog posts focused on key techniques and optimization methods, helping you go from "can write it" to "can write it fast and reliably."

GitHub: https://t.co/7NPsVWuRf1

Carefully designed for beginners, it uses PyTorch to lay out a clear path: write it correctly → write it fast → approach library-level performance.

Suitable for developers who want to master CUDA systematically, and as a reference and growth path for AI engineers working on LLM inference optimization.

405

407

13K

Ahmed Morsi @eramax

1 day ago

@mehulmpt can u recommend it over opencode go subscription and quota limits?

208

eramax retweeted

Cline

@cline

1 day ago

We’ve been impressed with GLM-5.2, so we are introducing a $9.99/month subscription that gives you 2-5x discounted access to it and other open-weight models like DeepSeek, Kimi, MiniMax, MiMo, and Qwen. Use it on Cline CLI & IDE, with a $1.99 special promo if you sign up via: npm i -g cline

333

252

972K

eramax retweeted

Hugh Zhang

@hughbzhang

1 day ago

A question I’ve been pondering: what if we'd known about o1 / RL on chain-of-thought back in the early days of LLMs?

It turns out SFT + a bit of RL on GPT-2 almost matches the performance of a fine-tuned GPT-3 (12b) on GSM8K — a model with >100x the pre-training compute. https://t.co/9vcD1GxiPi

510

215

50K

eramax retweeted

Dan Kornas

@DanKornas

2 days ago

🔗 GitHub: https://t.co/m3ZyrCZfPZ --- ✉️ If you’re into AI, ML, agents, and building real systems, join my newsletter (it’s free): https://t.co/zJ9uwd6qSd

eramax retweeted

Ronin

@DeRonin_

1 day ago

My entire AI stack is now Chinese 🇨🇳

87% cheaper. same revenue

swaps by task:

1. reasoning / backend brain
Opus 4.8 → Kimi K2.7
benchmark gap: ~8% · price: ~11x cheaper

2. code generation
GPT-5.5 → Qwen 3.7 Max
benchmark gap: ~18% · price: ~7x cheaper

3. agent loops + tool calling
Sonnet 4.7 → GLM 5.2
benchmark gap: ~3% · price: ~5x cheaper on input

4. cheap volume / bulk processing
GPT-5.5 mini → MiMo V2.5
benchmark gap: ~6% · price: ~12x cheaper

5. image generation
GPT-Image-2 → Wan 2.5
benchmark gap: ~5% · price: ~8x cheaper

6. video generation
Sora 2 → Kling 3.0
benchmark gap: roughly equal · price: ~6x cheaper

[ result after 30 days: ]

operating costs dropped 87%, output quality dropped 4% on average, revenue unchanged

the most important thing is that these models will not be banned in a month and i can run them locally

nobody will steal my data and i can learn them as i need

full article drops tomorrow with:
> exact routing logic per task type
> the 2 cases where I still pay for American
> the migration playbook anyone can copy in a weekend

VERY IMPORTANT to get migrated now, while it's not too late

433

930

499K

eramax retweeted

Joel - coffee/acc

@JoelDeTeves

2 days ago

OKAY - it seemed like DFlash would be the clear winner. But it appears there have been some improvements with MTP.

With MTP + split-mode = tensor, Qwen3.6-27B gets over 120 tokens/second on dual RTX 3090s (note I am running PCIe x16 on both, I don't have an NVLink bridge). This is absolutely insane - stock, this model gets 20-30 tokens/second on this setup. That is a 4x to 6x increase in speed.

But that's not the end of it - if GGML adds tensor splitting support for DFlash, DFlash could pull ahead once more.

Absolutely wild to see what's happening in the local model space right now. This is the most exciting time to own GPUs!

Here is the "golden" setup:

[Ornstein3.6-27B-MTP-NSC-ACE-SABER] @DJLougen
n-gpu-layers = 999
flash-attn = on
cont-batching = true
jinja = true
no-mmap = true
split-mode = tensor
fit = off
ctx-size = 131072
cache-type-k = q8_0
cache-type-v = q8_0
spec-type = draft-mtp
spec-draft-n-max = 8
temp = 0.7
top-p = 0.95
top-k = 20
chat-template-kwargs = {"preserve_thinking": true}

Wow!!!

eramax retweeted

Grigory Sapunov

@che_shr_cat

2 days ago

1/
We have been treating GPU memory all wrong.

What if the GPU didn't need to store your model at all?

MegaTrain enables full-precision training of 100B+ LLMs on a single GPU by turning VRAM into a transient, stateless cache.

The secret? Inverting the memory hierarchy. 🧵 https://t.co/CXJVbW2By3

136

78K
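A back-of-the-envelope calculation shows why the MegaTrain thread above frames VRAM as a cache rather than a store. The sketch rests on assumptions the tweet does not state: "full precision" is taken to mean FP32 weights and gradients, the optimizer is assumed to be Adam with two FP32 moments per parameter, activations are ignored, and the 80 GiB card and 100-layer split are hypothetical numbers chosen only for illustration.

```python
def training_state_gib(num_params, weight_bytes=4, grad_bytes=4, optimizer_bytes=8):
    """GiB needed for weights, gradients, and Adam moments (all assumed FP32)."""
    bytes_per_param = weight_bytes + grad_bytes + optimizer_bytes
    return num_params * bytes_per_param / 2**30

def per_layer_gib(num_params, num_layers):
    """Working set if only one of num_layers equal-sized layers is resident."""
    return training_state_gib(num_params / num_layers)

def fits_on_gpu(num_params, vram_gib):
    """True if the full training state fits in VRAM at once."""
    return training_state_gib(num_params) <= vram_gib

if __name__ == "__main__":
    params = 100e9   # "100B+" from the tweet
    vram = 80        # hypothetical single-GPU capacity in GiB
    print("full state (GiB):", round(training_state_gib(params)))
    print("fits in VRAM:", fits_on_gpu(params, vram))
    print("one layer of 100 (GiB):", round(per_layer_gib(params, 100), 1))
```

Under these assumptions the full training state is more than an order of magnitude larger than the card, while a single layer's share fits comfortably. This is consistent with keeping the model in host memory and streaming pieces through the GPU, though the thread's actual mechanism may differ.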

eramax retweeted

SaltyAom

@saltyAom

2 days ago

Elysia compilation is getting 18.4x to 55.8x faster

This means Elysia will get much better in serverless environments like Cloudflare Workers, AWS Lambda, and more https://t.co/YinlkfQtHG

399

14K

Ahmed Morsi @eramax

1 day ago

@asmah2107 Thanks for this post
