Diogo Fernandes @dioogfernands - Twitter Profile

about 21 hours ago

Linked Sasha's excellent video lecture (and tweet by Dwarkesh) at https://t.co/ghXnEtkCaK so more people can better understand how on-policy distillation works

1

167

15

146

12K

dioogfernands retweeted

Tom Dörr

@tom_doerr

2 days ago

Generates realistic handwriting from text inputs https://t.co/I6MB3RdfXk

2

137

19

120

8K

Diogo Fernandes @dioogfernands

1 day ago

RT @PyTorch: DeepSpeed now supports the Muon Optimizer. Optimized specifically for internal 2D weights within neural networks, Muon is gai…

0

2

0

1

dioogfernands retweeted

Niels Rogge @NielsRogge

3 days ago

What is mid-training?

The stage between pre-training and post-training

A base model is continued on a smaller, curated data mixture chosen to strengthen capabilities that the original pre-training run undercovered, such as multilinguality, domain knowledge, or long-context extension.

It usually keeps a pre-training-like objective, but uses higher-quality or more targeted data so later instruction tuning, preference tuning, or RL can shape behavior on top of stronger capabilities.

Learn more here: https://t.co/WhpYkyGlv8

6

446

55

441

32K

dioogfernands retweeted

Tilde

@tilderesearch

3 days ago

https://t.co/rmTk8GMkir

7

360

41

357

87K

dioogfernands retweeted

Mushtaq Bilal, PhD

@MushtaqBilalPhD

5 days ago

Do NOT use Sci-Hub, the evil website that pirated 88M+ research papers.

It also integrates with Zotero.

We must make billion-dollar, for-profit, academic publishers richer.

Below is a step-by-step tutorial on how to add Sci-Hub to Zotero, so you know how to avoid it: https://t.co/WaGJ49LpNy

14

847

181

1K

68K

dioogfernands retweeted

AI Conference DL Countdown @DlCountdown

6 days ago

BMVC'26 (paper): DL today, good luck (2h)!
WACV'27-R1 (reg): 20 days.
WACV'27-R1 (paper): 27 days.
ACCV'26 (reg): 34 days.
ACCV'26 (paper): 36 days.
AAAI'27 (reg): 53 days.
AAAI'27 (paper): 60 days.
WACV'27-R2 (reg): 83 days.
WACV'27-R2 (paper): 90 days.
3DV'27 (paper): 90 days.

0

29

2

7

3K

dioogfernands retweeted

MiniMax (official) @MiniMax_AI

4 days ago

Introducing MiniMax M3: The First Open-Weights Model to Combine Three Frontier Capabilities

- Coding & Agentic Frontier: 59.0% SWE-Bench Pro, 66.0% Terminal Bench 2.1, 34.8% SWE-fficiency, 28.8% KernelBench Hard, 74.2% MCP Atlas
- MiniMax Sparse Attention scales context to 1M
- Natively Multimodal from Step Zero

API: https://t.co/fHRdSV7BwZ
Token Plan: https://t.co/BDCycxepZw
🚀New! MiniMax Code: https://t.co/GvB4YiB6Ul

Weights & Tech Report in ~10 Days

538

9K

1K

3K

4M

dioogfernands retweeted

Chinmay

@ChinmayKak

5 days ago

long overdue website overhaul. check it out, link in the comments:)

10

389

17

358

19K

dioogfernands retweeted

Ronak Malde

@rronak_

6 days ago

🏹 Day 3 of the 5 Days of Trajectory!

We are open sourcing a training stack for continual learning, in collaboration with SkyRL (@NovaSkyAI) and Anyscale (@anyscalecompute)

At Trajectory, our mission is to bring the capability of continual learning to every team and company.

Our contribution today is a multi-tenant, continual LoRA (C-LoRA) training stack that is built for workloads that are repeatedly spinning up and down.

Links to get started below!

2

160

16

73

14K

dioogfernands retweeted

Trajectory

@trajectorylabs

6 days ago

🏹5 Days of Trajectory.

Day 3 - An Open Source Training Stack for Continual Learning

Building the platform for continual learning requires both partnering with pioneering AI companies, as we showed on Day 2 with Harvey, and working toward frontier research, which we are highlighting today.

Continual learning means models that improve hourly from real production use. But with the size of frontier models, this becomes quite difficult. A Qwen-397b would need to spin up and tear down repeatedly across six GPU nodes, and that's valuable time gone.

Our contribution is Continual LoRA (C-LoRA): many lightweight adapters running at once on one shared base model. Our insight centers on where the parallelism lives: instead of splitting one giant job across nodes, we load-balance many small jobs over a single base.

The result: 2.81x experiment throughput over single-tenant training, with no regression on rewards.

We built this together, with @anyscalecompute, @NovaSkyAI, and generous support from @GoogleCloud and @GoogleStartups. We've open-sourced on SkyRL as one of the first multi-LoRA, RL training platforms, so that every team can get to continual learning faster.

We’re very excited to see what you build, please reach out!

11

513

61

395

92K
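The "many lightweight adapters on one shared base model" idea in the post above can be illustrated with a small sketch. This is not Trajectory's or SkyRL's code: the class names (`SharedBase`, `LoRAAdapter`), the tenant labels, and the tiny dimensions are all hypothetical, and only the standard LoRA form (base output plus a low-rank correction `B(Ax)`) is assumed. It shows one frozen weight matrix serving several adapters, each of which can change independently; the load-balancing of jobs described in the post is not modelled.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class SharedBase:
    """Frozen base weight (out_dim x in_dim) shared by every adapter."""

    def __init__(self, weight):
        self.weight = weight


class LoRAAdapter:
    """Hypothetical per-tenant low-rank adapter: delta(x) = B (A x)."""

    def __init__(self, in_dim, out_dim, rank, seed):
        rng = random.Random(seed)
        # A starts small and random, B starts at zero, so a fresh
        # adapter leaves the base model's output unchanged.
        self.a = [[rng.gauss(0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(out_dim)]

    def delta(self, x):
        return matvec(self.b, matvec(self.a, x))


def forward(base, adapter, x):
    """Base output plus the selected adapter's low-rank correction."""
    base_out = matvec(base.weight, x)
    return [y + d for y, d in zip(base_out, adapter.delta(x))]


# One base in memory, two independent adapters on top of it.
base = SharedBase([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
adapters = {
    "tenant_a": LoRAAdapter(3, 2, rank=1, seed=0),
    "tenant_b": LoRAAdapter(3, 2, rank=1, seed=1),
}
x = [1.0, 2.0, 3.0]

# Stand-in for a training step that touches only tenant_a's adapter.
adapters["tenant_a"].b = [[1.0], [0.0]]

out_a = forward(base, adapters["tenant_a"], x)
out_b = forward(base, adapters["tenant_b"], x)
```

Because only the small `a` and `b` matrices differ per tenant, the base weights are loaded once and never modified, which is the property the post relies on when it talks about avoiding repeated spin-up and tear-down of the large model.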

dioogfernands retweeted

Dan Kornas

@DanKornas

6 days ago

Training an LLM from scratch is easier to study when the whole path is in one repo.

Train LLM From Scratch is a PyTorch repository for learning how a transformer language model is built, trained, saved, and used for text generation.

It helps you move from “I understand attention on paper” to a runnable training pipeline by pairing model code with data download, preprocessing, config, training, and generation scripts.

Key features:

• Transformer components from scratch – separate PyTorch modules for MLP, attention, transformer blocks, and the final model
• Pile-based data path – scripts download The Pile files and preprocess JSONL.ZST text into tokenized HDF5 datasets
• Configurable training setup – model size, context length, heads, blocks, batch size, learning rate, and file paths live in https://t.co/zuPqaR3MhP
• Hardware guidance – README compares common GPUs for 13M and 2B-class training runs
• Generation workflow included – generate_text.py loads trained checkpoints and produces sample text outputs

It’s open-source (MIT license).

Link in the reply 👇

16

1K

201

2K

44K

dioogfernands retweeted

Liquid AI

@liquidai

6 days ago

fine-tune LFM2.5-8B-A1B for your tasks and let it cook! 🧑‍🍳 https://t.co/0Ll3J21LTy

17

438

32

224

37K

dioogfernands retweeted

NVIDIA AI

@NVIDIAAI

7 days ago

Step 3.7 Flash is here

ICYMI: 198B MoE with 11B active params, 256K context, native image + video support.

Day 0 support is live on https://t.co/6T0R9P778k with GPU-accelerated endpoints, deploy with NVIDIA NIM inference microservices, and fine-tune with the NVIDIA NeMo framework.

Congrats to the @stepfun_ai team!

18

486

47

120

45K

dioogfernands retweeted

NVIDIA AI

@NVIDIAAI

7 days ago

This is a great read on post-training and open models. @harvey & @trajectorylabs post-trained Nemotron 3 Super on complex legal tasks with some very impressive initial results. All with auditable weights, real security, and clear provenance.

16

252

37

105

30K

dioogfernands retweeted

Samuel Schmidgall

@SRSchmidgall

7 days ago

Our posting for joining Google DeepMind as a Research Scientist was down for a few days but now it is back up! Apply here: https://t.co/Yk5iMbMQPu And fill out this form: https://t.co/zdeqryH3hB

7

361

34

341

71K

dioogfernands retweeted

Lucas Maes

@lucasmaes_

8 days ago

Would you like to join the research effort on JEPA and World Models easily?

After a full year of hard work, we’re excited to finally release stable-worldmodel: an open-source, scalable platform built to accelerate JEPA & World Model research!

📄: https://t.co/gnxGvens5A

38

2K

270

2K

111K

dioogfernands retweeted

Liquid AI

@liquidai

8 days ago

Today, we're releasing LFM2.5-8B-A1B, a device-optimized model designed to power real-life applications on phones, laptops, PCs, robots, and fast & lightweight server-side use-cases.

> 8B MoE, 1.5B active
> Expanded 128K context
> LFM2.5 flagship hybrid MoE architecture
> Trained on 38T tokens + large-scale RL
> fast, reliable tool calling, punching above its weight, comparable to models with up to 4x its size
> customizable on a single GPU for any specialized task
> LFM2 open-weight license

🧵

138

4K

500

3K

1M

dioogfernands retweeted

Zhuang Liu

@liuzhuang1234

10 days ago

This is something I have wanted to study ever since I read the original knowledge distillation paper in 2015. Finally we've done it: with @TaiMingLu, we thoroughly study the necessity of distillation from a "stronger" teacher

4

219

18

157

41K

dioogfernands retweeted

Yi Jing @yi_jing04

9 days ago

(1/6) Interpretability research is often accused of being insightful but not actionable. We ask a different question: can SAE representations directly guide LLM post-training data engineering? Paper: https://t.co/pVmDmS5Sg7 🧵👇

4

178

18

166

12K
