Austin Baggio @AustinBaggio - Twitter Profile

Austin Baggio

@AustinBaggio

4 days ago

@dcarps14 Thanks Dan! Would love to hear what you think.

0

1

0

16

Austin Baggio

@AustinBaggio

4 days ago

Ensue Research Lab now in early access. Most product teams that want a custom model never get one. Our swarm of agents fixes that. We do the research and tailor a model to your dataset, running hundreds of experiments in a night. Try it free: https://t.co/Sqpn9AJXux

2

3

2

0

200

Austin Baggio

@AustinBaggio

about 1 month ago

@art_zucker @deepseek_ai It's crazy how the time delay between open model SOTA and frontier is continuously shrinking

0

6

0

2K

AustinBaggio retweeted

Sai Vegasena

@svegas18

about 1 month ago

First DeepSeek V4-Flash-Base quant! https://t.co/BfKeATNXVI One of the @ensue_ai research agents worked (mostly) autonomously on 4×H100s with 320GB of total VRAM in 80+ experiments. All quality and perf metrics are on The Hub!

0

6

5

2

902

Austin Baggio

@AustinBaggio

about 1 month ago

The velocity of improvements to open source models is incredible. Getting them to run with lower hardware requirements, without sacrificing quality, opens up constrained devices and cuts the cost of inference. Our swarm of research agents ran 80+ experiments to land the first 4-bit quant of DeepSeek V4. What model should we do next?

ensue

@ensue_ai

about 1 month ago

First 4-bit quant of DeepSeek V4-Flash-Base. 284B params in 157 GiB at full FP8 speed. Beats Q4_K_M. Bit-exact reproducible with all metrics on the Hub. https://t.co/Cuh60Yq0gn

0

12

3

2

3K

0

7

4

2

673
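
As a rough sanity check on the "284B params in 157 GiB" figure quoted above, the average storage cost per parameter can be worked out directly. This is a back-of-the-envelope sketch, not anything from the linked release: it assumes "284B" means 284 × 10^9 parameters and that the 157 GiB covers the whole checkpoint. The helper name `bits_per_param` is made up for illustration.

```python
GIB = 2**30  # bytes in one GiB

def bits_per_param(size_gib: float, n_params: float) -> float:
    """Average storage bits per parameter for a checkpoint of the given size."""
    return size_gib * GIB * 8 / n_params

# Assumption: 284B = 284e9 parameters, 157 GiB = full checkpoint size.
avg_bits = bits_per_param(157, 284e9)
print(f"{avg_bits:.2f} bits per parameter")
```

The result comes out a little under 5 bits per parameter, which would be consistent with 4-bit weights plus some overhead (scales, or tensors kept at higher precision), though the tweet itself does not give that breakdown.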

Austin Baggio

@AustinBaggio

about 2 months ago

Can I get an updated bear case on OS models, please? Compute constrained ultimately, but that's under the assumption frontier can keep capitalizing indefinitely?

DeepSeek

@deepseek_ai

about 2 months ago

🚀 DeepSeek-V4 Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length.

🔹 DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models.
🔹 DeepSeek-V4-Flash: 284B total / 13B active params. Your fast, efficient, and economical choice.

Try it now at https://t.co/GCdiMzk1Dl via Expert Mode / Instant Mode. API is updated & available today!

📄 Tech Report: https://t.co/drlDrxkYtp
🤗 Open Weights: https://t.co/T13Y8i7SDM

1/n

2K

46K

8K

10K

10M

0

1

0

99

Austin Baggio

@AustinBaggio

about 2 months ago

@julien_c I'll drive

0

1

0

27

Austin Baggio

@AustinBaggio

about 2 months ago

Breakthroughs are optional.

Christine Yip

@christinetyip

about 2 months ago

https://t.co/RKkOtdV10u

0

26

6

23

6K

1

4

2

707

Austin Baggio

@AustinBaggio

about 2 months ago

@pumatheuma Yeah my exact reaction too

0

177

AustinBaggio retweeted

Machine Learning (ML) Papers

@Memoirs

about 2 months ago

Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon

Sai Vegasena
https://t.co/ny5lBTH7Gy [𝚌𝚜.𝙻𝙶]
💬Code: https://t.co/6AOq3zZuNS

1

6

5

2

409

Austin Baggio

@AustinBaggio

about 2 months ago

@Memoirs Author @svegas18

0

1

0

41

AustinBaggio retweeted

Christine Yip

@christinetyip

about 2 months ago

Side-effect of doing research with an agent swarm: @svegas18 uncovered a subtle quantization failure mode while optimizing memory efficiency for 70B models. Full paper below.

0

5

3

710

Austin Baggio

@AustinBaggio

about 2 months ago

@omarsar0 @ClementDelangue That’s part of it certainly, but the search space is really important, and agents are going to be increasingly good at defining the search space and knowing when to change it semi-autonomously

0

147

Austin Baggio

@AustinBaggio

about 2 months ago

@ClementDelangue @AvbNear Open. Source. Wins.

0

45

AustinBaggio retweeted

Sai Vegasena

@svegas18

about 2 months ago

ran llama 3.1 70B at 128K context on a 64GB Mac with turboquant

- fused int4 attention kernel
- no temp matrices, all registers
- 48x faster than stock at long context
- tested ~330 experiments to get here

first paper from me + my agent lab @ensue_dev https://t.co/NRLazWN2xo

gemma4 31B: https://t.co/uLB5lp2kFF
llama3.1 70B: https://t.co/a8V7qadZIA
https://t.co/nisp4v9ZnB

1

7

5

1

711

Austin Baggio

@AustinBaggio

about 2 months ago

Yesterday, Llama 3.1 70B at 128K context on a single 64GB Mac wasn't possible. Today it is. KV cache compressed from 40GB to 12.5GB. 48x faster than the standard dequantize-then-attend path. Ensue Research just dropped its first paper. Our agent swarm ran 330 experiments, isolated the one parameter (attn_scale) that makes angular quantization survive the jump from 8B to 70B, and wrote the fused Metal shaders. Breakthroughs are now optional.

ensue

@ensue_ai

about 2 months ago

Open-TQ-Metal: we found a single parameter breaking quantization - fixing it unlocked:

- 48x faster attention at 128K context
- Llama 3.1 70B at full 128K on a single 64GB Mac

Extends TurboQuant beyond CUDA (8B) → 70B on Apple Silicon.

Full paper + write-up + implementation ↓

5

36

12

32

5K

2

15

7

2

859
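
The "40GB to 12.5GB" KV cache figures above can be reproduced with simple arithmetic. The sketch below assumes the publicly documented Llama 3.1 70B shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128), an fp16 baseline, "128K" meaning 131,072 tokens, and that the tweet's "GB" is binary (GiB). None of these assumptions are stated in the tweet, and `kv_cache_bytes` is a hypothetical helper, not code from the paper.

```python
GIB = 2**30

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bits_per_element: int) -> int:
    """Bytes needed to store keys and values for every layer and token."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_tokens  # 2 = K and V
    return elements * bits_per_element // 8

# Assumed Llama 3.1 70B shape at 128K context, fp16 baseline.
fp16_bytes = kv_cache_bytes(80, 8, 128, 128 * 1024, 16)
print(fp16_bytes / GIB)  # size of the uncompressed cache in GiB

# Effective bits per cached element implied by a 12.5 GiB compressed cache.
elements = 2 * 80 * 8 * 128 * 128 * 1024
effective_bits = 12.5 * GIB * 8 / elements
print(effective_bits)
```

Under these assumptions the fp16 cache is exactly 40 GiB, and 12.5 GiB works out to 5 bits per cached element. That would fit 4-bit codes plus per-group metadata, but the exact layout is an inference here, not something the tweet specifies.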

Austin Baggio

@AustinBaggio

about 2 months ago

Why does editing an agent's soul.md feel so invasive

1

0

66

Austin Baggio

@AustinBaggio

about 2 months ago

@ClementDelangue Do you look for a metric when you compare harnesses? We've been noticing really good results optimizing kernels for specific hardware, assuming you care about token throughput?

0

290

AustinBaggio retweeted

Chester

@chesterzelaya

about 2 months ago

the male equivalent to flowers is probably an RTX6000 Pro Blackwell Workstation

70

4K

433

180

123K

Austin Baggio

@AustinBaggio

about 2 months ago

What's incredible is the breadth of discovery that the agents uncover. Finding that an ICLR paper's quantization method breaks on learned attention scaling, and then pivoting to build a fused GPU kernel that eliminates the bottleneck entirely, takes deep domain expertise. Doing it at this rate is only possible with an agent swarm.

Sai Vegasena

@svegas18

about 2 months ago

My research agents implemented @GoogleDeepMind's TurboQuant (https://t.co/dH5cSEzGuO): full PolarQuant, QJL, 10 Metal compute shaders, the whole paper, for Gemma 4 31B on a single 64GB 2021 MacBook Pro. Turns out it doesn't work on this architecture ...

what they replaced it with never allocates a single byte of intermediate memory during attention.

5 custom Metal compute shaders ft:
- fused int4 SDPA (dequantize in GPU registers)
- online softmax with zero temporaries
- dual-strategy parallelism (D=256 sliding, D=512 global)
- bit-mask nibble extraction (MLX qdot pattern)

177 experiments run autonomously by my swarm over a weekend, coordinated through @ensue_ai

1

9

4

2

717

0

3

1

0

179
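
For readers unfamiliar with the "online softmax with zero temporaries" item in the quoted tweet, here is a minimal sketch of the general technique in plain Python. It is not the Metal shader from the thread. It uses scalar values instead of value vectors for brevity, assumes finite scores, and the function names are made up for illustration. The point is that a softmax-weighted sum can be computed in one pass while keeping only a running maximum, a running normaliser, and an accumulator, so the full list of weights is never materialised.

```python
import math

def online_softmax_weighted_sum(scores, values):
    """Single-pass softmax(scores) . values using O(1) extra state."""
    running_max = -math.inf
    denom = 0.0   # running sum of exp(score - running_max)
    acc = 0.0     # running sum of exp(score - running_max) * value
    for s, v in zip(scores, values):
        new_max = max(running_max, s)
        # Rescale earlier contributions to the new maximum.
        # On the first step this is exp(-inf) = 0.0, which is harmless.
        scale = math.exp(running_max - new_max)
        w = math.exp(s - new_max)
        denom = denom * scale + w
        acc = acc * scale + w * v
        running_max = new_max
    return acc / denom

def reference_weighted_sum(scores, values):
    """Two-pass reference that materialises every softmax weight."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

scores = [1.0, 3.0, -2.0, 0.5]
values = [10.0, 20.0, 30.0, 40.0]
online = online_softmax_weighted_sum(scores, values)
reference = reference_weighted_sum(scores, values)
```

Both functions should agree up to floating-point rounding. In a fused attention kernel the same recurrence is typically applied per query with the state held in registers, which is presumably what "zero temporaries" refers to.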
