Bleys Goodson

Verified account

@bleysg

Helping people engineer the future.

SF / LA

Joined March 2009

559 Following

246 Followers

76 Posts

about 2 hours ago

When I ran it, I had to deal with a lot of 429 errors from the MiniMax API. They were likely seeing something similar. If they didn't control for those, that could drag the score down artificially. Also, MiniMax just worked to roughly double their API tok/s, so the model should be able to complete more tasks without timing out and should get a better Avg time score now.

0

0

0

0

57
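
One way to keep 429 (rate-limit) responses from counting as model failures is to retry them with exponential backoff before scoring a task. The sketch below is a minimal illustration, not the harness used in the post: `call_with_backoff` and the `send` callable are hypothetical names, and `send` is assumed to return an HTTP status code and a body.

```python
import random
import time
from typing import Callable, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str, int]:
    """Call `send`, retrying on HTTP 429 with exponential backoff plus jitter.

    `send` is a hypothetical callable that performs one API request and
    returns (status_code, body). Returns (status_code, body, retries_used).
    """
    retries = 0
    while True:
        status, body = send()
        # Stop on any non-429 response, or once the retry budget is spent.
        if status != 429 or retries >= max_retries:
            return status, body, retries
        # Wait base_delay * 2^retries seconds, plus up to 10% random jitter.
        delay = base_delay * (2 ** retries)
        sleep(delay + random.uniform(0, 0.1 * delay))
        retries += 1
```

Logging `retries_used` per task makes it possible to check afterwards whether rate limiting, rather than the model, pushed a run past its time limit.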

about 2 hours ago

@datacurve Every challenge is searchable with readable summaries at the end of the report as well.

bleysg's tweet photo: https://t.co/h7RODHLVmI

0

0

0

1

34

about 3 hours ago

I got a lot of follow-up on my DeepSWE testing of MiniMax M3 asking what it means to be fluent in this eval set. I dug into it. The full report breaks results down by language, task type, complexity, and more, so you can see just how applicable it is to your type of work. https://t.co/HKOcOst4dQ

1

6

0

2

212

about 5 hours ago

@antirez @liuliu https://t.co/Yg34nrL0J4

0

0

0

0

30

about 5 hours ago

@antirez @liuliu The trick to make this fast is https://t.co/xKtXHdVUco

2

0

0

0

163

about 5 hours ago

@antirez @liuliu PowerInfer is the engine that keeps hot weights in RAM while TurboSparse’s learned dReLU activation sparsification reduces the amount of data that needs to stream from SSD by 10x.

0

0

0

0

45

about 5 hours ago

@melvynx FYI, that MiniMax M3 result is a forgery. There are no official DeepSWE results, and those numbers don’t align with my testing. https://t.co/6TvDWwIlkk

2 days ago

Since everyone is asking, I ran DeepSWE on MiniMax M3. Here is the lowdown. 15 of 113 passed! 19 if you count the 1.5x overtime I gave just to see. Full report: https://t.co/RglaGGablq

45

439

38

88

144K

0

0

0

0

190
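
For reference, the "15 of 113" and "19 of 113" counts quoted above convert to pass rates as follows. This is plain arithmetic on the numbers in the post; the helper name `pass_rate` is only illustrative.

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage."""
    return 100.0 * passed / total


TOTAL_TASKS = 113  # number of DeepSWE scenarios stated in the post

print(f"strict time limit:  {pass_rate(15, TOTAL_TASKS):.1f}%")  # 13.3%
print(f"with 1.5x overtime: {pass_rate(19, TOTAL_TASKS):.1f}%")  # 16.8%
```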

1 day ago

@goodworse Wasn't it announced on stage that Gemini 3.5 Pro would ship in June? Odds should be higher.

0

0

0

0

152

1 day ago

@i_beltagy @allen_ai 🤣 it do be like that

0

0

0

0

16

1 day ago

@0xSero @MiniMax_AI @datacurve @theo I ran it basically the canonical way shown here: https://t.co/trQNNCJkVS I used @Modal containers, the same way they do with https://t.co/c4ROgpqkbr There are 113 test scenarios, each taking up to 90 minutes and requiring its own sandbox, so it is a demanding benchmark to run.

1

8

0

5

540
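
To give a sense of why this is demanding, here is a rough worst-case budget from the figures above (113 scenarios, up to 90 minutes each). The number of parallel sandboxes is a hypothetical parameter, not something stated in the post, and the estimate assumes every scenario runs to the full time limit.

```python
import math

SCENARIOS = 113                # from the post
MAX_MINUTES_PER_SCENARIO = 90  # from the post


def worst_case_sandbox_hours(
    scenarios: int = SCENARIOS,
    max_minutes: int = MAX_MINUTES_PER_SCENARIO,
) -> float:
    """Total sandbox time if every scenario uses its full time limit."""
    return scenarios * max_minutes / 60


def worst_case_wall_clock_hours(
    parallel_sandboxes: int,
    scenarios: int = SCENARIOS,
    max_minutes: int = MAX_MINUTES_PER_SCENARIO,
) -> float:
    """Wall-clock time when scenarios run in waves of `parallel_sandboxes`."""
    waves = math.ceil(scenarios / parallel_sandboxes)
    return waves * max_minutes / 60


print(worst_case_sandbox_hours())       # 169.5 sandbox-hours in total
print(worst_case_wall_clock_hours(16))  # 12.0 hours with 16 sandboxes (assumed)
```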

2 days ago

Since everyone is asking, I ran DeepSWE on MiniMax M3. Here is the lowdown. 15 of 113 passed! 19 if you count the 1.5x overtime I gave just to see. Full report: https://t.co/RglaGGablq

bleysg's tweet photo: https://t.co/M97wHmPAzp

MiniMax (official) @MiniMax_AI

3 days ago

Introducing MiniMax M3: The First Open-Weights Model to Combine Three Frontier Capabilities

- Coding & Agentic Frontier: 59.0% SWE-Bench Pro, 66.0% Terminal Bench 2.1, 34.8% SWE-fficiency, 28.8% KernelBench Hard, 74.2% MCP Atlas
- MiniMax Sparse Attention scales context to 1M
- Natively Multimodal from Step Zero

API: https://t.co/fHRdSV7BwZ
Token Plan: https://t.co/BDCycxepZw
🚀New! MiniMax Code: https://t.co/GvB4YiB6Ul

Weights & Tech Report in ~10 Days

528

8K

1K

3K

3M

45

439

38

88

144K

2 days ago

@i_beltagy @allen_ai It also means I remember how much of a desert it was in 2023 and how refreshing Olmo's discipline and openness are.

0

1

0

0

160

2 days ago

@ToNYD2WiLD @MiniMax_AI @datacurve @theo I was probably asking for too much before their unveiling next week, haha.

1

0

0

0

2K

2 days ago

@ariG23498 It's a common pitfall. NVIDIA calls it the "killer microsecond" problem. https://t.co/IiurSiqjNU

1

1

0

0

110

2 days ago

This is in line with my impressions after running it through DeepSWE too. This is the first in a new model series for them, so I think we will need to wait until the 3.5 release to really see the inherent strength of their architecture + post-training discipline. https://t.co/s5Co8db85v

0

4

0

1

605

2 days ago

@NVIDIAAI When you are flop-rich with GB200, Nemotron-style architectures better optimize for the hardware you have available. This may make Nemotron 3 Ultra the best blend of intelligence to tok/s to tok/megawatt of any model available this month.

0

3

0

0

517

2 days ago

There is a backstory to why @NVIDIAAI has stuck to 10% activation throughout the Nemotron 3 series, including the new 550B Ultra model, while most of the industry chases MoE with 3-5% activation. LatentMoE is that story. They argue that effective MoEs should be evaluated along two dimensions: accuracy per FLOP and accuracy per parameter. The race toward 3-5% activation implicitly optimizes only the first. https://t.co/GnT1t3XBoI

1

14

1

7

2K
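
To make the activation percentages above concrete, the sketch below computes how many parameters are active per token for a 550B-parameter MoE at different activation fractions. It is a simplification that treats the activation fraction as the share of all parameters used per token; the function name is illustrative, and real models also have dense components that this ignores.

```python
def active_params_b(total_params_b: float, activation_fraction: float) -> float:
    """Parameters (in billions) active per token under a simple fraction model."""
    return total_params_b * activation_fraction


TOTAL_PARAMS_B = 550  # model size stated in the post

for fraction in (0.03, 0.05, 0.10):
    active = active_params_b(TOTAL_PARAMS_B, fraction)
    print(f"{fraction:.0%} activation -> {active:.1f}B active parameters")

# Under this model, per-token compute scales with active parameters,
# so 10% activation costs about 0.10 / 0.03 = 3.3x the compute of 3%.
print(f"{0.10 / 0.03:.1f}x")
```

At the same total size, the lower activation fraction uses less compute per token, which is the accuracy-per-FLOP axis; accuracy per parameter instead asks how much quality the full 550B of stored weights delivers.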

2 days ago

This is also a story of optimizing your architecture for specific hardware that you understand intimately. Nemotron 3 Ultra is the culmination of targeting flop-rich GB200 Blackwell clusters, with the goal of reaching peak intelligence that can push toward 300+ tok/s efficiently.

0

0

0

0

297

2 days ago

While most of the industry chases MoE with 3-5% activation, NVIDIA has stuck to 10% throughout the Nemotron 3 series, including the new 550B Ultra model. Will be keen to see what they settle on for the next series' sparsity.

2 days ago

nemotron 3 is significantly less sparse than other models (~10% active vs ~3% for kimi K2/deepseek v4)

eliebakouch's tweet photo: https://t.co/lfYFKdVV42

15

186

9

39

29K

1

5

0

0

1K

2 days ago

@_chenglou Excellent! How much do you think this method can be generalized? E.g. for ARC-AGI-3 type problems.

0

0

0

0

18
