When I ran it I had to deal with a lot of 429 errors from Minimax API. Likely they were seeing similar. If they didn't control for those that could drag the score down artificially. Also, Minimax just worked to roughly double their API tok/s, so the model should be able to complete more tasks without timing out and should get a better Avg time score now.
I got a lot of followup on my DeepSWE testing of Minimax M3 asking what it means to be fluent in this eval set.
I dug into it.
Full report covers breakdown by languages, task types, complexity, and more so you can see just how applicable it is to your type of work.
https://t.co/HKOcOst4dQ
@antirez@liuliu PowerInfer is the engine that keeps hot weights in RAM while TurboSparse’s learned dReLU activation sparsification reduces the amount of data that needs to stream from SSD by 10x.
@melvynx FYI that Minimax M3 result is a forgery. There’s no official DeepSWE results and those numbers don’t align with my testing.
https://t.co/6TvDWwIlkk
Since everyone is asking, I ran DeepSWE on MiniMax M3.
Here is the lowdown. 15 of 113 passed!
19 if you count the 1.5x overtime I gave just to see.
Full report: https://t.co/RglaGGablq
@0xSero@MiniMax_AI@datacurve@theo I ran it basically the canonical way shown here https://t.co/trQNNCJkVS
Using @Modal containers, same way they do with https://t.co/c4ROgpqkbr
There are 113 test scenarios, each taking up to 90 minutes and requiring their own sandbox, so it is a demanding benchmark to run.
Since everyone is asking, I ran DeepSWE on MiniMax M3.
Here is the lowdown. 15 of 113 passed!
19 if you count the 1.5x overtime I gave just to see.
Full report: https://t.co/RglaGGablq
This is in line with my impressions after running it through DeepSWE too. This is the first in a new model series for them, so I think we will need to wait til the 3.5 release to really see what the inherent strength is to their architecture + post-training discipline. https://t.co/s5Co8db85v
@NVIDIAAI When you are flop-rich with GB200, Nemotron-style architectures better optimize for the hardware you have available. This may make Nemotron 3 Ultra the best blend of intelligence to tok/s to tok/megawatt of any model available this month.
There is a backstory to why @NVIDIAAI has stuck to 10% throughout the Nemotron 3 series, including the new 550B Ultra model, while most of the industry chases MoE with 3-5% activation.
LatentMoE is that story. They argue effective MoEs be evaluated by two dimensions: accuracy per FLOP and accuracy per parameter. The race toward 3-5% activation implicitly optimizes only the first.
https://t.co/GnT1t3XBoI
This is also a story of optimizing your architecture to specific hardware which you understand intimately. Nemotron 3 Ultra is the culmination of targeting flop-rich GB200 Blackwell clusters with a goal of reaching peak intelligence which can push towards 300+ tok/s efficiently.
While most of the industry chases MoE with 3-5% activation, NVIDIA has stuck to 10% throughout the Nemotron 3 series, including the new 550B Ultra model. Will be keen to see what they settle on for the next series' sparsity.