REP HUFFMAN: “The CEO of Trump’s Freedom 250 flew to Davos, stood in front of a room full of foreign governments and asked them ‘how they wanted to shape America’s birthday’. Then they stole hundreds of millions and committed wire fraud.”
Holy shit.
We benchmarked NVIDIA’s new Nemotron 3 Super in two modes, **Thinking Off** and **High Thinking**, across three medical evaluation sets: MedMCQA, MedCaseReasoning, and MedXpertQA.
Thinking Off outperformed High Thinking: 26.4% vs. 25.2% accuracy. The cost gap was much larger than the accuracy gap. High Thinking increased mean latency from 1.13s to 4.43s and mean completion length from 109 tokens to 1,089 tokens. In our setup, the higher-reasoning mode was much slower and more verbose, without improving aggregate results.
The benchmark-level split was more revealing than the overall average. On MedMCQA, accuracy dropped from 56.6% to 49.1% with High Thinking. On MedCaseReasoning, it also declined, from 24.4% to 20.2%.
The only clear gain was on MedXpertQA, where High Thinking improved accuracy from 9.2% to 15.0%. That pattern fits the benchmark design: MedMCQA rewards concise answer selection on constrained multiple-choice questions, while MedXpertQA is harder and more reasoning-intensive, so extra inference budget appears to help more there than on exam-style MCQs.
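To make the size of each gap easy to check, here is a minimal sketch that recomputes the deltas from the figures quoted above. The accuracy, latency, and token numbers come from this thread; the table layout and function names are illustrative assumptions, not part of the evaluation code.

```python
# Figures quoted in the thread, as (Thinking Off, High Thinking).
# The dictionary layout and the function names are illustrative only.
ACCURACY_PCT = {
    "MedMCQA": (56.6, 49.1),
    "MedCaseReasoning": (24.4, 20.2),
    "MedXpertQA": (9.2, 15.0),
    "Overall": (26.4, 25.2),
}
MEAN_LATENCY_S = (1.13, 4.43)
MEAN_COMPLETION_TOKENS = (109, 1089)


def accuracy_deltas(table):
    """High Thinking minus Thinking Off, in percentage points."""
    return {name: round(high - off, 1) for name, (off, high) in table.items()}


def cost_ratio(pair):
    """How many times larger the High Thinking value is than Thinking Off."""
    off, high = pair
    return round(high / off, 2)


if __name__ == "__main__":
    for name, delta in accuracy_deltas(ACCURACY_PCT).items():
        print(f"{name}: {delta:+.1f} points")
    print(f"latency: {cost_ratio(MEAN_LATENCY_S)}x")
    print(f"completion tokens: {cost_ratio(MEAN_COMPLETION_TOKENS)}x")
```

On these numbers, High Thinking costs roughly 4x the latency and 10x the tokens for a 1.2-point drop overall, with MedXpertQA the only benchmark that moves up.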
Across the overlap set, High Thinking improved 166 questions but flipped 182 previously correct answers into incorrect ones, explaining the net regression. Many of these looked like classic overthinking on structured medical multiple-choice items: the non-thinking run selected the correct answer directly, while High Thinking often chose a plausible distractor after longer deliberation.
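The flip accounting can be sketched as a paired comparison over the same questions. This is a minimal illustration, assuming each question has a boolean "correct" flag per mode; the function name and the toy flags are made up, and only the 166 and 182 counts come from the thread.

```python
def count_flips(off_correct, high_correct):
    """Compare per-question correctness for the same questions in both modes.

    Returns (gained, lost): questions High Thinking fixed, and questions it
    turned from correct to incorrect.
    """
    if len(off_correct) != len(high_correct):
        raise ValueError("both runs must cover the same questions")
    gained = sum(1 for off, high in zip(off_correct, high_correct) if not off and high)
    lost = sum(1 for off, high in zip(off_correct, high_correct) if off and not high)
    return gained, lost


if __name__ == "__main__":
    # Toy flags, invented for illustration: 1 gained, 2 lost.
    toy_off = [True, True, False, True, False]
    toy_high = [False, True, True, False, False]
    print(count_flips(toy_off, toy_high))

    # Counts reported in the thread: 166 improved, 182 regressed.
    print(166 - 182)  # net change in correct answers
```

A net of 16 fewer correct answers is what drives the aggregate regression, even though many individual questions improved.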
Our main takeaway: Nemotron 3 Super’s High Thinking mode should not be treated as a universal default.
In this experiment, it looked more like a specialized mode for harder expert synthesis than a general-purpose accuracy booster. For structured medical multiple-choice tasks, Thinking Off was both faster and more accurate. For harder expert-level reasoning tasks, especially those closer to MedXpertQA, additional reasoning showed some benefit.
The practical implication is that reasoning depth should likely be routed by task type rather than enabled globally.
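One way to picture task-type routing is a small lookup in front of the model call. This is a hedged sketch, not how the runs were configured: the task labels, the mode strings, and the fallback choice are all hypothetical, and how the chosen mode is passed to the model depends on the serving API and is left out.

```python
# Hypothetical task labels mapped to a reasoning mode. The mapping mirrors the
# thread's findings: structured multiple-choice did better with thinking off,
# harder expert-level questions benefited from more reasoning.
MODE_BY_TASK = {
    "multiple_choice": "off",
    "expert_reasoning": "high",
}

# Assumption: fall back to the cheaper mode when the task type is unknown.
DEFAULT_MODE = "off"


def choose_thinking_mode(task_type):
    """Pick a reasoning mode per request instead of enabling one globally."""
    return MODE_BY_TASK.get(task_type, DEFAULT_MODE)


if __name__ == "__main__":
    for task in ("multiple_choice", "expert_reasoning", "summarization"):
        print(task, "->", choose_thinking_mode(task))
```

Defaulting to the cheaper mode is a design choice for this sketch; a team that cares more about hard-case accuracy than latency could reasonably flip it.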
We used the @baseten Model API for these runs, and we’re grateful for their support from day one.
We’re also thankful to @NVIDIAAI for its commitment to open source. As a research team that transitioned fully to open-source models this year, we deeply appreciate this level of openness: weights, data, and recipes.
We also expect this model to be especially strong for orchestration and agent-style tasks, which is an area we’re excited to explore further.
For only the second time in our 179-year history, the editors of Scientific American are endorsing a candidate for president. That person is @KamalaHarris. | Editorial https://t.co/dOsFW8BQCn
@NotAttained @willc @urlichsanais @DrinkerOfTears @PaloAltoNtwks It’s two women in a slinky cocktail dress. If they were wearing work attire? Maybe. If it was one guy and one gal in roughly equivalent attire? Maybe. If it was two dudes in something slinky? How many “it’s just a pun” people would be irritated? Not rocket science.
@jab A decent analysis. Would have liked to see their opinion of the size of the effects of “greedflation” and supply chain interruption lag that still continues in some sectors. Consumerism outpaced these other economic forces, but by how much?
@ITSourceress @Cthulhu_Answers Even after years I still bristle when a higher up phrases things a certain way. It sticks with you. Hard not to remain guarded in even the regular 1:1 with a manager.
@ITSourceress @Cthulhu_Answers Still releasing tension from my last layoff. Heads were on the chopping block for months and empty CTO position. Toxic AF. Layoff before that they turned off my admin access the night before while I was working late. Didn’t sleep much that night. Fuels imposter syndrome for sure.
@marileezafari @ITSourceress That’s F’ed up. There is a special place in a very hot place for people who would do that to another. I wish you well in navigating it all.
@Creech Software, since I’ve been doing it professionally for 20+ years. Otherwise, maybe electronics? Some professionally, but mostly as a hobby, so I suppose it might add up to 10k+ hours. 🤔