Sam Snelling @snellingio - Twitter Profile

about 12 hours ago

Here's a baseline quick bench between Anthropic and OpenAI models which I'm now content with. This test runs Warden's stock securitiy-review skill across some known Sentry priors. These are security findings (excluding defense-in-depth) of various levels. I can't explain the changes in Opus 4.7/4.8, but what I can tell you is: its simply not a good product. If a model update breaks usage of a model, I'll use. different model. I'll likely be exploring switching our implementation to run GPT 5.5 (low) for PR review, as that's the right cost tradeoff we're looking for. Generally speaking, the way you can read this is simple: Higher % of known is desirable. Higher total findings is a secondary value add, but cost is the tradeoff. More findings != exclusively more value. It likely means it found more lower sev (or in this case, defense-in-depth-style), that you may not even care to address. In general consider this just an interesting unscientific experiment, one that anecdotally is important but not aimining to be a true benchmark. Read more in the Warden docs: https://t.co/3ooTkswjUp p.s. if you find something wrong in the data please let me know

zeeg's tweet photo. Here's a baseline quick bench between Anthropic and OpenAI models which I'm now content with.

This test runs Warden's stock securitiy-review skill across some known Sentry priors. These are security findings (excluding defense-in-depth) of various levels.

I can't explain the changes in Opus 4.7/4.8, but what I can tell you is: its simply not a good product. If a model update breaks usage of a model, I'll use. different model.

I'll likely be exploring switching our implementation to run GPT 5.5 (low) for PR review, as that's the right cost tradeoff we're looking for.

Generally speaking, the way you can read this is simple:

Higher % of known is desirable. Higher total findings is a secondary value add, but cost is the tradeoff.

More findings != exclusively more value. It likely means it found more lower sev (or in this case, defense-in-depth-style), that you may not even care to address.

In general consider this just an interesting unscientific experiment, one that anecdotally is important but not aimining to be a true benchmark.

Read more in the Warden docs: https://t.co/3ooTkswjUp

p.s. if you find something wrong in the data please let me know

15

61

10

19

7K

Sam Snelling

@snellingio

1 day ago

at 50gp, bottles should be priced in! i’m all for incentivizing recycling… but effectively raising prices with hidden junk fees to cover workers comp for Crumb because you’re not adequately protecting him on bottle recovery is going to make me shop for health potions elsewhere

Nerdcognito

@Nerdcognito

2 days ago

For just 5gp, you too can save a Crumb.

76

31K

2K

1K

608K

0

71

Sam Snelling

@snellingio

1 day ago

“compute is roughly doubling in price” they really buried the lede huh

Ralph J. Smit @ralphjsmit

1 day ago

I really love Laravel Cloud, but am a bit sad about the pricing changes. The free plan is now $5 (with $5 credit, no preview environments), and compute is roughly doubling in price. So for most people it's the same cost or more. So not really cheaper unless you use ultralow RAM

ralphjsmit's tweet photo. I really love Laravel Cloud, but am a bit sad about the pricing changes. The free plan is now $5 (with $5 credit, no preview environments), and compute is roughly doubling in price. So for most people it's the same cost or more.

So not really cheaper unless you use ultralow RAM https://t.co/wLib83FNl0

0

7

1

0

463

0

59

snellingio retweeted

David Cramer

@zeeg

2 days ago

Would it shock anyone if I told you Opus 4.6 performs better than 4.7 and 4.8

88

366

5

36

61K

Who to follow

C.T.

@clichecontrary

//Simple person. With a complicated Mind. || #LongLiveKeith

Rƴan LaBouve

@RyanLaBouve

Christ is King. Train, protect, lead, respond. Security @Alchemy. Products https://t.co/qZKcVSdYfT. SAR @OKCMSAR

Sam Snelling

@snellingio

1 day ago

choice is good, but it’s going to be hard to convince me to give this a chance at today’s component pricing

Max Weinbach

@mweinbach

2 days ago

My view on RTX Spark and Surface Laptop Ultra and new Windows updates at build Microsoft might get me back to Windows. Surface Laptop Ultra is nearly a MacBook Pro with similar specs and performance. The Windows updates solve the annoying stuff with Windows… hmmm

18

360

11

35

31K

0

58

snellingio retweeted

Tim Soulo 🇺🇦

@timsoulo

2 days ago

In the last 6 months at @Ahrefs, we analyzed over 1 billion data points across 14 studies. Here's what we learned about AI search optimization: 1) "Best X" blog listicles are the single most prominent content format cited by AI chatbots. They make up 43.8% of all page types cited by ChatGPT specifically. 2) 67% of ChatGPT's top 1,000 citations come from sources marketers can't influence: Wikipedia (29.7%), homepages (23.8%), app stores (6.6%). Only 32.3% are influenceable content like educational pages, reviews, news, and blog posts. 3) 28.3% of ChatGPT's most-cited pages have zero Google organic visibility. These pages get cited repeatedly by ChatGPT despite not ranking in Google at all. A completely separate discovery layer. 4) ChatGPT only cites about 50% of the URLs it retrieves. It fetches dozens of pages per query but uses half as background context without attribution. This means that being retrieved and being cited are very different things. 5) Adding schema markup had zero meaningful impact on AI citations. AI Overviews actually dipped −4.6%, while AI Mode (+2.4%) and ChatGPT (+2.2%) showed changes indistinguishable from zero. 6) YouTube mentions have the highest correlation (0.737) with AI brand visibility out of all the factors we studied (including all the conventional SEO metrics like backlinks, page count, DR, etc). This held true for both Google-owned and OpenAI products. 7) AI Overviews reduce clicks to the #1 result by 58%. That’s up from 34.5% just 10 months earlier. The trend is accelerating. 8) 99.9% of AI Overviews appear on informational intent queries. Transactional, navigational, and local searches are almost entirely AIO-free. Shopping triggers AIOs just 3.2% of the time. 9) For a given search query, Google’s AI Mode and AI Overviews reach the same conclusions 86% of the time — but cite almost entirely different sources (only 13.7% citation overlap). 10) AI Overviews change every 2.15 days on average, with 70% of content differing between consecutive observations. But semantic similarity stays at 0.95. The words, sources, and entities constantly shuffle, but the actual meaning barely moves.

137

3K

378

7K

952K

snellingio retweeted

Corey Quinn

@QuinnyPig

3 days ago

Hello and welcome to #MicrosoftBuild, friends.

4

281

27

10

12K

snellingio retweeted

ModelScope

@ModelScope2022

3 days ago

Congrats to @PaddlePaddle on the open release of PaddleOCR-VL-1.6! 🚀 An upgraded compact document parsing model hitting a 96.33% score on OmniDocBench v1.6, outperforming top-tier VLMs.🤖 https://t.co/xovFqwPeDW 📊 Benchmark Accuracy: Achieves 96.33% on OmniDocBench v1.6, while setting new performance records on OmniDocBench v1.5 and Real5-OmniDocBench for text, formula, and table recognition. ⚡ Upgraded Capabilities: Built with a region-aware data optimization framework. Brings significant precision leaps in tables, charts, seals, Chinese ancient documents, and rare characters. 🧠 Staged Optimization: Adopts a progressive post-training recipe based on curated data selection and reinforcement learning to enhance supervision signals. 🔄 Seamless Migration: Model architecture is fully compatible with PaddleOCR-VL-1.5 for a zero-cost plug-and-play replacement.

ModelScope2022's tweet photo. Congrats to @PaddlePaddle on the open release of PaddleOCR-VL-1.6! 🚀
An upgraded compact document parsing model hitting a 96.33% score on OmniDocBench v1.6, outperforming top-tier VLMs.🤖
https://t.co/xovFqwPeDW
📊 Benchmark Accuracy: Achieves 96.33% on OmniDocBench v1.6, while setting new performance records on OmniDocBench v1.5 and Real5-OmniDocBench for text, formula, and table recognition.
⚡ Upgraded Capabilities: Built with a region-aware data optimization framework. Brings significant precision leaps in tables, charts, seals, Chinese ancient documents, and rare characters.
🧠 Staged Optimization: Adopts a progressive post-training recipe based on curated data selection and reinforcement learning to enhance supervision signals.
🔄 Seamless Migration: Model architecture is fully compatible with PaddleOCR-VL-1.5 for a zero-cost plug-and-play replacement.

1

70

6

25

4K

snellingio retweeted

Adina Yakup

@AdinaYakup

3 days ago

Keye VL 2.0-30B-A3B 🔥 New multimodal model from @KwaiKeye ✨ 30B/3B active - Apache 2.0 ✨ 256K context via DeepSeek Sparse Attention (probably the first model to ship this in production?👀) ✨ Gets MORE accurate as you feed it more frames ✨ Matchs Qwen3 VL and Gemini 3 Flash on benchmarks

AdinaYakup's tweet photo. Keye VL 2.0-30B-A3B 🔥 New multimodal model from @KwaiKeye

✨ 30B/3B active - Apache 2.0
✨ 256K context via DeepSeek Sparse Attention (probably the first model to ship this in production?👀)
✨ Gets MORE accurate as you feed it more frames
✨ Matchs Qwen3 VL and Gemini 3 Flash on benchmarks

9

257

27

131

14K

snellingio retweeted

JetBrains @jetbrains

3 days ago

Mellum started with code completion. Mellum2 is built for more – handling both natural language and code. A 12B-parameter open-source LLM for routing, RAG, and sub-agents, optimized for ultra-low-latency inference. Now on @huggingface. Learn more: https://t.co/28sG8Ql52L

24

327

64

147

97K

snellingio retweeted

Florian Brand

@xeophon

5 days ago

Hosted evals are finally live!! Smal vid showing how to use them, more to come : )

11

190

24

60

31K

snellingio retweeted

Qwen

@Alibaba_Qwen

3 days ago

👏👏 Introducing Qwen3.7-Plus — a multimodal agent model that unifies vision and language into one versatile agent foundation. ✅ Multimodal interactive hybrid agent: unified GUI & CLI operation across visual and text tasks ✅ Versatile coding agent & productivity assistant with full-modality input ✅ Visual Agent: perception, reasoning, grounding, and search-augmented QA ✅ Cross-harness generalization across diverse agent frameworks One model. Sees, thinks, codes, acts.🙌🙌 Now available via API on Alibaba Cloud Model Studio. Try it — let us know what you build.😎 🔗🔗⬇️⬇️ Blog：https://t.co/pVYf0h3NNa Qwen Studio：https://t.co/HUYgFW4cYf API：https://t.co/viL0cXrMzW

Alibaba_Qwen's tweet photo. 👏👏 Introducing Qwen3.7-Plus — a multimodal agent model that unifies vision and language into one versatile agent foundation.

✅ Multimodal interactive hybrid agent: unified GUI & CLI operation across visual and text tasks
✅ Versatile coding agent & productivity assistant with full-modality input
✅ Visual Agent: perception, reasoning, grounding, and search-augmented QA
✅ Cross-harness generalization across diverse agent frameworks

One model. Sees, thinks, codes, acts.🙌🙌

Now available via API on Alibaba Cloud Model Studio. Try it — let us know what you build.😎

🔗🔗⬇️⬇️
Blog：https://t.co/pVYf0h3NNa
Qwen Studio：https://t.co/HUYgFW4cYf
API：https://t.co/viL0cXrMzW

250

4K

453

700

451K

snellingio retweeted

MiniMax (official) @MiniMax_AI

4 days ago

Introducing MiniMax M3: The First Open-Weights Model to Combine Three Frontier Capabilities - Coding & Agentic Frontier: 59.0% SWE-Bench Pro, 66.0% Terminal Bench 2.1, 34.8% SWE-fficiency, 28.8% KernelBench Hard, 74.2% MCP Atlas - MiniMax Sparse Attention scales context to 1M - Natively Multimodal from Step Zero API: https://t.co/fHRdSV7BwZ Token Plan: https://t.co/BDCycxepZw 🚀New! MiniMax Code: https://t.co/GvB4YiB6Ul Weights & Tech Report in ~10 Days

MiniMax_AI's tweet photo. Introducing MiniMax M3: The First Open-Weights Model to Combine Three Frontier Capabilities

- Coding & Agentic Frontier: 59.0% SWE-Bench Pro, 66.0% Terminal Bench 2.1, 34.8% SWE-fficiency, 28.8% KernelBench Hard, 74.2% MCP Atlas
- MiniMax Sparse Attention scales context to 1M
- Natively Multimodal from Step Zero

API: https://t.co/fHRdSV7BwZ
Token Plan: https://t.co/BDCycxepZw
🚀New! MiniMax Code: https://t.co/GvB4YiB6Ul

Weights & Tech Report in ~10 Days

535

9K

1K

3K

3M

Sam Snelling

@snellingio

5 days ago

@assertchris @cmgmyr something to watch out for: usually tool definitions are put at the very beginning, like before the system prompt. so by turning tools on and oof, you end up breaking the key value cache. this is more expensive and slower as it needs to recompute the entire context.

1

2

0

30

snellingio retweeted

Adam Holter

@AdamHoltererer

7 days ago

"I can quit any time."

13

1K

13

8

67K

snellingio retweeted

StepFun @StepFun_ai

7 days ago

⚡️ Step 3.7 Flash is here: The new frontier is agent efficiency. #1 ClawEval-1.1 (67.1), #1 SimpleVQA Search (79.2), #2 SWE-PRO (56.3), 95.3 on V* Python. Open weights under Apache 2.0. Built for agentic, coding, search, and multimodal workflows — balancing speed, cost, and reliable execution. - 400 TPS. 198B sparse MoE, ~11B active. 256K context, 3 reasoning levels. - Understands UIs, charts, docs, images — then writes code or calls tools to act on what it sees. - Web + visual search reaches further: more sources, deeper follow-up. - Reliable tool use — less drift, fewer broken toolcalls. 98%+ on τ²-bench across all difficulty levels. - Works with Claude Code, KiloCode, Hermes Agent, OpenClaw, and protocols like MCP. - Runs locally on Mac Studio M4 Max, DGX Spark, AMD AI Max+ 395. GitHub: https://t.co/kqlZkVIRHv HuggingFace: https://t.co/qqceCrgPiw GGUF: https://t.co/rR6XrnymWG ModelScope: https://t.co/wney6Tzvqy API: https://t.co/RvHWzRG7Fu Blog: https://t.co/BxDiajiQ5G

StepFun_ai's tweet photo. ⚡️ Step 3.7 Flash is here: The new frontier is agent efficiency.

#1 ClawEval-1.1 (67.1), #1 SimpleVQA Search (79.2), #2 SWE-PRO (56.3), 95.3 on V* Python. Open weights under Apache 2.0.

Built for agentic, coding, search, and multimodal workflows — balancing speed, cost, and reliable execution.

- 400 TPS. 198B sparse MoE, ~11B active. 256K context, 3 reasoning levels.
- Understands UIs, charts, docs, images — then writes code or calls tools to act on what it sees.
- Web + visual search reaches further: more sources, deeper follow-up.
- Reliable tool use — less drift, fewer broken toolcalls. 98%+ on τ²-bench across all difficulty levels.
- Works with Claude Code, KiloCode, Hermes Agent, OpenClaw, and protocols like MCP.
- Runs locally on Mac Studio M4 Max, DGX Spark, AMD AI Max+ 395.

GitHub: https://t.co/kqlZkVIRHv
HuggingFace: https://t.co/qqceCrgPiw
GGUF: https://t.co/rR6XrnymWG
ModelScope: https://t.co/wney6Tzvqy
API: https://t.co/RvHWzRG7Fu
Blog: https://t.co/BxDiajiQ5G

116

2K

208

620

334K

Sam Snelling

@snellingio

7 days ago

man, i get that benchmarks are dumb and bad, but i am not relying on big vibes. give me something, anything, measurable and objective.

ThePrimeagen

@ThePrimeagen

7 days ago

benchmarks are stupid for models Just ignore them

124

2K

55

20

59K

0

1

0

70

Sam Snelling

@snellingio

7 days ago

@zeeg same r/e codex surely you’ll get some feeling on if it’s worth upgrading just based on warden. you don’t have to use it, but you should tweet out an update eventually! even if it’s the teams vibes

0

1

0

96

Sam Snelling

@snellingio

7 days ago

i’m waiting for david’s review on 4.8. after the sentry guidance on 4.7 being a lot more expensive, i moved back to 4.6 and never thought about it again.

David Cramer

@zeeg

7 days ago

i heard theres a new model release today? is it going to help you get to PMF? attract new customers? solve a problem you couldn't solve before? if not, consider why you even care

21

149

9

5

15K

1

7

0

4K

Sam Snelling

@snellingio

7 days ago

haiku 4.5 in shambles

Haider.

@haider1

7 days ago

i'd be happier if anthropic released sonnet 4.8 instead of opus 4.8 sonnet 4.5 really had the best balance like the writing felt natural, the characters reacted in the right way sonnet 4.6 and opus 4.7 is almost useless for creative work opus 4.6 is okay, but it uses too many tokens

39

511

17

28

57K

0

66

Sam Snelling

@snellingio

7 days ago

this model shape is extremely attractive. will have to give it a try

Liquid AI

@liquidai

7 days ago

Today, we're releasing LFM2.5-8B-A1B, a device-optimized model designed to power real-life applications on phones, laptops, PCs, robots, and fast & lightweight server-side use-cases. > 8B MoE, 1.5B active > Expanded 128K context > LFM2.5 flagship hybrid MoE architecture > Trained on 38T tokens + large-scale RL > fast, reliable tool calling, punching above its weight, comparable to models with up to 4x its size > customizable on a single GPU for any specialized task > LFM2 open-weight license 🧵

liquidai's tweet photo. Today, we're releasing LFM2.5-8B-A1B, a device-optimized model designed to power real-life applications on phones, laptops, PCs, robots, and fast & lightweight server-side use-cases.

> 8B MoE, 1.5B active
> Expanded 128K context
> LFM2.5 flagship hybrid MoE architecture
> Trained on 38T tokens + large-scale RL
> fast, reliable tool calling, punching above its weight, comparable to models with up to 4x its size
> customizable on a single GPU for any specialized task
> LFM2 open-weight license

🧵

138

4K

502

3K

1M

0

4

0

456

Sam Snelling

@snellingio

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users