J3m5Dev

@J3m5Dev

Building websites and tiny humans. Web developer, Node/Vue/Nuxt/Typescript enthusiast, and father of two.

Joined March 2017

1.7K Following

64 Followers

564 Posts

J3m5Dev @J3m5Dev

about 11 hours ago

@lethal_ai AI video based on a real one

J3m5Dev retweeted

Cloudflare @Cloudflare

4 days ago

VoidZero, the team behind Vite, Vitest, Rolldown, Oxc, and Vite+, is joining Cloudflare. Vite stays open source, vendor-agnostic, and built for everyone. https://t.co/DJTpX4Q9Xt

374

324

629K

J3m5Dev @J3m5Dev

3 days ago

@localhost_5173 Prompt injection?

J3m5Dev retweeted

Christoph Nakazawa

@cnakazawa

4 days ago

I’ve been dumping on OpenAI with low effort meme tweets that get too many views, but Codex is the best DevX acceleration product of all time and I wrote about it here: https://t.co/ABnaXqL45x

229

192

36K

4 days ago

@xmrafonso @Atinux I mean using other front-end frameworks than Vue in Nuxt, something like Vike

J3m5Dev retweeted

am.will

@LLMJunky

4 days ago

Codex App for Linux now supports Remote Control! Sorry it took so long, been a bit busy. Update to latest version to get started. Your move @ajambrosino 😏

143

12K

J3m5Dev retweeted

Christoph Nakazawa

@cnakazawa

5 days ago

New Post: Wrote about my latest LLM Workflow & Modern Engineering Values: https://t.co/820c034Bgt

268

371

36K

J3m5Dev retweeted

Vals AI

@ValsAI

5 days ago

ProgramBench is now live on the Vals site! Opus 4.8 is the first model to fully solve 2 tasks, but this comes at an extremely high cost.

193

66K

J3m5Dev retweeted

@nrehiew_

5 days ago

Super detailed tech report for MAI-Thinking-1, with a ton of info on all stages of the pipeline. I'm surprised so much of this info is released :) Super long thread on my notes:

158

115

21K

J3m5Dev retweeted

elie

@eliebakouch

5 days ago

microsoft MAI tech report is a gold mine, one of the most transparent for a model at this scale.

this model uses zero synthetic data or distillation from previous models. this means reasoning, agentic behavior, tool use are all learned fully during post-training with no cold start. bold choice that makes it harder and requires more iterations to reach sota, but you get FULL control over your model series and it proves they are serious about being a frontier lab.

the tech report is insanely detailed and precise about numbers. to give an example, they give the exact MFU across all the iterations of the model, with the exact changes etc. they also share the full scaling ladder recipe, to my knowledge this is the first time i've seen this in a tech report at this scale

let's look at all of this in this likely very long thread 🧵

264

276K

J3m5Dev @J3m5Dev

5 days ago

@thsottiaux Now we need $100-$200 business seats to use all these goodies, Tibo!

J3m5Dev @J3m5Dev

6 days ago

@kcosr @jxnlco @scaling01 You already can https://t.co/Yaf3rtAEOT

J3m5Dev retweeted

Bleys Goodson

@bleysg

6 days ago

Since everyone is asking, I ran DeepSWE on MiniMax M3. Here is the lowdown. 15 of 113 passed! 19 if you count the 1.5x overtime I gave just to see. Full report: https://t.co/RglaGGablq

446

146K

J3m5Dev retweeted

Justin Schroeder

@jpschroeder

6 days ago

Announcing: API for Cursor – use Composer 2.5 with any harness. An open source macOS app that exposes an API for Cursor's models. Instantly use Composer 2.5 in Codex, OpenCode, Cline etc... ➡️ https://t.co/SeoJtR5T2l

770

85K

J3m5Dev retweeted

Qwen

@Alibaba_Qwen

6 days ago

👏👏 Introducing Qwen3.7-Plus — a multimodal agent model that unifies vision and language into one versatile agent foundation.

✅ Multimodal interactive hybrid agent: unified GUI & CLI operation across visual and text tasks
✅ Versatile coding agent & productivity assistant with full-modality input
✅ Visual Agent: perception, reasoning, grounding, and search-augmented QA
✅ Cross-harness generalization across diverse agent frameworks

One model. Sees, thinks, codes, acts.🙌🙌

Now available via API on Alibaba Cloud Model Studio. Try it — let us know what you build.😎

🔗🔗⬇️⬇️
Blog：https://t.co/pVYf0h3NNa
Qwen Studio：https://t.co/HUYgFW4cYf
API：https://t.co/viL0cXrMzW

253

454

702

462K

J3m5Dev retweeted

Maxi | Formula-100 @_onmax

8 days ago

Shipping vite-doctor! A diagnostics tool for Vite, Vue, Nuxt and Nitro projects, built for the framework-specific issues AI agents keep missing. Stop pushing AI slop! https://t.co/Gz7hgs1BBw

J3m5Dev retweeted

MiniMax (official) @MiniMax_AI

7 days ago

Introducing MiniMax M3: The First Open-Weights Model to Combine Three Frontier Capabilities

- Coding & Agentic Frontier: 59.0% SWE-Bench Pro, 66.0% Terminal Bench 2.1, 34.8% SWE-fficiency, 28.8% KernelBench Hard, 74.2% MCP Atlas
- MiniMax Sparse Attention scales context to 1M
- Natively Multimodal from Step Zero

API: https://t.co/fHRdSV7BwZ
Token Plan: https://t.co/BDCycxepZw
🚀New! MiniMax Code: https://t.co/GvB4YiB6Ul

Weights & Tech Report in ~10 Days

544

10K

J3m5Dev retweeted

Ibragim

@ibragim_bad

7 days ago

📊 More insights on GPT-5.5 vs Opus 4.8 based on SWE-rebench runs

TL;DR: Opus 4.8 became much more token-efficient than 4.6, but GPT-5.5 is still the most efficient: more solved tasks, fewer tokens, fewer steps.

🏆 SWE-rebench is a live benchmark with fresh SWE tasks (issue+PR) from GitHub.

Detailed table of the results and the leaderboard link are in the thread.

Findings:
> GPT-5.5 medium looks noticeably more efficient than Opus 4.8 high, if we compare the default reasoning-effort modes for both models.

> Opus really became much more optimized from 4.6 → 4.8 on high: more solved tasks, 45% fewer tokens per task, and around 39% lower cost/problem.

> Opus 4.8 high is almost not better than Opus 4.7 high by score, but it is much cheaper in compute. Tokens/task went down 1.53M → 1.01M, and steps went down 43.7 → 34.2.

> GPT-5.5 medium also became more token-efficient than GPT-5.4 medium, but more expensive because the base pricing increased.
Tokens per task went down by 15%, score increased, but the cost of solving a task increased by 63% while base pricing increased 2x.
Another useful metric, when you have several runs, is pass^5. Here we count a task only if it was solved in all 5 runs.
For GPT-5.5 medium, pass@5 almost did not change compared to GPT-5.4 medium: 77 vs 78.

> But pass^5 increased a lot: 51 vs 39! This means GPT-5.5 medium solves tasks “randomly once” less often, and much more often solves the same task consistently in all 5 runs.
For Opus, this number is almost the same between model versions, but it changes a lot depending on reasoning mode: high → xhigh.

> Many people ask why GPT-5.5 xhigh gets a higher score than medium, or why one model beats another on these tasks. On the surface, it looks like one model solved the task and another did not. But usually it is not a full failure. Very often the model gets to an almost correct solution, but misses some edge cases or corner cases covered by tests.

In xhigh reasoning, GPT makes many more steps to explore the repository and more actively tests its own solutions, including writing additional tests. This helps to catch these corner cases, but the price is high.
GPT-5.5 medium → xhigh: 58.9% → 62.7% pass@1, $0.98 → $2.25

> GLM 5.1 looks competitive by pass@5, but it has a very heavy trajectory. So I believe that it could be RL’ed to get even better results in terms of pass@1 and more efficient token count, like Composer 2.5, for example.

P.S.
Please write if you have questions, or hypotheses we should check on the trajectories. We are working on releasing all the trajectories, so you can do some analysis on them as well.
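
The difference between pass@5 and pass^5 described in the post above can be illustrated with a small sketch. It assumes the simple counting reading given in the post (pass@5 counts a task solved in at least one of 5 runs, pass^5 counts a task only if it was solved in all 5). The function names, task ids, and run outcomes below are hypothetical and are not SWE-rebench data or its actual scoring code.

```python
from typing import Dict, List


def pass_at_k(runs: Dict[str, List[bool]]) -> int:
    """Count tasks solved in at least one of the recorded runs."""
    return sum(1 for results in runs.values() if any(results))


def pass_hat_k(runs: Dict[str, List[bool]]) -> int:
    """Count tasks solved in every recorded run (the pass^k reading)."""
    return sum(1 for results in runs.values() if results and all(results))


# Hypothetical outcomes: task id -> result of each of 5 independent runs.
example_runs = {
    "task-a": [True, True, True, True, True],       # solved consistently
    "task-b": [False, True, False, False, False],   # solved "randomly once"
    "task-c": [False, False, False, False, False],  # never solved
    "task-d": [True, True, True, True, False],      # one miss breaks pass^5
}

print(pass_at_k(example_runs))   # 3
print(pass_hat_k(example_runs))  # 1
```

Under this reading, a model can keep its pass@5 roughly flat while raising pass^5, which is the pattern the post reports for GPT-5.5 medium versus GPT-5.4 medium.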

J3m5Dev retweeted

Ryan Shea

@ryaneshea

8 days ago

As I dig in more, I'm finding better ways of sizing up Opus 4.8 vs GPT-5.5...

The standard way of launching AI models by conveying their performance with benchmark comparison cards is severely lacking and doesn't tell the whole story.

You see, with every model launch, AI labs cherry-pick benchmarks, highlight the ones where their model exceeds the flagship models of their main competitors, and design this set so that they look good. But these comparisons leave out a lot of key information.

So I put together a more expansive and accurate way of comparing Opus 4.8 and GPT-5.5, across coding, reasoning, computer use, and reliability.

You'll see that Opus 4.8 is noticeably better on some metrics like SWE-Bench Pro (software engineering), OSWorld-Verified (computer navigation), and AA-Omniscience (anti-hallucination), while GPT-5.5 is noticeably better on other metrics like SWE-rebench (software engineering), CritPt (physics reasoning), and IFBench (instruction following).

All in all, Opus 4.8 and GPT-5.5 look like pretty comparable models. They're both state of the art, they are both fantastic at most tasks, and they each have their own strengths and weaknesses.

I'll be incorporating some of the benchmarks from this analysis into AI IQ (a site that estimates the IQs of popular AI models), and these updates should provide a more expansive and accurate picture of model performance.

Also, I'll share additional charts below. Let me know if there's anything I'm missing or anything else that would be helpful to include alongside what I featured.
