Ivek @lord_ivek - Twitter Profile

4 days ago

Agentic knowledge work can take frontier models over 20 minutes per task, as measured in AA-Briefcase, our new benchmark Last week we released AA-Briefcase, our proprietary agentic knowledge work benchmark testing models on long horizon tasks built by industry experts. AA-Briefcase requires models to build deliverables such as financial models, board presentations, and design mock-ups in the context of realistic multi week projects. One of the key metrics we measure in AA-Briefcase is average time per task. This is calculated using evaluation token usage, representative model output speeds, and tool execution time recorded during evaluation. Key time per task takeaways from AA-Briefcase: ➤ Claude Opus 4.8 is the highest-scoring available model, but it is also one of the slowest, taking ~23 minutes per task on average ➤ Several GPT-5.5 reasoning variants lie along the Pareto frontier of AA-Briefcase Elo vs. Time per Task, including medium, high, and xhigh. GPT-5.5 (xhigh) in particular stands out as one of the most efficient top-performing models, using around half the time per task of Opus 4.8 (11 minutes) while ranking top 5 on the overall AA-Briefcase Elo ➤ GLM-5.2 also sits on the Pareto frontier, scoring 1261, ahead of GPT-5.5 (xhigh, 1159) but also taking more time per task (16.3 minutes). It is also the top-performing open weights model on AA-Briefcase, with MiniMax-M3 the next best at 1113 ➤ If Claude Fable 5 were still available, it would likely take around 28.5 minutes per task: while it was live, we measured ~91 output tokens per second, ~3.1 minutes of tool execution time per task, and ~139,000 output tokens per task ➤ Time spent on tool calls and execution accounts for only ~12% of the total time, with the remaining amount explained by output verbosity, turn usage, and inference speed

ArtificialAnlys's tweet photo. Agentic knowledge work can take frontier models over 20 minutes per task, as measured in AA-Briefcase, our new benchmark

Last week we released AA-Briefcase, our proprietary agentic knowledge work benchmark testing models on long horizon tasks built by industry experts. AA-Briefcase requires models to build deliverables such as financial models, board presentations, and design mock-ups in the context of realistic multi week projects.

One of the key metrics we measure in AA-Briefcase is average time per task. This is calculated using evaluation token usage, representative model output speeds, and tool execution time recorded during evaluation.

Key time per task takeaways from AA-Briefcase:

➤ Claude Opus 4.8 is the highest-scoring available model, but it is also one of the slowest, taking ~23 minutes per task on average

➤ Several GPT-5.5 reasoning variants lie along the Pareto frontier of AA-Briefcase Elo vs. Time per Task, including medium, high, and xhigh. GPT-5.5 (xhigh) in particular stands out as one of the most efficient top-performing models, using around half the time per task of Opus 4.8 (11 minutes) while ranking top 5 on the overall AA-Briefcase Elo

➤ GLM-5.2 also sits on the Pareto frontier, scoring 1261, ahead of GPT-5.5 (xhigh, 1159) but also taking more time per task (16.3 minutes). It is also the top-performing open weights model on AA-Briefcase, with MiniMax-M3 the next best at 1113

➤ If Claude Fable 5 were still available, it would likely take around 28.5 minutes per task: while it was live, we measured ~91 output tokens per second, ~3.1 minutes of tool execution time per task, and ~139,000 output tokens per task

➤ Time spent on tool calls and execution accounts for only ~12% of the total time, with the remaining amount explained by output verbosity, turn usage, and inference speed

18

238

17

64

24K

Lord_ivek retweeted

Arena.ai

@arena

6 days ago

Seed 2.1 Pro Preview ranks #8 in Code Arena: Frontend, scoring 1539, on par with Opus 4.6. It performs strongly on React apps and lands in the top 10 for five of seven subcategories. In those areas, only a handful of frontier labs rank above it: @Zai_Org's GLM-5.2 and @AnthropicAI's Claude models. Highlights: - #7 on the React leaderboard, #14 on HTML - #6 Brand & Marketing - #9 Content Creation Tools, Data & Analytics =#10 Reference Based Design, Consumer Product This is an early access preview of the model. It will be publicly available in a few weeks.

39

715

66

114

145K

Lord_ivek retweeted

Enséñame de Ciencia

@EnsedeCiencia

17 days ago

Los árboles no solo producen oxígeno, también ayudan a regular temperaturas de forma natural. 🖤

10

809

247

32

11K

Lord_ivek retweeted

Lisan al Gaib

@scaling01

24 days ago

MiniMax-M3 is now the best open-source model on the AAI It's ahead of Kimi-K2.6, GLM-5.1 and DeepSeek-V4-Pro On the Coding Index it's slightly behind V4-Pro and K2.6, but it's beating them on the Agentic Index CritPt scores are a bit worrisome. But the biggest highlight here is probably its reasoning efficiency compared to other open models. (but still miles behind closed models)

scaling01's tweet photo. MiniMax-M3 is now the best open-source model on the AAI

It's ahead of Kimi-K2.6, GLM-5.1 and DeepSeek-V4-Pro

On the Coding Index it's slightly behind V4-Pro and K2.6, but it's beating them on the Agentic Index

CritPt scores are a bit worrisome.

But the biggest highlight here is probably its reasoning efficiency compared to other open models.
(but still miles behind closed models)

32

493

38

119

41K

Who to follow

alonso

@alonso7asce

Motos, Gaming y Pizza. 🏍️🎮🍕 Papá y Streamer. Viviendo la vida entre la ruta y el control. ⚡

about 2 months ago

@AlvarezMaynez @SEP_mx Lo dicen los padres que tienen a sus hijos en escuelas climatizadas de concreto con todas las amenidades, llevenlos a las escuelas sin ventiladores, y techos de lámina a 35-40°C

0

5

0

165

Lord_ivek retweeted

OpenRouter

@OpenRouter

about 2 months ago

Introducing Pareto Code: a new, free, experimental coding router Set `min_coding_score` in your request and route to the cheapest code-capable model that clears your bar, ranked by @ArtificialAnlys. See the Pareto frontier shifting in real time👇

OpenRouter's tweet photo. Introducing Pareto Code: a new, free, experimental coding router

Set `min_coding_score` in your request and route to the cheapest code-capable model that clears your bar, ranked by @ArtificialAnlys.

See the Pareto frontier shifting in real time👇 https://t.co/26e1HeGOpp

48

1K

73

387

243K

Ivek @Lord_ivek

2 months ago

@jun_song Deepseek si señor

0

17

Lord_ivek retweeted

Daily Magazin

@daily_magazin

2 months ago

Spor ve hava durumu sunucusu Em Rose, canlı yayın öncesi hazırlık anlarıyla sosyal medyada gündem oldu. ▪️Siyah dantelli iddialı sahne kıyafetiyle dikkat çeken Rose’un, yayına saniyeler kala ekip tarafından yapılan son dokunuşları kameralara yansıdı. ▪️Profesyonelliği ve rahat tavırlarıyla beğeni toplayan sunucu, kısa sürede sosyal medyada en çok konuşulan isimlerden biri haline geldi.

196

15K

553

12K

5M

Ivek @Lord_ivek

2 months ago

@platzi Sii, yo lo uso en mi proyecto, las chinitas porque es en un clic ya se comió el millón de tokens jaja, con openai,

0

139

Lord_ivek retweeted

Rafael!

@Rafaelbarreira7

2 months ago

Vai assombrar em outro lugar 😂😂😂

189

6K

847

575

644K

Ivek @Lord_ivek

2 months ago

@ValsAI I'm a Deepseek lover, I'm hoped, I want to do testing

0

119

Ivek @Lord_ivek

2 months ago

@jun_song Usaré gpt 5.5 pero el que más espero para mi proyecto es deepseek si mejoró mucho tendré que hacerle cambios ja

0

250

Lord_ivek retweeted

Artificial Analysis

@ArtificialAnlys

2 months ago

Opus 4.7 (Adaptive Reasoning, Max Effort) cost ~$4,406 to run the Artificial Analysis Intelligence Index, ~11% less than Opus 4.6 (Adaptive Reasoning, Max Effort, ~$4,970) despite scoring 4 points higher. This is driven by lower output token usage, even accounting for Opus 4.7's new tokenizer

ArtificialAnlys's tweet photo. Opus 4.7 (Adaptive Reasoning, Max Effort) cost ~$4,406 to run the Artificial Analysis Intelligence Index, ~11% less than Opus 4.6 (Adaptive Reasoning, Max Effort, ~$4,970) despite scoring 4 points higher. This is driven by lower output token usage, even accounting for Opus 4.7's new tokenizer

1

20

1

3K

Ivek @Lord_ivek

3 months ago

@Marcela252016 Apa lo hiciste con IA o te salió en el cielo, esta perfecto

0

6

Ivek @Lord_ivek

3 months ago

@OpenAIDevs Una app con django

0

3

Lord_ivek retweeted

Noa Gresiva

@NoaGresiva

3 months ago

Vosotros sois muy jóvenes y no lo recordaréis, pero hace aproximadamente 2 años y poco, Israel le disparó directamente a civiles que iban a buscar comida y medicinas a los camiones que conseguían atravesar el BLOQUEO de los colonos en las carreteras. Así es el sionismo.

256

27K

10K

2K

909K

Lord_ivek retweeted

La Catrina Norteña

@catrina_nortena

3 months ago

#ULTIMAHORA 🚨 CONFIRMACIÓN BOMBA: Medios Estadounidenses afirman que el supuesta ataque de Irán a Baréin fue con un Misil Tomahawk de Estados Unidos Se confirma que EU e Israel están haciendo ataques de falsa bandera para involucrar a todos los países arabes ya que solos no pueden. Por favor compartan este video antes de que la Casa Blanca lo mande bajar de las redes...

301

21K

14K

840

628K

Lord_ivek retweeted

Nenedenadie

@nenedenadie

3 months ago

🚨Mátame camión 🤦🏻‍♂️ ⭕️ Nivel medio de algunos jóvenes en EEUU: ➖“¿Qué es un ayatolá?” ➖“¿Venezuela no está en España?” ➖“No sé dónde está… pero la borraría del mapa.” Y esta gente vota, decide guerras y lidera el mundo.

565

15K

4K

3K

1M

Lord_ivek retweeted

Mark Kretschmann

@mark_k

3 months ago

AI models have now reached Mensa-level IQ and beyond, making them smarter than most humans! 🧠 A new chart by Tracking AI ranks the top large language models on a bell curve spanning 50 to 150. Lower scores in the 60-80 range include Qwen 3.5, Mistral Medium 3.1, Bing Copilot, and OpenAI GPT 5.3 variants. The mid-range around 90-110 features Claude 4.6 Opus, Perplexity, Gemini 3.1 previews, and Llama 4 Maverick. Above 120 sit Kimi K2.5, DeepSeek models, and Grok 4 Vision and Thinking from xAI. The highest marks near 145 go to OpenAI’s latest GPT 5.4 Pro Vision and ChatGPT 5 PRO Vision. AI is advancing fast!

mark_k's tweet photo. AI models have now reached Mensa-level IQ and beyond, making them smarter than most humans! 🧠

A new chart by Tracking AI ranks the top large language models on a bell curve spanning 50 to 150.

Lower scores in the 60-80 range include Qwen 3.5, Mistral Medium 3.1, Bing Copilot, and OpenAI GPT 5.3 variants. The mid-range around 90-110 features Claude 4.6 Opus, Perplexity, Gemini 3.1 previews, and Llama 4 Maverick.

Above 120 sit Kimi K2.5, DeepSeek models, and Grok 4 Vision and Thinking from xAI. The highest marks near 145 go to OpenAI’s latest GPT 5.4 Pro Vision and ChatGPT 5 PRO Vision.

AI is advancing fast!

36

200

28

41

11K

Lord_ivek retweeted

Elon Musk

@elonmusk

3 months ago

“Anthropic”

3K

88K

8K

12K

31M

Ivek

@Lord_ivek

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users