The US government, citing national security authorities, has issued an export control directive to suspend all access to Fable 5 and Mythos 5 by any foreign national, whether inside or outside the United States, including foreign national Anthropic employees.
The net effect of this order is that we must abruptly disable Fable 5 and Mythos 5 for all our customers to ensure compliance.
Access to all other Claude models is not affected.
We apologize for this disruption to our customers. We believe this is a misunderstanding and are working to restore access as soon as possible.
Read our full statement: https://t.co/bwn0sximKZ
Everyone says the latest AI agents will be "job-ready" soon, especially after the release of Fable 5 this week. But is that really the case?
For many months, my group and collaborators have been building Agents' Last Exam (ALE), a benchmark designed to test exactly that claim on real digital labor-market work.
My group and collaborators have previously created many of the benchmarks the field runs on, including MMLU, MATH, CyberGym, and ExploitGym. Today, I'm excited to share Agents' Last Exam (ALE): a rolling benchmark that measures whether AI agents can actually perform economically valuable work across a broad range of real-world domains.
With ALE, we evaluated Fable 5, GPT-5.5, Composer 2.5, and other frontier agent systems across more than 1,500 expert-sourced tasks spanning 55 occupations.
The result is both impressive and sobering.
Today's agents can solve a meaningful fraction of professional tasks. But when we look at the hardest tasks, the ones requiring sustained reasoning, deep domain expertise, and reliable execution over long horizons, they are still far from human-level performance.
On ALE's hardest tier, every frontier agent we tested, including Fable 5, achieved a 0% success rate.
The age of useful agents is here.
The age of truly job-ready agents is not.
We hope Agents' Last Exam (ALE) will serve as a new guidepost and north star for developing agents capable of reliably performing economically valuable work across a broad range of domains.
🧵
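Learn about the paradigm shift in AI-driven development from "Human in the Loop (HITL)" to "Human on the Loop (HOTL)", and the governance structure (a separation-of-powers model) that supports it.
- The difference between HITL and HOTL, and the "Harness Engineering" approach to achieving HOTL
- How to design an AI governance structure using a separation-of-powers model (legislative, judicial, executive)
- How to use the Authority Provenance Graph / Specification Provenance Graph for machine-readable rule management and verification
You get concrete hints on organizational design and governance for running AI agents autonomously, which you can apply to your actual development process to balance development speed and quality. The ideas behind SSOT and graph-based knowledge management also apply outside product development.
It is based on practical knowledge from BizReach's real-world use of AI at a scale of 230 billion tokens per month, so it offers not just theory but plenty of concrete checkpoints for organizational change and implementation-level proposals.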
https://t.co/KtbTub605I
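I read John Maeda's "From UX to AX." Here, AX stands for Agentic Experience.
Until now, UX design has been about thinking through screens and user flows, but from here on we are entering an era of designing things like "How much do we delegate to AI?" and "Does a human make the final call?" An interesting perspective.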
https://t.co/YY5Pq3DpNM
My first purchase of SpaceX will be in 10 months.
All IPOs trade in a similar pattern.
Shocking stats:
- Most IPOs drop 50% after going live. Look at $CAVA $RDDT $ALAB $CRWV $CART $CBRS
- Some drop further, by 70-80%. Look at $HOOD $PLTR
- And some never recover: $MBLY $CRCL $KLAR
Recently, we purchased one of each Anthropic/OpenAI subscription plan and ran randomly chosen long-horizon coding tasks until we exhausted the weekly limit. It's widely believed that a $200/month plan maxes out at ~$2000/month worth of tokens (assuming API pricing). However, we found that the subscriptions are actually far more generous. (2/4)