ぬ

@nkmry_

Mostly AI-related tweets. Lately researching LLM agents. PhD in ML research → new business development in natural language processing at a JTC → CTO at an AI SaaS startup → ?

Tokyo

Joined April 2013

1.6K Following

2K Followers

12.2K Posts

nkmry_ retweeted

Sakana AI

@SakanaAILabs

2 days ago

Fugu stands shoulder-to-shoulder with leading models like Fable and Mythos across the industry's most rigorous engineering, scientific, and reasoning benchmarks.
Read the full blog: https://t.co/2ZJbdWqCUj
Beyond Bigger Models: Why are Orchestration Models the Next Frontier
Progress in AI has been driven largely by giant, monolithic models. But the most powerful systems of the future will be collaborative ecosystems.
Today, this orchestration is no longer just a technical optimization. It has become a geopolitical and operational imperative.
For an organization or a nation, relying on a single company's model for critical infrastructure, finance, or governance is a material vulnerability. This risk is no longer a hypothetical possibility, but a reality.
As we have seen with recent export controls imposed on models like Fable and Mythos, access can disappear overnight.
Collective intelligence is the practical hedge against this concentration of power. Because Fugu orchestrates an underlying pool of swappable agents, it simply routes around vendor restrictions.
By orchestrating the world’s models, we are delivering the resilient blueprint required for true AI sovereignty.

ぬ

@nkmry_

3 days ago

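In Japan, science and technology get lumped together, but they are really separate things, so it is natural that researchers and engineers have different interests. Conversely, in computer science, including AI, building something itself leads to discovering and understanding phenomena, so the boundary between science and technology is blurry: researchers tend to step into implementation and application, and engineers tend to discover research problems.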

Kenn Ejima

@kenn

4 days ago

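A lot of AI researchers are like this, aren't they. The people excited about today's LLMs are hackers more than researchers. Researchers feel that once something like the Transformer has entered a phase where it simply scales, nothing theoretically interesting is left, which means there is no role for them and it is boring. I can sympathize with Yann LeCun badmouthing autoregressive models in general. But it is from the point where things scale that they start to bring about real transformation in society... That said, while the Transformer does not look likely to hit its performance ceiling yet, its energy efficiency is clearly poor compared with biological nervous systems. So when researchers who bring all sorts of ideas borrowed from wet systems ignore LLMs and explore frontiers further out, my reaction is: good, do more of it.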

nkmry_ retweeted

John Jumper

@JohnJumperSci

5 days ago

A bit of news: After nearly 9 years, I have decided to leave Google DeepMind and join Anthropic (after taking some time to recharge). I am incredibly grateful for my time at GDM. @demishassabis took a real chance letting me lead the AlphaFold team just six months after finishing my PhD, and the entire GDM team taught me so much about how to do great science. GDM is a special place, and I’ll still be excited to hear about what amazing things they discover next.

nkmry_ retweeted

OpenAI Developers

@OpenAIDevs

5 days ago

Show Codex a workflow once. Reuse it as a skill. Record & Replay lets you show Codex a recurring task, like filing an expense report or submitting a time-off request. Codex turns that demo into an inspectable, editable skill. You control when recording starts and stops.

nkmry_ retweeted

今井翔太 / Shota Imai@えるエル

@ImAI_Eruel

6 days ago

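The eight then-Google authors of the Transformer paper, the most important piece of research in generative AI, get special treatment and are called things like the "Divine Eight" or the "Transformer Eight". One of them, Noam Shazeer, is famous for writing that the Transformer's success "cannot be explained and is due to divine benevolence".
And now comes the news that this same Shazeer is moving from Google to OpenAI.
Shazeer had in fact left Google once before to found a company called Character.AI, and Google effectively paid several hundred billion yen to bring him back.
I am curious how much money OpenAI put on the table, or whether it has some hidden card that appeals to Shazeer.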

ぬ

@nkmry_

8 days ago

So SpaceX has overtaken Microsoft in market capitalization!

*Walter Bloomberg

@DeItaone

8 days ago

$SPCX - SPACEX OVERTAKES MICROSOFT TO BECOME THE FOURTH BIGGEST COMPANY BY MARKET VALUE

nkmry_ retweeted

Bull Theory

@BullTheoryio

8 days ago

BREAKING: SpaceX has agreed to acquire Cursor, the world's fastest growing software startup, for $60 billion in an all stock deal.
Cursor has over 1 million paying customers, more than $2 billion in annualized revenue, and is projected to hit $6 billion by end of 2026.
At $60 billion, this is the largest software acquisition in history, paying 20 to 30 times Cursor's current revenue.
The deal is subject to regulatory approval and expected to close in Q3 2026.
SpaceX now owns the rockets, the satellites, the AI models, the chips, and is about to own the tool every developer on earth uses to write code.

nkmry_ retweeted

地震速報

@EqAlarm

8 days ago

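[Earthquake Early Warning] As of 19:47:14, report no. 13 (warning)
Time of occurrence: 19:46:33
Epicenter: southern Ibaraki Prefecture
Hypocenter: 36.1N 139.9E, depth 60km
Magnitude: M5.3
Maximum seismic intensity: 5 lower
Predicted seismic intensity: 2.8
Lead time: principal motion has arrived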

nkmry_ retweeted

Anthropic

@AnthropicAI

11 days ago

The US government, citing national security authorities, has issued an export control directive to suspend all access to Fable 5 and Mythos 5 by any foreign national, whether inside or outside the United States, including foreign national Anthropic employees. The net effect of this order is that we must abruptly disable Fable 5 and Mythos 5 for all our customers to ensure compliance. Access to all other Claude models is not affected. We apologize for this disruption to our customers. We believe this is a misunderstanding and are working to restore access as soon as possible. Read our full statement: https://t.co/bwn0sximKZ

nkmry_ retweeted

Dawn Song

@dawnsongtweets

13 days ago

Everyone says the latest AI agents will be "job-ready" soon, especially after the release of Fable 5 this week. But is that really the case? Over the past many months, my group and collaborators have been building Agents' Last Exam (ALE), a benchmark designed to test exactly that claim on real digital labor-market work. My group and collaborators previously have created many of the benchmarks the field runs on, including MMLU, MATH, CyberGym, and ExploitGym. Today, I'm excited to share Agents' Last Exam (ALE): a rolling benchmark that measures whether AI agents can actually perform economically valuable work across a broad range of real-world domains. With ALE, we evaluated Fable 5, GPT-5.5, Composer 2.5, and other frontier agent systems across more than 1,500 expert-sourced tasks spanning 55 occupations. The result is both impressive and sobering. Today's agents can solve a meaningful fraction of professional tasks. But when we look at the hardest tasks, the ones requiring sustained reasoning, deep domain expertise, and reliable execution over long horizons, they are still far from human-level performance. On ALE's hardest tier, every frontier agent we tested, including Fable 5, achieved a 0% success rate. The age of useful agents is here. The age of truly job-ready agents is not. We hope Agents' Last Exam (ALE) will serve as a new guidepost and north star for developing agents capable of reliably performing economically valuable work across a broad range of domains. 🧵

nkmry_ retweeted

Arena.ai

@arena

13 days ago

Claude Fable 5 ranks #1 in Code Arena: Frontend, leading by a wide margin over Opus-4.8.
Highlights:
- #1 in every sub leaderboard: HTML, React
- #1 in every sub category: Brand & Marketing, Reference-Based Design, Data & Analytics, Consumer Product, Gaming, Simulations, and Content Creation Tools.
Huge congrats to @AnthropicAI for this milestone!
The thread breaks down how Claude Fable 5 ranks across single-modality arenas.

nkmry_ retweeted

Alex Albert

@alexalbert__

14 days ago

We've reset usage limits across our products! For those just starting to test Fable, here's four tips for using it more effectively:
1. Give it bigger, more ambitious tasks than what previous models could handle.
2. Use xhigh/high effort as your default for best performance, med for faster interactive sessions.
3. Rework your skills and CLAUDE.mds. Instructions written for prior models anchor Fable to stale patterns, let it use its own judgment first.
4. Move from providing tasks to providing objectives. Describe what done looks like and how to verify it, then let Fable find the path (/loop and /goal are built for this)

nkmry_ retweeted

Andrej Karpathy

@karpathy

14 days ago

This is a super exciting release - Claude Fable 5 is the same underlying model as Mythos but with added safeguards. The benchmarks are great and it's SOTA on everything by a margin but I'll add that *qualitatively* also, this is a major-version-bump-deserving step change forward (imo of the same order as Claude 4.5 was in November), peaking especially for long problem-solving sessions on very difficult problems. You can give it a lot more ambitious tasks than what you're used to, the model "gets it" and it will just go, and it's never felt this tempting to stop looking at the code at all (but don't do this in prod!). The model still has quirks that people will run into and the safeguards are configured to be a little too trigger happy for launch, which can hopefully be tuned over time. I feel a lot of things changing as working software increasingly comes out on a tap. The Jevon's paradox kicks in and I feel my own demand for software growing substantially. You can ask for anything - explainers, visualizers, dashboards, bespoke single-use apps (e.g. a full wandb that is hyper-specific just for your project), you can 10X your test suite, auto-optimize code, run giant research projects with custom HTML for the results, anything! "Free your mind" (Matrix ref). Really looking forward to all the things people build!

nkmry_ retweeted

Claude

@claudeai

15 days ago

Introducing Claude Fable 5: a Mythos-class model that we’ve made safe for general use. Its capabilities exceed those of any model we’ve ever made generally available.

ぬ

@nkmry_

15 days ago

"Civilization advances by extending the number of important operations which we can perform without thinking about them."

nkmry_ retweeted

池谷裕二

@yuji_ikegaya

16 days ago

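[Sleeping while awake!?] Even without sleeping, artificially producing the same brain activity patterns seen during sleep reportedly refreshes the brain as if it had slept (sleep pressure drops and memory recovers). Demonstrated in mice. From today's Nature Neuroscience → https://t.co/WDufcab6ze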

ぬ

@nkmry_

18 days ago

Fools are happier, but a lack of knowledge makes you unhappy. You need to become a well-read fool.

nkmry_ retweeted

Arena.ai

@arena

20 days ago

Introducing Agent Arena: real-world agentic evals at scale.
How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks.
On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide deck, researching the web, building apps, and analyzing documents.
Every session produces rich signals. Users iterate with the agent turn-by-turn: approving, editing, correcting, praise or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more.
Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination.
This leaderboard snapshot is built from 300K+ tasks, 2M+ tool calls, and 40M lines of code by agents.
Top labs in Agent Arena:
- #1 @OpenAI: GPT-5.5 (High)
- #2 @AnthropicAI: Claude-Opus-4.7 (Thinking)
- #3 @Zai_org: GLM-5.1
- #4 @GoogleDeepMind: Gemini-3.1-Pro
- #5 @Kimi_Moonshot: Kimi-K2.6
More analysis in the thread, with the full technical blog below.

nkmry_ retweeted

Anthropic

@AnthropicAI

20 days ago

Our internal data shows Claude is accelerating AI development—a possible path to recursive self-improvement, or AI autonomously building a more capable successor. It’s happening faster than we thought, and the implications deserve greater attention. https://t.co/OVVPJO7VQx

ぬ

@nkmry_

22 days ago

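It is not too late to start thinking about AI PCs once frontier model performance has saturated. Open models will take months to catch up to that performance, and it will take more time still before they can be shrunk, distilled, and quantized down to a size that fits on a PC.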

Macro_Lin ｜市场观察员

@LinQingV

24 days ago

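Why have I always felt that the AI PC is a pseudo-need? The core law of this round of the AI revolution is the scaling law. Intelligence only emerges once a model reaches a certain scale, so truly useful models are necessarily large models and necessarily in the cloud. GPT-5, Claude Opus, DeepSeek V4: their parameter counts run from hundreds of billions to trillions, so which of them can run on-device? Some will say that the N1X with 128GB of unified memory can run a 70B model with Q4 quantization. But being able to run and being good to use are two completely different things. Under a 45-80W power wall, the inference throughput of a 70B Q4 model is severely limited, and the model's capability is itself an order of magnitude behind cloud frontier models. You spend the price of a high-end laptop and get a level of intelligence far below that of a $20-a-month API subscription. What's more, cloud inference costs are still falling fast. DeepSeek has pushed the inference efficiency of open-source models to new heights, and the cost-effectiveness of on-device inference will only look worse and worse.
The 6144 CUDA cores in the N1X take up a considerable amount of die area. On a 3nm process, every square millimeter is real money. If that area went to more CPU cores and larger caches instead, it would be more valuable for how AI is actually used today. Why? Because the mainstream way of using AI has already shifted to agent workflows. Large volumes of tool calls, file I/O, code execution, and environment orchestration are all CPU-intensive tasks. When you run a complex coding agent, your bottleneck is CPU throughput and system I/O, while GPU compute sits idle. The privacy argument does not hold up to serious scrutiny either. Hundreds of millions of people worldwide hand their most private questions to ChatGPT and Claude every day, and no one has switched to local deployment because of it. Enterprise privacy compliance relies on private cloud deployment, and it is not a job for a high-end consumer laptop either. Next to model capability and cost of use, privacy has never ranked among most people's priorities.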
