Happy CNY! We are glad to introduce our latest language model, Seed-2.0. We have made great progress (agent, reasoning, vision understanding, etc.) since Seed-1.8, without any distillation.
Right now it's only available in CN, and it will soon be ready globally.
https://t.co/ghA2d8uvLy
@Oli82817545 We did not add GUI data in post-training to avoid potential abuse of this capability. We are still exploring better ways to bring GUI capabilities to users.
Doubao becomes 1st Chinese AI app to reach 100m DAU (only counting China).
Volcano Engine recently reported that Doubao LLM token consumption has grown to 50T/day (3x the May figure), so its text, image & video models are all very popular.
Proud to introduce Seed1.8, our latest generalized agent model
The model achieves competitive agentic capabilities while maintaining high LLM/VLM scores. Enjoy!
https://t.co/VZAdh5n5fP
I just told my phone: "Play wordle and win."
It opened the app for the first time, read the rules, guessed a word, waited for an ad to play, and got the word in three attempts. Insane!
Another DeepSeek moment. This is the world’s first actual smart phone. It’s an engineering prototype of ZTE’s Nubia M153 running ByteDance’s Doubao AI agent fused into Android at the OS level. It has complete control over the phone. It can see the UI, choose/download apps, tap/type, call, and run multi-step task chains.
Here I just say (in English) “find someone to wait in line for me” (something you can do in China), and it picks which app to open, configures the job, and hands me one confirm screen. I wouldn’t otherwise know how to do this, and here the phone just did it in a matter of seconds.
🚀Introducing Lumine, a generalist AI agent trained within Genshin Impact that can perceive, reason, and act in real time, completing hours-long missions and following diverse instructions within complex 3D open-world environments.🎮
Website: https://t.co/UxSwNKGZml
1/6
OSWorld remains one of the best open-source evals
Real-world envs allow far more reward hacking, e.g., Claude 4.5 Sonnet often uses the terminal to solve GUI tasks instead of sticking to pure GUI interactions
But in 2025, if you trust benchmarks, you will have a tough time
We looked at OSWorld, a popular evaluation of AI computer use capabilities.
Our findings: tasks are simple, many don't require GUIs, and success often hinges on interpreting ambiguous instructions. The benchmark is also not stable over time.
See thread for details!
Check out Game-TARS, a generalized multimodal game agent. It's literally the best general game AI in the world, and it's very small~
Paper: https://t.co/WR8858XtfA
Blog: https://t.co/AbJ6Q2Qw1q
🚀 Thrilled to introduce Game-TARS: our next-gen generalist multimodal game agent!
Tired of AI that needs custom code for every new game?
Game-TARS is a single VLM that learns to master any game just like a human: by watching the screen and using a keyboard & mouse.
Read more.
*checks chatgpt* This paper costs ~4.2 million USD (400K GB200 hours) -- science!
Our most expensive run took 100K GPU hours (the same amount as Deepseek-R1-zero, but on GB200s).
One finding here was that once we have a scalable RL algorithm, RL compute scaling becomes predictable (e.g., we extrapolated to 3x compute for a 17Bx16 MoE from 16k GPU Hours to 50k hours).
The other is when comparing algorithms, embrace the bitter lesson (try to predict how well it would scale with compute using a given performance curve, instead of just performance at a fixed compute).
Most algorithmic tricks in a scalable RL method don't change the asymptotic performance, but things like model size, context length, batch size, and data do.
There are of course many design choices in RL, so we don’t think that the ScaleRL recipe is the end of the story.
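To make the "compare curves, not single points" advice concrete, here is a minimal sketch. It assumes a sigmoid-shaped saturating curve in compute, which is an illustrative choice and not necessarily the fit used in the paper. The function name `saturating_curve`, the two algorithms, and all parameter values are hypothetical. It shows an algorithm that leads at 16k GPU hours but loses at 50k because its asymptote is lower.

```python
def saturating_curve(compute, asymptote, midpoint, exponent, start=0.0):
    """Performance that rises from `start` toward `asymptote` as compute grows.

    `midpoint` is the compute at which half of the gain is reached and
    `exponent` controls how sharply the curve saturates. The functional
    form is an assumption made for illustration only.
    """
    return start + (asymptote - start) / (1.0 + (midpoint / compute) ** exponent)


# Hypothetical fitted parameters for two RL recipes (not real results).
ALGO_A = dict(asymptote=0.60, midpoint=2_000.0, exponent=1.0)  # fast start, low ceiling
ALGO_B = dict(asymptote=0.75, midpoint=8_000.0, exponent=1.0)  # slow start, high ceiling


def better_at(compute):
    """Return which hypothetical algorithm scores higher at this compute budget."""
    a = saturating_curve(compute, **ALGO_A)
    b = saturating_curve(compute, **ALGO_B)
    return "A" if a > b else "B"


if __name__ == "__main__":
    for gpu_hours in (4_000, 16_000, 50_000):
        a = saturating_curve(gpu_hours, **ALGO_A)
        b = saturating_curve(gpu_hours, **ALGO_B)
        print(f"{gpu_hours:>6} GPU hours: A={a:.3f}  B={b:.3f}  winner={better_at(gpu_hours)}")
```

With these made-up numbers, a comparison at a fixed 16k-hour budget would pick A, while the extrapolated curves favor B at 50k hours and beyond.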
🚨Variational Reasoning for Language Models🚨
We show how treating thinking traces as latent variables unlocks a principled, stable, and unified framework for training reasoning LLMs.
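For readers unfamiliar with the latent-variable view, the standard evidence lower bound gives a rough idea of what such a framework can look like. The notation below is an assumption for illustration (x is the question, z the thinking trace, y the answer, q_phi a variational posterior over traces), and the paper's actual objective may differ.

```latex
\log p_\theta(y \mid x)
  = \log \sum_{z} p_\theta(z \mid x)\, p_\theta(y \mid x, z)
  \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x, y)}\!\left[ \log p_\theta(y \mid x, z) \right]
  - \mathrm{KL}\!\left( q_\phi(z \mid x, y) \,\Vert\, p_\theta(z \mid x) \right)
```

The first term rewards traces that lead to the correct answer, and the KL term keeps the posterior over traces close to what the model would generate from the question alone.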
🚀LLMs can learn directly from verbal feedback — no scalar rewards needed!
😥Scalar rewards compress rich feedback— “redundant but correct” vs “concise but typo-ridden” might both be 0.8
💡We propose to learn Feedback-Conditional Policy (FCP), an extremely scalable paradigm!