What a year @GoogleAI (Dec 2022-Dec 2023) 🚀 Working with an amazing team all over the globe has been a highlight, and I’m impressed with how Gemini was built as a startup within Google. It’s been a unique, rewarding experience with tons of learning along the journey. Another step forward in AI.
I’m very excited to share our work on Gemini today! Gemini is a family of multimodal models that demonstrate really strong capabilities across the image, audio, video, and text domains. Our most-capable model, Gemini Ultra, advances the state of the art in 30 of 32 benchmarks, including 10 of 12 popular text and reasoning benchmarks, 9 of 9 image understanding benchmarks, 6 of 6 video understanding benchmarks, and 5 of 5 speech recognition and speech translation benchmarks. Gemini Ultra is the first model to achieve human-expert performance on MMLU across 57 subjects with a score above 90%. It also achieves a new state-of-the-art score of 62.4% on the new MMMU multimodal reasoning benchmark, outperforming the previous best model by more than 5 percentage points.
Gemini was built by an awesome team of people from @GoogleDeepMind, @GoogleResearch, and elsewhere at @Google, and is one of the largest science and engineering efforts we’ve ever undertaken. As one of the two overall technical leads of the Gemini effort, along with my colleague @OriolVinyalsML, I am incredibly proud of the whole team, and we’re so excited to be sharing our work with you today!
There’s quite a lot of different material about Gemini available, starting with:
Main blog post: https://t.co/NzSycJl7aE
60-page technical report authored by the Gemini Team: https://t.co/CEdMRyYSLo
In this thread, I’ll walk you through some of the highlights.
A super long overdue (3+ years?) post on scaling laws.
Compute is expensive. Scaling laws are a way to help us reason about the optimal compute allocation between data and model size before committing to a large run.
The post covers what scaling laws predict, how compute-optimal allocation works, why Kaplan et al. and Chinchilla disagree, and how data limits + fitting details make extrapolation tricky.
https://t.co/HP26eJvjHB
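To make the compute-allocation idea concrete, here is a rough sketch of the back-of-the-envelope calculation. It is not taken from the post itself. It assumes the common approximation that training compute is about 6 FLOPs per parameter per token (C ≈ 6·N·D) and the Chinchilla-style rule of thumb of roughly 20 tokens per parameter. Both constants are approximations, and the function name `compute_optimal_allocation` is made up for illustration.

```python
import math

# Assumption: training cost is roughly 6 FLOPs per parameter per token (C ≈ 6 * N * D).
FLOPS_PER_PARAM_TOKEN = 6.0
# Assumption: Chinchilla-style rule of thumb of about 20 training tokens per parameter.
TOKENS_PER_PARAM = 20.0


def compute_optimal_allocation(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOP budget into (parameters, tokens) under C = 6*N*D and D = r*N.

    Substituting D = r*N into C = 6*N*D gives N = sqrt(C / (6*r)).
    """
    if compute_flops <= 0 or tokens_per_param <= 0:
        raise ValueError("compute_flops and tokens_per_param must be positive")
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # Budget matching a 70B-parameter model trained on 1.4T tokens.
    budget = 6.0 * 70e9 * 1.4e12
    params, tokens = compute_optimal_allocation(budget)
    print(f"params ~ {params:.3g}, tokens ~ {tokens:.3g}")
```

With a fixed tokens-per-parameter ratio, both model size and data grow with the square root of compute. A different fitted ratio or exponent moves the split, which is part of why extrapolating from small runs is tricky.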
Hear me out on a SF coffee chat idea:
Do a coffee chat with me, but whoever brings up AI, LLMs, AGI, ASI, frontier models, GPUs, semiconductors, NVIDIA, OpenAI, Anthropic, DeepMind, Meta AI, xAI, Claude, ChatGPT, Gemini, Grok, Llama, Qwen, DeepSeek, Mistral, agents, agentic workflows, prompts, prompt engineering, tokens, inference, CUDA, H100s, H200s, B200s, A100s, TPUs, Cerebras, Groq, ASICs, wafer-scale chips, TSMC, ASML, EUV, HBM, DDR5, data centers, servers, power constraints, nuclear energy for AI, cloud computing, Kubernetes, Docker, vector DBs, embeddings, RAG, context windows, long-context models, Transformers, state space models, Mamba, MoE, sparse attention, KV cache, quantization, distillation, RLHF, RLAIF, DPO, PPO, constitutional AI, synthetic data, data flywheels, post-training, pre-training, test-time compute, chain-of-thought, reasoning models, o1, o3, evals, benchmarks, SWE-bench, MMLU, GPQA, ARC-AGI, Humanity’s Last Exam, multimodality, VLMs, diffusion models, world models, robotics foundation models, autonomous driving, self-play, tool use, computer use, browser agents, AI coding, Cursor, Copilot, Codex, Devin, Claude Code, AI video generation, Sora, Veo, Runway, Midjourney, Stable Diffusion, scaling laws, Bitter Lesson, interpretability, mechanistic interpretability, alignment, AI safety, model collapse, hallucination, jailbreaks, prompt injection, AI wrappers, AI-native SaaS, AI replacing PMs, AI replacing engineers, AI fundraising, foundation model economics, inference margins, AI capex, NVIDIA market cap, tech stocks, IPOs, Silicon Valley, SF founder mode, SpaceX, data centers on the moon or Mars, Elon Musk, Jensen Huang, or Sam Altman…
buys coffee for the other person.
I don’t think it’s possible to make it through the whole conversation without saying any of these in the Bay Area, but worth experimenting
@_arohan_ At this point, we need to write a comprehensive guidebook on training spikes: different types, root causes, and mitigation methods for each.
Today at Google I/O, we introduced Gemini 3.5 Flash! It has become an integral part of our daily research cycle and works with all the tools we have at Google.
We used a team of agents in Antigravity 2.0 to recreate the original AlphaZero research paper and build a playable version. They coded the reinforcement learning pipeline in JAX/Flax, trained a ResNet model from scratch via self-play on multi-TPU pods, and shipped a full-stack web app so you can play against it, from just 2 prompts.
Here’s what else makes 3.5 Flash special 🧵
The results of the research happening in my team @GoogleDeepMind have convinced me that the next era of scientific discovery will be aided by AI agents acting as force multipliers for human ingenuity.
That’s why I’m proud to introduce Gemini for Science - a collection of experimental science tools designed to support researchers at every stage of the research process. The tools include:
1️⃣ Literature Insights, built with Google NotebookLM, searches millions of scientific papers to synthesize findings and generate artifacts including data tables, slides, reports, and more.
2️⃣ Hypothesis Generation, built with Co-Scientist, simulates the scientific method via a multi-agent "idea tournament" to generate, debate, and rigorously evaluate research hypotheses.
3️⃣ Computational Discovery, built with AlphaEvolve and ERA, is an agentic engine that generates and scores thousands of code variations in parallel, allowing researchers to test modeling approaches in fields like epidemiology in a fraction of the usual time.
Read more: https://t.co/l8XIg8iXCN
Register for access here: https://t.co/V3YS15mRUS
1/ Today at Google I/O, we’re launching Gemini 3.5 Flash ⚡️⚡️⚡️!
Our mission was clear: bring frontier-level intelligence with unprecedented speed.
3.5 Flash delivers drastic intelligence (beating 3.1 Pro on almost every benchmark), at Flash speeds. 🧵
in 3-4 years companies will be hiring INSANELY expensive consultants to unscrew their Mythos-created spaghetti critical infrastructure, which was 99.9% autonomous, until the 0.1% catastrophe hit
don't underestimate humans. We are amazing!
Did a very different format with @reinerpope – a blackboard lecture where he walks through how frontier LLMs are trained and served.
It's shocking how much you can deduce about what the labs are doing from a handful of equations, public API prices, and some chalk.
It’s a bit technical, but I encourage you to hang in there - it’s really worth it.
There are fewer than a handful of people who understand the full stack of AI, from chip design to model architecture, as well as Reiner. It was a real delight to learn from him.
Recommend watching this one on YouTube so you can see the chalkboard.
0:00:00 – How batch size affects token cost and speed
0:31:59 – How MoE models are laid out across GPU racks
0:47:02 – How pipeline parallelism spreads model layers across racks
1:03:27 – Why Ilya said, “As we now know, pipelining is not wise.”
1:18:49 – Because of RL, models may be 100x over-trained beyond Chinchilla-optimal
1:32:52 – Deducing long context memory costs from API pricing
2:03:52 – Convergent evolution between neural nets and cryptography