Today, we’re launching Gemma 4, our most intelligent open models to date. Built with the same breakthrough technology as Gemini 3, Gemma 4 brings advanced reasoning to your personal hardware and devices.
Here’s what Gemma 4 unlocks for developers:
— Intelligence-per-parameter: Our 31B (Dense) and 26B (MoE) models deliver state-of-the-art performance for their size, outcompeting models 20x their size on @arena
— Commercial flexibility: Released under a permissive Apache 2.0 license for complete developer flexibility and digital sovereignty
— Agentic workflows: Native support for function-calling and structured JSON output allows you to build reliable, autonomous agents
— Multimodal edge AI: The E2B and E4B models bring native vision, audio, and low latency to mobile and IoT devices
— Long-context reasoning: Up to 256K context windows allow you to process entire repositories or large documents in a single prompt
Whether you're building global applications in 140+ languages or local-first AI code assistants, Gemma 4 is built to be your foundation. Explore in @GoogleAIStudio or download the weights on @HuggingFace, @Kaggle, and @Ollama.
Software development is undergoing a renaissance in front of our eyes.
If you haven't used the tools recently, you likely are underestimating what you're missing. Since December, there's been a step function improvement in what tools like Codex can do. Some great engineers at OpenAI yesterday told me that their job has fundamentally changed since December. Prior to then, they could use Codex for unit tests; now it writes essentially all the code and does a great deal of their operations and debugging. Not everyone has yet made that leap, but it's usually because of factors besides the capability of the model.
Every company faces the same opportunity now, and navigating it well — just like with cloud computing or the Internet — requires careful thought. This post shares how OpenAI is currently approaching retooling our teams towards agentic software development. We're still learning and iterating, but here's how we're thinking about it right now:
As a first step, by March 31st, we're aiming for the following:
(1) For any technical task, the tool of first resort for humans is interacting with an agent rather than using an editor or terminal.
(2) The default way humans utilize agents is explicitly evaluated as safe, but also productive enough that most workflows do not need additional permissions.
In order to get there, here's what we recommended to the team a few weeks ago:
1. Take the time to try out the tools. The tools do sell themselves — many people have had amazing experiences with 5.2 in Codex, after having churned from codex web a few months ago. But many people are also so busy they haven't had a chance to try Codex yet or got stuck thinking "is there any way it could do X" rather than just trying.
- Designate an "agents captain" for your team — the primary person responsible for thinking about how agents can be brought into the team's workflow.
- Share experiences or questions in a few designated internal channels
- Take a day for a company-wide Codex hackathon
2. Create skills and AGENTS[.md].
- Create and maintain an AGENTS[.md] for any project you work on; update the AGENTS[.md] whenever the agent does something wrong or struggles with a task.
- Write skills for anything that you get Codex to do, and commit them to the skills directory in a shared repository
3. Inventory and make accessible any internal tools.
- Maintain a list of tools that your team relies on, and make sure someone takes point on making it agent-accessible (such as via a CLI or MCP server).
4. Structure codebases to be agent-first. With the models changing so fast, this is still somewhat untrodden ground, and will require some exploration.
- Write tests which are quick to run, and create high-quality interfaces between components.
5. Say no to slop. Managing AI generated code at scale is an emerging problem, and will require new processes and conventions to keep code quality high
- Ensure that some human is accountable for any code that gets merged. As a code reviewer, maintain at least the same bar as you would for human-written code, and make sure the author understands what they're submitting.
6. Work on basic infra. There's a lot of room for everyone to build basic infrastructure, which can be guided by internal user feedback. The core tools are getting a lot better and more usable, but there's a lot of infrastructure that currently goes around the tools, such as observability, tracking not just the committed code but the agent trajectories that led to it, and central management of the tools that agents are able to use.
Overall, adopting tools like Codex is not just a technical but also a deep cultural change, with a lot of downstream implications to figure out. We encourage every manager to drive this with their team, and to think through other action items — for example, per item 5 above, what else can prevent a lot of "functionally-correct but poorly-maintainable code" from creeping into codebases.
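Item 3 above (making internal tools agent-accessible via a CLI) could look something like the following. This is only a minimal sketch under stated assumptions: the `deploy-status` tool, the `lookup` function, and the `_FAKE_DEPLOYS` data are all hypothetical stand-ins, not anything described in the post. The idea it illustrates is that an agent-friendly CLI prints machine-readable JSON and signals failure through its exit code.

```python
import argparse
import json

# Hypothetical in-memory stand-in for an internal service's data.
_FAKE_DEPLOYS = {
    "billing": {"version": "1.4.2", "healthy": True},
    "search": {"version": "0.9.0", "healthy": False},
}


def lookup(service):
    """Return a JSON-serializable result for one service (hypothetical tool logic)."""
    if service not in _FAKE_DEPLOYS:
        return {"ok": False, "error": f"unknown service: {service}"}
    return {"ok": True, "service": service, **_FAKE_DEPLOYS[service]}


def main(argv=None):
    """Parse arguments, print one JSON object to stdout, return an exit code."""
    parser = argparse.ArgumentParser(
        prog="deploy-status",
        description="Print the deploy status of a service as JSON.",
    )
    parser.add_argument("service", help="service name, e.g. billing")
    args = parser.parse_args(argv)

    result = lookup(args.service)
    # Stable key order makes the output easier for an agent to diff and parse.
    print(json.dumps(result, sort_keys=True))
    return 0 if result["ok"] else 1


# Demo call with an explicit argument list instead of the real command line.
exit_code = main(["billing"])
```

In a real script the last line would typically be replaced by `sys.exit(main())` under an `if __name__ == "__main__":` guard, so the agent can invoke the tool from a shell and check the exit status.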
We just published a new AI-Native Engineering Team guide based on what engineering teams are asking for as they adopt Codex and the new GPT-5.1-Codex-Max model.
It covers:
🧩 How coding agents fit into each phase of dev across planning, design, and maintenance
🧰 Practical checklists and setup patterns you can use right away
📈 How to introduce agents into an org and scale as teams build trust
Read the guide 👉 https://t.co/yyM3aIF2Vo
Satya Nadella on why Microsoft Excel has been so durable after 40 years:
> the power of lists and tables
> the malleability of the software (“a blinking canvas”)
> spreadsheet software is Turing complete (“I can make it do everything”)
> it’s the world’s most approachable programming environment (“you get into it without even thinking you’re programming”)
These “OpenAI tokens” are not OpenAI equity. We did not partner with Robinhood, were not involved in this, and do not endorse it. Any transfer of OpenAI equity requires our approval—we did not approve any transfer.
Please be careful.
// i lead model behavior at openai, and wanted to share some thoughts & nuance that went into setting policy for 4o image generation.
features capital letters (!) bc i published it as a blog post:
--
This week, we launched native image generation in ChatGPT through 4o.
It was a special launch for many reasons — one of which our CEO Sam highlighted as "a new high-water mark for us in allowing creative freedom."
I wanted to unpack that a bit, as it could be easily missed by those not deep in AI or closely following our evolving thoughts on model behavior (wh… what do you mean you haven’t read the sixty-page Model Spec in your free time??).
tl;dr we’re shifting from blanket refusals in sensitive areas to a more precise approach focused on preventing real-world harm. The goal is to embrace humility: recognizing how much we don't know, and positioning ourselves to adapt as we learn.
Images are visceral
There's something uniquely powerful and visceral about images; they can deliver unmatched delight and shock. Unlike text, images transcend language barriers and evoke varied emotional responses. They can clarify complex ideas instantly.
Precisely because images carry so much impact, we felt even more heft — relative to other launches — in shaping policy and behavior.
Evolving perspectives on launching what feels like a new capability
When it comes to launching (what feels like) a new capability, our perspective has evolved across multiple launches:
1. Trusting user creativity over our own assumptions.
AI lab employees should not be the arbiters of what people should and shouldn’t be allowed to create. We’re always humbled after launch, discovering use cases we never imagined — or even ones that seem so obvious in hindsight but didn’t occur to us from our limited perspectives.
2. Seeing risks clearly, but not losing sight of everyday value to users.
It’s easy to fixate on potential harms, and broad restrictions always feel safest (and easiest!). We often catch ourselves questioning, “do we really need better meme capabilities when the same memes could be used to offend or hurt people?”. But I think that framing itself is flawed. It implies that subtle, everyday benefits must justify themselves against hypothetical worst-case scenarios, which undervalues how these small moments of delight, humor, and connection genuinely improve people’s lives.
3. Valuing unknown, unimaginable possibilities.
Maybe due to our cognitive bias toward loss aversion, we rarely consider the negative impacts of inaction; some people refer to it as “invisible graveyards” although that’s a bit too morbid and extreme. There are second-order or indirect impacts unlocked by a new capability: all the positive interactions, innovations, and ideas from people that never materialize simply because we feared the worst-case scenario.
How we thought about policy decisions for Day 1
Navigating these challenges is hard, but we aimed to maximize creative freedom while preventing real harm. Some examples from our launch decisions:
- Public figures: We know it can be tricky with public figures—especially when the lines blur between news, satire, and the interests of the person being depicted. We want our policies to apply fairly and equally to everyone, regardless of their “status”. But rather than be the arbiters of who is “important enough”, we decided to create an opt-out list to allow anyone who can be depicted by our models to decide for themselves.
- “Offensive” content: When it comes to “offensive” content, we pushed ourselves to reflect on whether any discomfort was stemming from our personal opinions or preferences vs. potential for real-world harm. Without clear guidelines, the model previously refused requests like "make this person’s eyes look more Asian" or "make this person heavier," unintentionally implying these attributes were inherently offensive.
- Hate symbols: We recognize symbols like swastikas carry deep and painful history. At the same time, we understand they can also appear in genuinely educational or cultural contexts. Completely banning them could erase meaningful conversations and intellectual exploration. Instead, we're iterating on technical methods to better identify and refuse harmful misuse.
- Minors: Whenever a policy decision involved younger users, we decided to play it safe: choosing stronger protections and tighter guardrails for people under 18 across research and product.
Ultimately, these considerations — coupled with our progress toward more precise technical levers — led us toward more permissive policies. We recognize this might be misinterpreted as “OpenAI lowering its safety standards,” but personally, I don’t think that does justice to the team’s extensive research, thoughtful debates, and genuine love & care for users and society.
My colleague Jason Kwon once passed on to me:
“Ships are safest in the harbor; the safest model is the one that refuses everything. But that’s not what ships or models are for.”
The future is built with imagination and adventure. As we continue our research and learn from society, we believe we can continue to find ways to responsibly increase user freedom. When (not if!) our policies evolve, updating them based on real-world feedback isn’t failure; that’s the point of iterative deployment.
Please keep sharing your feedback and creations — they genuinely help us improve!
A couple reflections on the quantum computing breakthrough we just announced...
Most of us grew up learning there are three main types of matter that matter: solid, liquid, and gas. Today, that changed.
After a nearly 20-year pursuit, we’ve created an entirely new state of matter, unlocked by a new class of materials, topoconductors, that enable a fundamental leap in computing.
It powers Majorana 1, the first quantum processing unit built on a topological core.
We believe this breakthrough will allow us to create a truly meaningful quantum computer not in decades, as some have predicted, but in years.
The qubits created with topoconductors are faster, more reliable, and smaller.
They are 1/100th of a millimeter, meaning we now have a clear path to a million-qubit processor.
Imagine a chip that can fit in the palm of your hand yet is capable of solving problems that even all the computers on Earth today combined could not!
Sometimes researchers have to work on things for decades to make progress possible.
It takes patience and persistence to have big impact in the world.
And I am glad we get the opportunity to do just that at Microsoft.
This is our focus: When productivity rises, economies grow faster, benefiting every sector and every corner of the globe.
It’s not about hyping tech; it’s about building technology that truly serves the world.
.@satyanadella on:
- why he doesn’t believe in AGI but does believe in 10% economic growth
- Microsoft’s new topological qubit breakthrough and gaming world models
- whether Office commoditizes LLMs or the other way around
Links below. Enjoy!
Timestamps
0:00:00 - Intro
0:05:48 - AI won't be winner-take-all
0:16:02 - World economy growing by 10%
0:22:23 - Decreasing price of intelligence
0:31:03 - Microsoft's Quantum breakthrough
0:43:35 - Microsoft's gaming world model
0:50:35 - Legal barriers to AI
0:56:30 - Getting AGI safety right
1:05:43 - 34 years at Microsoft
1:11:31 - Does Satya Nadella believe in AGI?
I don’t understand why the Canadian news outlets keep trying to make me care about Mark Carney or Chrystia Freeland’s thoughts. Mark is not in government.
I don't have too too much to add on top of this earlier post on V3 and I think it applies to R1 too (which is the more recent, thinking equivalent).
I will say that Deep Learning has a legendary ravenous appetite for compute, like no other algorithm that has ever been developed in AI. You may not always be utilizing it fully but I would never bet against compute as the upper bound for achievable intelligence in the long run. Not just for an individual final training run, but also for the entire innovation / experimentation engine that silently underlies all the algorithmic innovations.
Data has historically been seen as a separate category from compute, but even data is downstream of compute to a large extent - you can spend compute to create data. Tons of it. You've heard this called synthetic data generation, but less obviously, there is a very deep connection (equivalence even) between "synthetic data generation" and "reinforcement learning". In the trial-and-error learning process in RL, the "trial" is model generating (synthetic) data, which it then learns from based on the "error" (/reward). Conversely, when you generate synthetic data and then rank or filter it in any way, your filter is straight up equivalent to a 0-1 advantage function - congrats you're doing crappy RL.
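To make the "filter = 0-1 advantage" point concrete, here is a small sketch, not anyone's actual training code. It assumes a toy softmax policy over four discrete actions and a made-up filter (actions 2 and 3 count as "good"); all names are hypothetical. It computes the log-likelihood gradient on only the filtered samples, then the REINFORCE-style gradient where each sample is weighted by an advantage of 0 or 1, and checks that the two match.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: onehot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def filtered_sft_gradient(logits, samples, keep):
    """Log-likelihood gradient summed over only the samples that pass the filter."""
    total = [0.0] * len(logits)
    for action, kept in zip(samples, keep):
        if kept:
            g = grad_log_prob(logits, action)
            total = [t + gi for t, gi in zip(total, g)]
    return total


def policy_gradient(logits, samples, advantages):
    """REINFORCE-style gradient: sum of advantage * grad log pi(action)."""
    total = [0.0] * len(logits)
    for action, adv in zip(samples, advantages):
        g = grad_log_prob(logits, action)
        total = [t + adv * gi for t, gi in zip(total, g)]
    return total


random.seed(0)
logits = [0.2, -0.5, 1.0, 0.0]

# "Trial": the model generates (synthetic) data by sampling from its own policy.
samples = random.choices(range(len(logits)), weights=softmax(logits), k=64)

# Hypothetical filter: keep a sample only if the action is 2 or 3.
keep = [a >= 2 for a in samples]
# The same filter expressed as a 0-1 advantage.
advantages = [1.0 if k else 0.0 for k in keep]

g_sft = filtered_sft_gradient(logits, samples, keep)
g_pg = policy_gradient(logits, samples, advantages)
same = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(g_sft, g_pg))
print(same)
```

The two gradients agree because dropping a sample is the same as multiplying its gradient term by zero. What the filter does not give you is anything beyond that binary signal (no baseline, no graded reward), which is one reading of the "crappy RL" remark.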
Last thought. Not sure if this is obvious. There are two major types of learning, in both children and in deep learning. There is 1) imitation learning (watch and repeat, i.e. pretraining, supervised finetuning), and 2) trial-and-error learning (reinforcement learning). My favorite simple example is AlphaGo - 1) is learning by imitating expert players, 2) is reinforcement learning to win the game. Almost every single shocking result of deep learning, and the source of all *magic* is always 2. 2 is significantly significantly more powerful. 2 is what surprises you. 2 is when the paddle learns to hit the ball behind the blocks in Breakout. 2 is when AlphaGo beats even Lee Sedol. And 2 is the "aha moment" when DeepSeek (or o1 etc.) discovers that it works well to re-evaluate your assumptions, backtrack, try something else, etc. It's the solving strategies you see this model use in its chain of thought. It's how it goes back and forth thinking to itself. These thoughts are *emergent* (!!!) and this is actually seriously incredible, impressive and new (as in publicly available and documented etc.). The model could never learn this with 1 (by imitation), because the cognition of the model and the cognition of the human labeler are different. The human would never know to correctly annotate these kinds of solving strategies and what they should even look like. They have to be discovered during reinforcement learning as empirically and statistically useful towards a final outcome.
(Last last thought/reference this time for real is that RL is powerful but RLHF is not. RLHF is not RL. I have a separate rant on that in an earlier tweet
https://t.co/RMIpFPVpuM)
@stephenrobles I find these just don’t do the job. It gets too hot and overheats your phone too, which throttles the phone’s performance. Need to look for a Qi2 one