Another exciting development in our chips business as Meta has decided to bet big on Graviton, our leading CPU chip—committing to tens of millions of Graviton cores.
Agentic AI is becoming almost as big a CPU story as a GPU story. Complex multi-step orchestration, real-time reasoning, and code generation at scale are CPU-intensive work. Our purpose-built Graviton5 instances deliver up to 33% lower latency between cores, which matters a lot for these kinds of workloads.
Meta has been a longtime AWS customer and one of our biggest users of Bedrock... looking forward to what they build with Graviton5. https://t.co/1YSQBuUgSy
@nikesharora Frontier models will soon generate attack vectors and edge cases at scales no fixed application-layer system can match. The real counter becomes adaptive defender models running continuous red/blue self-play for dynamic, real-time neutralization.
How do you see Palo Alto Networks’ edge sensors evolving to play a central role in that adaptive defender model layer? And how could PAN’s proprietary signals let it outperform frontier labs’ models like Mythos?
We're expanding our collaboration with Amazon to secure up to 5 gigawatts of compute for training and deploying Claude. Capacity begins coming online this quarter, with nearly 1 gigawatt expected by the end of 2026.
RLHF is evolving toward harness feedback.
We’ve spent the last few years duct-taping LLMs together with Python.
Prompts, retry loops, tool wrappers, control flow. Useful, but most of the logic lived in code, not in the model.
What’s changing is where the model learns from.
Pre-training gave models language.
RLHF (reinforcement learning from human feedback) grounded them in human judgment.
RLAIF (reinforcement learning from AI feedback) scaled that signal using models to evaluate models.
Now we are seeing a third source of feedback.
Harness feedback.
The source of feedback is expanding from humans, to models, to environments.
Think of a codebase with tests, a math verifier, or a sandbox where each step must actually work.
This is not a single reward at the end.
It is an execution trace:
A failed test
A compiler error
An invalid sequence of actions
A constraint violation
The model sees what happened at each step.
On the surface, this looks like standard RL with an environment. The difference is how much of the trajectory the model gets to see.
The environment exposes failure and progress step by step.
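A minimal sketch of what step-level harness feedback could look like, assuming each action is a line of Python source and the "environment" is just the built-in compiler. The names `StepResult`, `run_with_harness`, and `python_syntax_check` are hypothetical, and the +1/-1 rewards are an illustrative choice, not a claim about how any lab trains models.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    step: int
    action: str
    ok: bool
    detail: str


def run_with_harness(actions, check):
    """Run every action through `check` and record feedback per step."""
    trace = []
    for i, action in enumerate(actions):
        ok, detail = check(action)
        trace.append(StepResult(i, action, ok, detail))
    return trace


def python_syntax_check(action):
    """Toy environment: an action 'works' if it compiles as Python source."""
    try:
        compile(action, "<step>", "exec")
        return True, "compiled"
    except SyntaxError as exc:
        return False, f"compiler error: {exc.msg}"


# The second action is deliberately malformed.
trace = run_with_harness(["x = 1", "y = = 2", "z = x + 1"], python_syntax_check)

# Dense signal: one value per step rather than a single reward at the end.
step_rewards = [1.0 if s.ok else -1.0 for s in trace]
```

The point of the sketch is the shape of the output: a list with one entry per step, so a failure can be attributed to the step that caused it.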
That changes what the model learns.
It learns which trajectories hold up inside a real system.
This shows up in both training and inference.
During training, the harness provides dense feedback over multiple rollouts.
During inference, the same environment validates steps and filters out bad paths.
The same harness shapes the model during training and constrains it during execution.
The unit of learning shifts from isolated outputs to full trajectories.
Each attempt, failure, correction, and completion contributes signal.
As this continues, more of the logic we currently write around models gets absorbed into the model itself.
Most agent failures are not visible in the final output.
As agents move from prototypes into production systems, the failure modes change.
Most evaluation setups still center on the final output. A few test cases, sometimes a judge model, and a check that the response looks reasonable.
That collapses most of the system.
What actually runs is a sequence of steps. Tool calls, intermediate states, and decisions that lead to a result. The response is a lossy projection of that sequence. Looking only at the output hides failures in how the result was produced.
Take a retrieval-based agent. It should retrieve documents, rank them, extract facts, and synthesize an answer. If a change causes it to skip retrieval and answer from internal weights, the output can still look correct on a small test set. The system has still regressed.
For most real tasks, you can’t tell if the system behaved correctly just from the answer.
The sequence needs constraints.
Some states should never occur. Invalid tool arguments, schema violations, inconsistent state.
Some steps must eventually happen. Retrieval, validation, and forward progress. Systems that stall or loop are failing even if individual steps look fine.
Some operations require ordering. Others allow partial ordering but still enforce dependencies.
These constraints define valid execution.
Each run can be captured as a trace of actions, tool calls, states, and ordering. Evaluation becomes checking whether that trace satisfies the constraints.
Did it avoid invalid states? Did it complete required steps and make progress? Did it respect ordering? Did it introduce unnecessary transitions?
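The checks above can be sketched as a small function over a trace of step names. This is an illustration under simple assumptions: a trace is a flat list of strings, ordering is judged by first occurrence, and the step names (`retrieve`, `rank`, and so on) are hypothetical labels for the retrieval agent example. Detecting unnecessary transitions is left out.

```python
def check_trace(trace, forbidden, required, must_precede):
    """Return a list of constraint violations for one trace of step names."""
    violations = []

    # States that should never occur.
    for step in trace:
        if step in forbidden:
            violations.append(f"forbidden step: {step}")

    # Steps that must eventually happen.
    for step in required:
        if step not in trace:
            violations.append(f"missing step: {step}")

    # Ordering dependencies, given as (before, after) pairs.
    for before, after in must_precede:
        if before in trace and after in trace:
            if trace.index(before) > trace.index(after):
                violations.append(f"order violated: {before} must precede {after}")

    return violations


forbidden = ["invalid_tool_args"]
required = ["retrieve", "synthesize"]
must_precede = [("retrieve", "synthesize")]

good = ["retrieve", "rank", "extract", "synthesize"]
regressed = ["synthesize"]  # answered from internal weights, skipped retrieval

good_violations = check_trace(good, forbidden, required, must_precede)
regressed_violations = check_trace(regressed, forbidden, required, must_precede)
```

Here the regressed run is flagged for a missing `retrieve` step even if its final answer happens to look correct.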
Most regressions show up here, not as wrong answers, but as violations in execution.
One run is not enough. The same input can produce different valid sequences. Stability shows up in the distribution. Healthy systems converge to a small set of paths. Regressions show up as higher variance, longer paths, or increased probability of loops and non-termination.
Looking across runs is similar to self-consistency. You are sampling the system and observing how often it follows valid paths versus drifting.
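One way to look at the distribution, as a rough sketch: collect the step sequence from each run and summarize how concentrated the paths are. The function name `path_stats`, the entropy metric, and the `max_steps` cutoff used as a stand-in for non-termination are all assumptions made for illustration.

```python
import math
from collections import Counter


def path_stats(runs, max_steps=20):
    """Summarize a set of runs, each given as a list of step names."""
    total = len(runs)
    paths = Counter(tuple(run) for run in runs)

    # Shannon entropy over distinct paths: 0 when every run takes the same path.
    entropy = -sum((c / total) * math.log2(c / total) for c in paths.values())

    return {
        "distinct_paths": len(paths),
        "entropy_bits": entropy,
        "mean_length": sum(len(run) for run in runs) / total,
        # Runs that hit the step budget are treated as loops or stalls.
        "nonterminating_rate": sum(len(run) >= max_steps for run in runs) / total,
    }


healthy = [["retrieve", "synthesize"]] * 4
drifting = [
    ["retrieve", "synthesize"],
    ["retrieve", "rank", "synthesize"],
    ["retrieve", "synthesize"],
    ["retrieve"] * 20,  # stuck in a loop until the step budget ran out
]

healthy_stats = path_stats(healthy)
drifting_stats = path_stats(drifting)
```

Comparing these summaries before and after a change gives a concrete number to watch instead of a vague sense that quality dropped.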
When available, token-level signals can add another layer. Shifts in probability distributions (for example via logprobs) can indicate the model is operating under a different regime. This is an early warning signal, not a correctness signal.
The evaluation stack should reflect this. Deterministic checks first. Then execution path validation. Then distribution across runs. Semantic evaluation last.
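The ordering can be sketched as a list of checks applied in sequence, stopping at the first layer that fails. Everything here is hypothetical: the run is a plain dict, the field names (`output`, `trace`, `valid_path_rate`, `judge_score`) are invented for the example, and the 0.9 and 0.7 thresholds are arbitrary.

```python
def evaluate(run, layers):
    """Apply layers in order and return the name of the first one that fails."""
    for name, check in layers:
        if not check(run):
            return name
    return None  # every layer passed


layers = [
    # 1. Deterministic checks: cheap and unambiguous.
    ("deterministic", lambda r: isinstance(r.get("output"), str) and r["output"] != ""),
    # 2. Execution path validation: did the required step happen?
    ("execution_path", lambda r: "retrieve" in r.get("trace", [])),
    # 3. Distribution across runs: how often sampled runs stay on valid paths.
    ("distribution", lambda r: r.get("valid_path_rate", 0.0) >= 0.9),
    # 4. Semantic evaluation last, e.g. a judge model score.
    ("semantic", lambda r: r.get("judge_score", 0.0) >= 0.7),
]

skipped_retrieval = {
    "output": "Looks plausible.",
    "trace": ["synthesize"],
    "valid_path_rate": 0.95,
    "judge_score": 0.9,
}
failed_layer = evaluate(skipped_retrieval, layers)
```

In this example the run would pass a judge-only evaluation, but the layered version stops at `execution_path` and names the problem.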
Many systems invert this, which is why issues show up as vague quality problems instead of something you can point to.
Once instrumented at this level, regressions become local. You can identify which step was skipped, where progress stopped, or where behavior diverged. In production, systems that stall or loop tend to go unnoticed longer than wrong answers.
There is a common take that coding agents are limited because writing code is only ~20% of software engineering. This assumes the rest of the job isn’t code.
It is.
Modern systems are defined by code. Infrastructure, CI/CD, tests, evals, monitoring, and deployments are all code. The other 80% is just a different category of logic.
This is where agents often do better.
Not because they are smarter. Because the feedback loop is stronger.
Product logic is inherently ambiguous. Requirements are incomplete. Correctness is subjective.
But the systems layer is explicit. A test passes or fails. A pipeline runs or breaks. A deployment succeeds or rolls back. A metric crosses a threshold. The system tells you what happened.
Agents work best when they can iterate rapidly against clear constraints. That is exactly what the operational layer provides.
The usual intuition is backwards. Agents are not just for feature work. They are strongest where behavior is observable and correctness is continuously validated.
These layers are typically underbuilt. Not because they are hard, but because they are tedious and time-intensive.
So systems ship with thin tests, brittle pipelines, shallow observability, and manual operations.
Agents change that.
They make it cheap to build and maintain these layers properly. More tests. Better evals. Safer deploys. Deeper instrumentation.
Agents take that further. They don’t just write the code. They manage the execution of the system the code defines.
The parts we called “not coding” are exactly where they have the most leverage.
It is hard to overstate how much the compute baseline has shifted recently. Not in the slow, incremental way. In a step-function way.
Most people are underestimating this shift by an order of magnitude.
Agents crossed a reliability threshold. They complete multi-step workflows with persistence, recovery, and validation loops. They do not just generate. They iterate until the task converges.
That changes the unit of work.
The unit is no longer a single inference call. It is generate, execute, validate, retry, branch, compare, converge. What used to be one deterministic path is now a structured search across many probabilistic ones.
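A stripped-down sketch of that unit of work, assuming candidates come from some generator (here a fixed list of Python functions standing in for model outputs) and validation is a single test. The name `converge` and the `max_attempts` budget are hypothetical. Branching and comparing candidates in parallel are omitted; this shows only the sequential generate, execute, validate, retry core.

```python
def converge(candidates, validate, max_attempts=5):
    """Try candidates in order until one validates or the budget runs out."""
    attempts = []
    for i, candidate in enumerate(candidates):
        if i >= max_attempts:
            break
        ok = validate(candidate)  # execute the candidate and check the result
        attempts.append(ok)
        if ok:
            return candidate, attempts
    return None, attempts


def doubles(x):
    return x + x  # wrong implementation of "square"


def squares(x):
    return x * x  # correct implementation


def validate_square(fn):
    """Stand-in test suite: the candidate must square its input."""
    return fn(3) == 9


winner, attempts = converge([doubles, squares], validate_square)
```

Even in this toy case one task costs two executions instead of one, which is the multiplication of compute the post describes.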
And this is not just about software. Modern knowledge work itself is moving from execution to exploration.
Once work becomes exploration, compute multiplies by design. Every serious task fans out into parallel model passes and parallel execution environments. GPUs scale reasoning. CPUs scale execution. Billions of isolated sandboxes compile, test, simulate, and verify generated outputs. One agent becomes dozens of processes.
Efficiency gains will not reduce demand. They increase ambition. We expand the search space. We raise quality thresholds. We extend context. We keep agents running continuously in the background.
Within a few years, much of today’s frontier capability will run on high-end personal racks and enterprise edge systems. That will not shrink cloud demand. It will add another active layer. Device, edge, and cloud will all remain engaged.
Compute is no longer a backend resource.
It is becoming the operating substrate of modern work.
And substrates do not shrink.
Software engineering is collapsing into model training
For decades, we “built” software. We wrote deterministic logic, handled every edge case manually, and treated code as a static artifact.
In the agentic era, we are moving from authoring to optimization. These systems don’t behave like traditional programs; they behave like training loops running at inference time:
1/ The codebase is the dataset: logs, traces, and prior attempts shape the distribution the agent operates in
2/ Evals are the objective: you don’t debug behavior; you tune the definition of “good” until it converges
3/ The sandbox is the environment: a safe space to execute, fail, and generate feedback
4/ Execution is the loop: each run is a forward pass; each failure is a signal that conditions the next attempt
Take a simple feature like document summarization. Instead of steering an agent step by step, you define constraints like “no hallucinations,” “key facts preserved,” and hard format requirements. Run it in a sandbox and let the loop do the work. Some signals are fuzzy, some are binary, but together they shape the behavior.
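As a rough illustration of mixing binary and fuzzy signals for the summarization case, the sketch below scores a summary against a hard format rule and a key-fact coverage ratio. The function `score_summary`, the 30-word limit, and the substring-based fact check are assumptions made for the example; a real "no hallucinations" check would need something much stronger and is not attempted here.

```python
def score_summary(summary, key_facts, max_words=30):
    """Return one binary signal and one fuzzy signal for a candidate summary."""
    text = summary.lower()
    return {
        # Binary: hard format requirement (length limit, ends with a period).
        "format_ok": len(summary.split()) <= max_words and summary.endswith("."),
        # Fuzzy: fraction of key facts that appear verbatim in the summary.
        "fact_coverage": sum(fact.lower() in text for fact in key_facts) / len(key_facts),
    }


key_facts = ["12%", "Q3"]
good = score_summary("Revenue rose 12% in Q3.", key_facts)
weak = score_summary("Revenue rose", key_facts)
```

Signals like these are what the loop would feed back to the agent after each sandboxed attempt.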
The real shift is this: we are no longer building the software ourselves; we are designing the system in which the agent converges to the right one.
Weak evals create churn. Precise verification collapses the search space until the correct behavior becomes the only stable outcome.
GPT-5.2 derived a new result in theoretical physics.
We’re releasing the result in a preprint with researchers from @the_IAS, @VanderbiltU, @Cambridge_Uni, and @Harvard. It shows that a gluon interaction many physicists expected would not occur can arise under specific conditions.
https://t.co/EAZhKWacsG
Claude Code and Codex are temporary bridges. These agents will become obsolete as models gain stronger reasoning, better self-correction, faster inference, and lower token costs. In the future software paradigm, code becomes ephemeral: generated, validated, and discarded on demand. The engineering bottleneck shifts away from humans working with coding agents to write, build, and deploy services, and toward direct intent-to-execution flows.
How LLMs “Think” by Scaling Compute
We often anthropomorphize large language models (LLMs). We say things like “the model reasoned step by step,” “its chain of thought is an internal monologue,” or “it figured out the answer the way a person would.” These metaphors are convenient, but they create a false picture. What looks like reasoning is usually scaled-up computation, not the presence of a human-like mind.
Chain of Thought Is Not an Internal Human Monologue
Common belief
Chain of thought (CoT) reveals the model’s inner reasoning, as if it is consciously thinking through a problem.
Reality
1/ An LLM predicts the next token by running a forward pass through all transformer layers.
2/ A direct question with a simple arithmetic goal requires only a few output tokens. Total compute is roughly O(L * d^2 * n_layers) per token, where L is context length and d is model width (the hidden dimension size, i.e. the number of features processed in each layer, which drives both capacity and computational cost). Fewer tokens mean fewer passes and lower total compute.
3/ A CoT prompt, as introduced in Wei et al.’s 2022 paper “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” forces a longer explanation. Each new token triggers another forward pass in which self-attention spans all previous tokens, increasing compute to about O(T * L * d^2 * n_layers), where T is the number of generated tokens. (Note: This applies to standard Transformers; optimizations like sparse attention in models such as Longformer or BigBird reduce quadratic attention costs.)
4/ Every pass lets attention weights form new combinations of earlier information, effectively giving the model multiple refinement steps on intermediate calculations. More tokens -> more passes -> more opportunities for error correction and implicit search.
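To make the scaling concrete, here is a small sketch that plugs numbers into the post's own rough expression, T * L * d^2 * n_layers. The model sizes and token counts are hypothetical, constant factors are ignored, and so are the separate attention term and any savings from key-value caching, so only the ratio between the two prompts is meaningful.

```python
def relative_compute(generated_tokens, context_len, d_model, n_layers):
    """Rough compute units following the T * L * d^2 * n_layers scaling."""
    per_token = context_len * d_model ** 2 * n_layers
    return generated_tokens * per_token


# Hypothetical model and prompt sizes, chosen only for illustration.
context_len, d_model, n_layers = 200, 4096, 32

direct = relative_compute(2, context_len, d_model, n_layers)   # short answer
cot = relative_compute(60, context_len, d_model, n_layers)     # step-by-step answer

ratio = cot / direct
```

Under these assumptions the ratio depends only on the number of generated tokens: 60 tokens cost 30 times as much as 2, whatever the model size.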
Fruit Puzzle Example
“You have five apples costing five cents each, four oranges at three cents each, and three bananas at two cents each. You eat two apples, one orange, and one banana. How many pieces of fruit remain and what is their total value?”
A direct-answer prompt (“Give the final number of fruit and total value”) might produce only two tokens. The model performs two forward passes and two rounds of self-attention - just enough to retrieve a memorized pattern. A CoT prompt (“Explain your steps before giving the final count and value”) could generate dozens of tokens. Each token requires a fresh forward pass where attention spans all prior tokens, repeatedly mixing partial sums and counts. The extra compute effectively acts like iterative refinement, letting the model track quantities and reduce mistakes without adding any new algorithm.
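For reference, the arithmetic a correct answer has to track can be written out explicitly. This is just the puzzle's bookkeeping in code, with hypothetical variable names, not a description of what the model does internally.

```python
# (count, price in cents) for each fruit at the start.
start = {"apple": (5, 5), "orange": (4, 3), "banana": (3, 2)}
eaten = {"apple": 2, "orange": 1, "banana": 1}

# Intermediate quantities a step-by-step answer would spell out.
left = {name: count - eaten[name] for name, (count, _) in start.items()}

pieces_remaining = sum(left.values())
total_value_cents = sum(left[name] * price for name, (_, price) in start.items())
```

That gives 3 apples, 3 oranges, and 2 bananas: 8 pieces worth 15 + 9 + 4 = 28 cents. Each of those intermediate values is something a CoT answer writes into the context, where later tokens can attend to it.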
Strawberry Counting Example
“How many r’s are in strawberry?”
A direct-answer prompt might output “2” (wrong), because the model pattern-matches without breaking the word down. One key issue is tokenization: Byte Pair Encoding (BPE) splits “strawberry” into subword tokens (e.g., “straw” + “berry”) rather than individual letters, so the model does not directly see the characters inside each token. A CoT prompt (“Think step by step: Spell out strawberry: s-t-r-a-w-b-e-r-r-y. Now count the r’s.”) generates more tokens, enabling explicit tallying across passes: positions 3, 8, and 9 are each “r”. Result: “3” (correct). The added compute and decomposition refine the count.
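The character-level tally is trivial once the word is spelled out, which is the step the CoT prompt forces. A short check, using 1-based positions to match the text:

```python
word = "strawberry"

# Spell the word out letter by letter, as the CoT prompt asks.
spelled = "-".join(word)

# 1-based positions of every "r".
r_positions = [i + 1 for i, ch in enumerate(word) if ch == "r"]
r_count = len(r_positions)
```

This gives positions 3, 8, and 9, for a count of 3. Code sees characters directly; the model sees subword tokens unless the prompt makes it write the letters out.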
Why This Matters at Every Layer
Understanding that longer outputs mean more forward passes and self-attention is useful no matter how you work with LLMs:
- Model builders can target training budgets, token lengths, and architecture choices for better cost-accuracy trade-offs.
- Fine-tuners and alignment teams can design prompts and reward models that deliberately allocate or limit compute.
- API users can choose prompt styles (short answers vs. detailed reasoning) to balance latency, price, and accuracy.
- Open-source deployers can set context windows and quantization levels knowing how token count drives FLOPs and memory.