Phoenix 13.0
Phoenix 13 is a major release centered around Dataset Evaluators, a new system that turns your datasets into reusable evaluation suites. This release also introduces custom model providers, OpenAI Responses API support, and dozens of Playground and experiment UX improvements.
"Don't trust. Evaluate."
@nearestnabors set out to replace Claude Sonnet with Gemma 4. The evals showed a quantifiably better option.
Full walkthrough: capability evals + prompt engineering to ship a local 3B that matches Sonnet, 2x faster, $0/call.
Built with Phoenix.
https://t.co/UjE4zdiR6X
Our own Laurie Voss, head of Developer Relations, will be speaking at Qdrant's Vector Space Day conference!
Most teams ship retrieval systems by tweaking the chunking, running a few demo queries, and calling it done. "Looks good to me" is not an evaluation strategy, but it's the one the industry has quietly agreed on.
Laurie's talk will cover the retrieval metrics that actually matter, how to build golden datasets that survive contact with reality, where LLM-as-judge helps and where it quietly lies to you, and how to wire continuous evals into your CI pipeline so regressions show up before your customers do.
Come with your skepticism. Leave with a playbook!
Vector Space Day is a full-day single-track conference for engineers at The Midway, San Francisco on June 11. Tickets at https://t.co/51ZJUhvLrd
Phoenix now lets you compose evaluation strategies in code.
Most eval tooling hands you a fixed menu of judge templates. Real evaluation is rarely that tidy.
Code Evaluators enable you to build evaluation criteria the way you want. You write a Python or TypeScript evaluate() function in the Phoenix UI — no SDK, no local runtime, no deploy step — and Phoenix runs it server-side, recording labels and scores as annotations on every experiment run.
Because it's just code, you control the whole strategy:
• Composite scoring: blend sub-scores (LLM judgment + deterministic rules) into one weighted metric
• Embedding-based evaluation: cosine similarity over embeddings instead of brittle string matching
• LLM juries: poll multiple models and combine verdicts into a weighted consensus
Sandboxed code evaluators also open the door to agents as judges. We're excited about where this is heading.
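As a rough illustration of the composite and embedding-based strategies listed above, here is a minimal Python sketch. The `evaluate(output, expected)` signature, the returned dict shape, the weights, and the toy `embed` function are all assumptions made for illustration; the real contract is whatever the Phoenix Code Evaluator editor expects.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption); swap in a real embedding model."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def evaluate(output: str, expected: str) -> dict:
    """Hypothetical evaluator: blend a similarity score with a deterministic rule."""
    similarity = cosine_similarity(embed(output), embed(expected))
    within_length = 1.0 if len(output) <= 280 else 0.0  # deterministic rule
    score = 0.7 * similarity + 0.3 * within_length  # illustrative weights
    return {"score": round(score, 3), "label": "pass" if score >= 0.5 else "fail"}
```

An LLM jury would follow the same pattern: collect one sub-score per judge model and combine them with weights, just as the two sub-scores are combined here.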
https://t.co/6qIQNzrjvW
The Arize DevRel team wants to connect with Phoenix users like you. What are you tracing, what's working, what's rough?
Schedule time with the team here: https://t.co/2bb4FQiJh9
Something we’ve been playing with and liking a lot:
Give every coding agent its own observability stack.
Because Arize Phoenix can run fully local-first and air-gapped on your computer, each coding agent can get its own Phoenix instance: its own port, its own SQLite DB, its own traces, its own evals.
That means every agent working in its own worktree can observe what it did, inspect its traces, run evals against its changes, and use that feedback loop to self-verify before handing work back.
A private loop for every agent:
code → trace → evaluate → improve → verify
This makes it possible to scale many coding agents locally without cross-talk, shared state, or interference between neighboring work.
The bigger idea is that agents should not just generate work. They should be able to measure and validate their work continuously.
Local-first observability makes that practical.
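One way to hand each agent its own instance is to derive isolated settings from the agent's index and worktree. The sketch below only builds the environment; it assumes `PHOENIX_PORT`, `PHOENIX_GRPC_PORT`, `PHOENIX_WORKING_DIR`, and `PHOENIX_COLLECTOR_ENDPOINT` are the variables your Phoenix version reads, so check them against the configuration docs. The helper name and base ports are hypothetical.

```python
from pathlib import Path


def phoenix_env_for_agent(agent_index: int, worktree: str,
                          base_port: int = 6006, base_grpc_port: int = 4317) -> dict:
    """Build isolated settings for one agent's Phoenix instance.

    The variable names are assumptions; verify them for your Phoenix version.
    """
    port = base_port + agent_index
    working_dir = Path(worktree) / ".phoenix"  # keeps the SQLite DB inside the worktree
    return {
        "PHOENIX_PORT": str(port),
        "PHOENIX_GRPC_PORT": str(base_grpc_port + agent_index),
        "PHOENIX_WORKING_DIR": str(working_dir),
        "PHOENIX_COLLECTOR_ENDPOINT": f"http://localhost:{port}",
    }
```

Pass the resulting dict as the environment of both the Phoenix process and the agent process, so the agent's traces land in its own instance and nowhere else.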
A comprehensive 2-hour evaluations workshop, for free!
At AI Engineer: Europe, head of DevRel Laurie Voss gave this workshop, which covers:
- What is an eval?
- Why are they important?
- How and why to manually examine the data
- Using built-in Phoenix evals
- Writing custom evals
https://t.co/NjkWF49Eym
Phoenix 15.5 → 15.7
🏷️ Note identifiers for upsert semantics (15.7)
POST /v1/{trace,span,session}_notes accepts an optional identifier; repeat writes overwrite in place. This is what makes notes safe to write from automation — a nightly evaluator keyed by identifier="qa-bot-v3@<id>" re-runs idempotently.
Closes the asymmetry with annotations, which already had these semantics.
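The upsert behavior can be pictured with a small in-memory model. This is not Phoenix's implementation: the store layout, the `upsert_note` name, and keying on (target, identifier) are assumptions used only to show why repeat writes become idempotent.

```python
from typing import Optional


def upsert_note(store: list, target_id: str, text: str,
                identifier: Optional[str] = None) -> list:
    """Write a note; with an identifier, a repeat write overwrites in place."""
    if identifier is not None:
        for note in store:
            if note["target_id"] == target_id and note["identifier"] == identifier:
                note["text"] = text  # overwrite instead of appending a duplicate
                return store
    # No identifier (or first write with this identifier): append a new note
    store.append({"target_id": target_id, "identifier": identifier, "text": text})
    return store


notes = []
upsert_note(notes, "span-1", "looks fine", identifier="qa-bot-v3@span-1")
upsert_note(notes, "span-1", "regressed", identifier="qa-bot-v3@span-1")
# notes now holds a single entry with the latest text
```

Without an identifier every write appends, which is why unkeyed notes from a nightly job would pile up.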
⌨️ Bulk annotation delete + px project get (15.7)
When a judge LLM ran on a bad prompt for four hours and wrote 12k bad labels, the recovery path used to be a hand-written DB script. Now:
px span-annotations delete --identifier <bad-run> --start-time ... --end-time ...
Scoped delete by annotator identifier and time range — exactly the operation you need to roll back a misfired eval pass. px project get <name> lands alongside, unblocking project lookups in CI and shell pipelines.
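The filter that such a scoped delete applies can be sketched in a few lines. This is an in-memory model rather than the CLI's code; the field names and the half-open time window are assumptions, and the timestamps are made up.

```python
from datetime import datetime, timezone


def scoped_delete(annotations, identifier, start_time, end_time):
    """Split annotations into (kept, deleted) by identifier and time window.

    Assumes a half-open window: start_time <= created_at < end_time.
    """
    kept, deleted = [], []
    for annotation in annotations:
        in_scope = (
            annotation["identifier"] == identifier
            and start_time <= annotation["created_at"] < end_time
        )
        (deleted if in_scope else kept).append(annotation)
    return kept, deleted


def utc(hour):
    return datetime(2025, 1, 1, hour, tzinfo=timezone.utc)


annotations = [
    {"identifier": "bad-run", "created_at": utc(2)},   # in scope
    {"identifier": "bad-run", "created_at": utc(9)},   # same run, outside window
    {"identifier": "good-run", "created_at": utc(2)},  # different annotator
]
kept, deleted = scoped_delete(annotations, "bad-run", utc(0), utc(4))
```

Both conditions must hold, so labels from other annotators in the same window, and labels from the same annotator outside it, are left untouched.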
🎯 OTLP project routing via x-project-name header (15.5) — the architecturally important one
Previously, routing a service's spans to a Phoenix project required setting the project name in the SDK init of every instrumented service. That's cross-cutting routing config buried inside application code.
The new header is read at OTLP ingestion and overrides the https://t.co/XZA0lbhdrK resource attribute. As a result, the OpenTelemetry Collector becomes a first-class routing layer: routing is now possible via a header. Great for things like OpenClaw and Daytona.
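For services that export directly rather than through a Collector, one way to attach the header without touching application code is the standard `OTEL_EXPORTER_OTLP_HEADERS` environment variable, which takes comma-separated `key=value` pairs. A minimal sketch; the endpoint, the project name, and the helper name are placeholders.

```python
from urllib.parse import quote


def otlp_headers_env(headers: dict) -> str:
    """Format headers as comma-separated key=value pairs, URL-encoding the values."""
    return ",".join(f"{key}={quote(value)}" for key, value in headers.items())


env = {
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:6006",  # placeholder endpoint
    "OTEL_EXPORTER_OTLP_HEADERS": otlp_headers_env({"x-project-name": "checkout-service"}),
}
```

With a Collector in the path, the same header would instead be set once in the Collector's exporter configuration, which is what turns it into the routing layer.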
🎨 Playground default provider + model (15.6)
Persisted per-browser on the AI Providers settings page. If you iterate on prompts in Playground, this stops the model picker from resetting on every new session.
Release notes: https://t.co/I2aF193QEA
Repo: https://t.co/rsKeegSVwl
Agents are getting distributed.
One “agentic workflow” can span:
app server → MCP tool server → another agent (A2A) → ACP runtime → vector DB → LLM provider → internal APIs
So the “agent” isn’t one process or one log stream anymore.
It’s a graph across network/runtime/vendor/protocol boundaries.
That’s why distributed tracing is becoming essential for AI systems.
Logs tell you what happened inside *one* component.
They rarely reconstruct the full topology end-to-end.
Tracing gives the connective tissue:
- Which agent called which tool?
- Which tool called which service?
- Which model call depended on which retrieval step?
- Where did latency enter?
- Where did the failure begin?
As MCP/A2A/ACP make systems more modular, observability has to follow.
Propagate context across every boundary.
Correlate everything back into one trace.
In distributed agent systems, the hard question isn’t “what happened?”
It’s: how did the request move through the system—and where did it go wrong?
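What "propagate context" means on the wire can be sketched with the W3C Trace Context `traceparent` header (`version-traceid-parentid-flags`). In practice an OpenTelemetry propagator does this for you; the helper names below are hypothetical, and the sketch skips some spec checks, such as rejecting all-zero IDs.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace: random 16-byte trace ID, 8-byte span ID, sampled flag set."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(incoming: str) -> str:
    """Continue an incoming trace: keep the trace ID and flags, mint a new span ID."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        return new_traceparent()  # malformed header: start a fresh trace
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

Every hop (app server, MCP tool server, downstream agent) forwards a child header on its outgoing calls, so all spans share one trace ID and can be stitched back into a single trace.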
.@Chi_Wang_ spent the last few years pushing the boundaries of what agents can be, from AutoGen's multi-agent vision to today's frontier systems.
He's bringing 'Frontier of Agentic AI' to Observe: AI that writes code, runs 24/7, coordinates agent teams, and speaks to you in real time.
If you're building agents and the whole field feels like a moving target, join us at Observe and catch this talk!
June 4, SF: https://t.co/NYEih97lij
Evals should help you ship FASTER.
They are not there to gatekeep. They should be an accelerant: a way to move faster with more confidence.
That only works if feedback is fast and accurate.
When evals are slow, teams run them less often, test fewer cases, and push quality checks to later.
The Phoenix eval executors are designed around speed.
Under the hood, Phoenix uses parallel workers, bounded queues, retries, rate-limit handling, and dynamic concurrency to keep eval runs moving close to provider throughput limits.
The alternatives are usually worse: run sequentially and wait, manually crank concurrency until you hit 429s, or rebuild queueing and retry logic yourself.
Run the eval. Get trustworthy signal. Catch regressions sooner. Ship with more confidence.
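To make the moving parts concrete, here is a simplified sketch of bounded concurrency with retries on rate limits. It is not Phoenix's executor: the concurrency is fixed rather than dynamic, and `RateLimitError`, `call_with_retry`, and `run_evals` are hypothetical names.

```python
import asyncio
import random


class RateLimitError(Exception):
    """Stand-in for a provider 429 response (hypothetical)."""


async def call_with_retry(task, item, max_attempts=5, base_delay=0.01):
    """Call task(item), backing off exponentially with jitter on rate limits."""
    for attempt in range(max_attempts):
        try:
            return await task(item)
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            await asyncio.sleep(base_delay * (2 ** attempt) * (1 + random.random()))


async def run_evals(task, items, concurrency=4):
    """Run task over items with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(item):
        async with semaphore:
            return await call_with_retry(task, item)

    # gather returns results in the same order as items
    return await asyncio.gather(*(worker(item) for item in items))
```

A worker keeps its semaphore slot while backing off, so a burst of 429s naturally slows the whole run instead of piling more requests onto the provider.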
We at @arizeai shipped an OTel middleware for @tan_stack AI a few weeks ago https://t.co/lc14JBNaN9
@arizeai/openinference-tanstack-ai ships OTel traces for your LLM calls directly to the backend of your choice, adhering to the OpenInference Specification https://t.co/IzA4KxWbXb
I'm super glad to see first party OTel integration with TanStack AI, excited to see what ideas we can share
The official TanStack AI OTel support is out! Looking for an OSS backend for traces, datasets, and replay? Check out our docs for how to seamlessly add debuggability.
https://t.co/tmgBIwnKyA
Your AI agents are running blind in prod?
One line of otelMiddleware() and every chat, iteration, and tool call lands in your OTel backend with full GenAI semconv attributes.
Vendor-neutral. Optional peer dep. Already shipped on @tan_stack ai.
https://t.co/EaHn21Ag9i
One of the oldest lessons in ML is still one of the most useful for working with LLM apps:
Don’t evaluate on the same data you used to build.
Train/dev/validation/test splits exist for a reason. They help separate “this worked because it was tuned against it” from “this actually generalizes.”
The same practice maps naturally to agents, prompts, and evals.
A good dataset might include:
* a dev split for fast iteration
* a validation split for prompt/model selection
* a test split for final confidence
* a hard-examples split for the cases the system keeps failing
A single aggregate score over the whole dataset usually hides the thing that matters most: where did the system improve, and where did it regress?
Splits make experiments more targeted, more honest, and easier to compare over time.
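A small sketch of both halves of this idea: assigning examples to splits deterministically, and reporting scores per split instead of one aggregate. The proportions and function names are illustrative assumptions, and a hard-examples split would normally be curated by hand rather than hashed.

```python
import hashlib
from collections import defaultdict

SPLITS = [("dev", 0.6), ("validation", 0.2), ("test", 0.2)]  # illustrative proportions


def assign_split(example_id: str) -> str:
    """Deterministically map an example ID to a split via a stable hash."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    cumulative = 0.0
    for name, fraction in SPLITS:
        cumulative += fraction
        if bucket < cumulative:
            return name
    return SPLITS[-1][0]  # guard against floating-point rounding


def mean_score_by_split(results):
    """results: iterable of (split, score) pairs -> {split: mean score}."""
    scores_by_split = defaultdict(list)
    for split, score in results:
        scores_by_split[split].append(score)
    return {split: sum(scores) / len(scores) for split, scores in scores_by_split.items()}
```

Hashing the ID keeps an example in the same split across runs, so a prompt tuned on dev never quietly leaks into test when the dataset grows.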
Old ML discipline, very practical LLM engineering pattern.
https://t.co/cfbeRMpV2E
Who judges the evaluators?
When you use LLM-as-a-judge, you’re trusting a model to decide whether your agent, workflow, or prompt did the right thing.
But that raises the obvious question: how do you debug and evaluate the judge?
In Arize Phoenix, every evaluator run is automatically traced via OpenTelemetry and sent to a dedicated Phoenix project. That means you can inspect exactly how your evaluator made its decision:
→ the input data
→ the exact prompt sent to the judge LLM
→ the model’s reasoning
→ the final score
→ execution timing, token usage, and cost
This is especially useful if you have a production agent because your evals need to evolve as well. It becomes increasingly important to check for systematic evaluator bias and to align evaluation with human judgment.
In the same way that your agent must improve over time, so must your evals.
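One common way to check alignment with human judgment is to compare judge labels against a hand-labeled sample using Cohen's kappa, which corrects raw agreement for chance. A minimal sketch; the function is generic and not part of Phoenix.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label sequences must be non-empty and the same length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    # Agreement expected if both raters labeled independently at their own base rates
    expected = sum(judge_counts[label] * human_counts[label] for label in judge_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

A kappa near 1 means the judge tracks human labels closely; a value near 0 means it agrees no more often than chance, which is a signal to inspect the judge's traces and revise its prompt.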
Coding agents are growing ~10x year over year.
But most are still operating blind.
If agents are shipping to production, they need to be grounded in real infrastructure:
* Traces → what actually happened
* Evaluations → what’s correct vs broken
* Feedback → what to fix next
This is the shift to Harness Engineering.
Instead of guessing, coding agents:
* Pull context from systems like Phoenix
* Debug using real runtime data
* Propose fixes → ship → re-run → learn
The teams that win won’t have the best prompts.
They’ll have the best feedback loops.
Don’t trust coding agents—make them prove their work.
https://t.co/YTzRkf826k
A heavy shipping week for Phoenix (04.29 → 05.05).
Eight items landed. The four worth pulling out for anyone running production eval pipelines:
↓
→ Provider tools in Playground & Prompts
Anthropic, OpenAI Responses, Gemini, and Bedrock built-in tools — web search, code execution, computer use, grounding — now run end-to-end through Phoenix.
No internal allowlist. Paste the provider's tool JSON and Phoenix round-trips it verbatim. New provider tools work the day they ship.
Docs → https://t.co/q3PPBCCUoE
→ Filter-based annotation DELETE
Three new endpoints bulk-remove annotations by identifier, name, annotator_kind, or time range.
This closes a real loop for automated eval pipelines — tag on creation, roll back later. Until now you'd accumulate annotations with no surgical way to clear them.
REST API → https://t.co/8lS7Gkrny1
→ CLI named auth profiles
px profile create / use / list / show / edit / delete
Bundle endpoint + project + API key under a name; switch between Phoenix instances without re-exporting env vars. Existing PHOENIX_HOST scripts keep working.
CLI docs → https://t.co/UwyTevJHxC
→ Dataset upsert (breaking)
client.datasets.create_dataset() now defaults to upsert. Re-running with the same name merges examples into a new dataset version instead of returning 409. Pass stable id fields for deterministic in-place updates.
Datasets API → https://t.co/HzS2SIJiCd
⸻
Also landed: TanStack AI tracing middleware (Integrations → https://t.co/hqc3SulDlD), evals runtime capability detection for o1/o3/o3-mini/o4-mini, sessions annotation parity with spans/traces, and token counts + experiment metadata in REST.
Versions: arize-phoenix 14.17 → 15.4 · client 2.6 · evals 3.1 · otel 0.16.1
GitHub releases → https://t.co/GDkw5yxWfx
Full release notes → https://t.co/I2aF193QEA
Different ways to deploy Arize Phoenix — from local dev to production scale.
Phoenix is a lightweight, containerized platform for LLM tracing and evaluation. How you deploy it should match your team size, scale, and reliability needs.
Here are the most common patterns:
1. Single instance (SQLite)
Best for getting started quickly or local experimentation. No setup required.
2. Production with PostgreSQL
Recommended for teams. Enables concurrency, durability, and better performance.
3. Horizontal scaling
Run multiple Phoenix instances behind a load balancer, all connected to the same database.
4. Environment isolation
Separate instances for dev, staging, and prod to keep data clean and workflows safe.
5. Schema isolation
Share a single PostgreSQL database while isolating teams via schemas.
6. Per-developer setup
Each engineer runs Phoenix locally for fast iteration without impacting others.
7. Application sidecar
Deploy Phoenix alongside your app (e.g. Kubernetes pod) for simplified networking and lifecycle management.
Phoenix is single-tenant by design, so scaling across teams typically means deploying multiple instances.
Get started in minutes:
https://t.co/0ZIR3aNcmj
Explore deployment options:
https://t.co/voYsBVPmIW
If you’re running LLM apps in production, your observability stack matters. Phoenix is designed to scale with you.
Phoenix 15.4 now supports provider tools
A concrete example of why provider tools matter for grounding and tool-call quality.
Same prompt — "Who won Euro 2024?" — run side-by-side in Phoenix Playground against gemini-2.5-flash-lite. The only difference: one side has Gemini's google_search grounding enabled.
Without it: the model says the tournament hasn't been played yet, citing dates from its training data.
With it: Spain beat England 2-1 in the final.
The model itself didn't change. What changed is whether it had access to a hosted capability that reaches outside its training cutoff. These provider tools — things like google_search, web_search, file_search, code_interpreter — aren't things you can polyfill at the application layer. They live inside the provider.
Which is why being able to round-trip the exact vendor tool payload through Phoenix matters: pull the config off a production trace, paste it into a prompt, reproduce the behavior in the Playground. Easy A/B between grounded and ungrounded variants. Easy regression testing when a vendor ships a new built-in.
Phoenix now supports vendor tools across all major providers:
Anthropic: web_search, web_fetch, code_execution, tool_search_tool, bash, text_editor, computer, memory
OpenAI Responses API: web_search, file_search, code_interpreter, computer, tool_search (with deferred loading)
Google Gemini: google_search grounding
Amazon Bedrock: Nova web_grounding
Small change in tool config, large change in output quality.
Docs: https://t.co/ig2RuwAuOz
Release: https://t.co/L8wtRRhJMM
LLM applications are noisy by default. Non-determinism, open-ended inputs, and multi-step tool use create a flood of traces that all look “fine” — until something breaks.
The difference between noise and signal is annotation.
→ Inline annotations capture high-fidelity signals at the moment of inference: user feedback, guardrails, retries, cost, latency. This is your fastest path to narrowing the search space.
→ Post-hoc annotation turns observations into structure:
• Open coding surfaces real failure modes
• Axial coding organizes them into a shared taxonomy
• Evals scale that taxonomy into measurable, repeatable signals
Each layer builds on the last. Skip one, and everything above it weakens.
The result isn’t just better debugging — it’s a system your team can actually understand and improve.
Annotations aren’t overhead. They’re the mechanism that turns LLM systems into something operable.
Learn more about Phoenix: https://t.co/rsKeegSVwl