Microsoft open-sourced RAMPART (Risk Assessment and Measurement Platform for Agentic Red Teaming) this week, and I'm already getting Slacks from platform leads about wiring it into GitHub Actions next to their pytest suite.
RAMPART is a real step forward. It's pytest-shaped, slots into CI, and runs adversarial scenarios against AI agents on every code change. For design-time safety reviews, it's the right shape.
The mismatch shows up when those agents actually deploy. They run inside Kubernetes with cluster-bound service accounts and NetworkPolicies the CI runner doesn't have access to.
Passing RAMPART in GitHub Actions tells you how the agent behaves in a sandbox. The cluster it actually runs in is a different question.
An AI workload that fails in prod after passing in CI doesn't show up as a CI problem on a DORA dashboard. It shows up as a Change Failure Rate spike on the deployment that came two days later.
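For anyone who wants the arithmetic, here's a minimal sketch of Change Failure Rate as DORA defines it: deployments that caused a production failure divided by total deployments. The `Deployment` record and its `caused_incident` field are hypothetical names for illustration, not fields from a real dashboard.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    # Hypothetical record; field names are illustrative only.
    service: str
    caused_incident: bool


def change_failure_rate(deployments):
    """Share of deployments that led to a production failure (0.0 to 1.0)."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d.caused_incident)
    return failed / len(deployments)


if __name__ == "__main__":
    history = [
        Deployment("agent-api", False),
        Deployment("agent-api", False),
        Deployment("agent-api", True),  # the deploy that shipped after CI went green
        Deployment("agent-api", False),
    ]
    print(f"CFR: {change_failure_rate(history):.0%}")
```

Note that nothing in the record says whether CI passed. The metric only sees the deployment, which is why the miss never gets attributed to the test environment.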
A platform team I worked with last quarter moved their AI-agent validation inside the cluster. Same tests, running against real cluster identity and real mesh routing. False-pass rate dropped to near zero because the test environment finally matched production.
Adversarial testing of agents only counts if it runs where the agent runs.
#Kubernetes #PlatformEngineering #AIAgents #DevSecOps
Your homegrown test runner is one vacation away from being unmaintainable.
Same pattern in three or four enterprise platform teams I've worked with this quarter. One engineer wrote the bash glue around Jenkins, owns the RBAC sidecar nobody else fully understands, and maintains the YAML that wires test runs into the deploy gate.
The team calls the platform "free" because the licensing line is zero. The loaded cost of a senior platform engineer running maintenance doesn't show up anywhere on the engineering P&L.
If that person leaves, the cluster keeps running. The tests don't.
At Testkube we see this most often in orgs that scaled past 500 engineers before anyone wrote down how the test infrastructure actually works. The bus factor on a test platform is the cheapest risk nobody measures.
#PlatformEngineering #Kubernetes #DevOps
I had a call with a Director of Platform Engineering at a large retailer last week. They can ship 11 GenAI services into Kubernetes but can only redeploy two of them more than once a week.
Their infra is fine. GKE, gVisor sandboxes, model serving with autoscaling. CNCF just launched the Kubernetes AI Conformance Program and they'd tick most of its boxes.
The bottleneck is testing.
Every redeploy means re-validating against real K8s identities, real NetworkPolicy, real mesh routing under load. Their CI can't do that from outside the trust boundary. So every model push becomes a 2-hour bespoke validation sprint led by one staff engineer.
CNCF says 66% of orgs are running GenAI workloads on Kubernetes. Only 7% are deploying to production daily.
The platform layer is now AI-conformant. The validation layer still runs outside the cluster on infra it can't fully see. That's where deploy frequency dies.
Built a 3-min diagnostic that puts your team on the K8s testing maturity curve and quantifies what the gap is costing per year. Link in the first comment if you want it.
#kubernetes #platformengineering #devops #ai
Every integration test that mocks the network is testing your imagination, not your system.
A mock is your best guess about what a dependency does. The retry logic. The timeout. The auth handshake. The sidecar that rewrites a header you forgot was there. Production doesn't read your guesses.
So the suite goes green, the deploy ships, and the first time your code meets real mesh routing and real RBAC is in prod. With customers on it.
And the more faithful the mock looks, the more confident you get in a model of your system that nobody ever checked against the system.
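A minimal sketch of what that looks like, using Python's standard `unittest.mock`. `fetch_price` and the `/prices/` endpoint are made up for illustration. The test goes green because the mock returns exactly what the test told it to return. Nothing here touches a real network, sidecar, or auth handshake.

```python
from unittest.mock import Mock


def fetch_price(client, sku):
    """Hypothetical function under test: calls a pricing dependency."""
    response = client.get(f"/prices/{sku}", timeout=2)
    if response.status_code != 200:
        raise RuntimeError(f"pricing service returned {response.status_code}")
    return response.json()["price"]


def test_fetch_price_with_mock():
    # The mock encodes our guess: a 200 with a well-formed JSON body.
    client = Mock()
    client.get.return_value = Mock(
        status_code=200,
        json=Mock(return_value={"price": 42}),
    )
    assert fetch_price(client, "sku-1") == 42
    # This only checks that we called the mock the way we assumed we should.
    client.get.assert_called_once_with("/prices/sku-1", timeout=2)


if __name__ == "__main__":
    test_fetch_price_with_mock()
    print("green")
```

Both assertions compare the code against the test author's own assumptions. Neither one asks the real dependency anything.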
Kubernetes makes it sharper. The things most likely to break (service-to-service identity, NetworkPolicy, mesh retries, in-cluster DNS) are exactly what a mock can't reproduce. You can't fake topology.
Mocking harder won't save you. Running the test where the workload actually runs will, with the same identity and routes it gets in prod. That's the bet we made building Testkube.
A green suite full of mocks proves your assumptions agree with each other.
It says nothing about whether they're true.
#Kubernetes #PlatformEngineering #DevOps #SoftwareTesting #SRE
A platform team I talked to last week was quoted six figures to load-test a single Black Friday weekend.
Billed per virtual-user-hour by a SaaS vendor, to generate traffic from outside their cluster.
That quote was really paying for one thing: a way to push load across the trust boundary. Firewall holes, privileged tokens, and a run that hits a staging copy instead of the system that actually serves customers.
They ran it the other way. k6 inside their own Kubernetes cluster, on compute they already pay for. Real routing, real mesh, real pod scheduling under pressure. About 20 ephemeral pods, spun up for the run and gone after.
Roughly 70% cheaper than the quote. The bigger win was the signal: the cluster behaved under load the way it would on the actual day, because it was the actual cluster.
This is the pattern we keep seeing at Testkube. The expensive part of load testing in CI or on a SaaS platform was never the tooling. It was paying to simulate a system you could have tested in place.
The vendor quote was the easy thing to turn down. The harder part was admitting they'd been buying confidence in an environment that never matched production anyway.
#Kubernetes #DevOps #PlatformEngineering #LoadTesting #CloudNative
Your GitHub Actions runner has more privilege inside your Kubernetes cluster than any production workload.
Think about how it ended up that way. To run tests against the cluster, CI needs cluster-admin credentials or close to it. The secret gets minted, dropped into a runner env var, and shipped out to a build environment that lives outside the trust boundary. That runner can now do more inside your cluster than the pod it's testing.
Your production workloads run with workload-bound identity, RBAC scoped to their namespace, mesh policy, NetworkPolicy. Your CI runner runs with a kubeconfig.
The credential is a symptom. The real architectural problem is that test execution lives outside the trust boundary it's supposed to validate. Anything CI sees is a synthetic view of the cluster through a privileged tunnel.
We built Testkube to put execution inside the cluster. Tests run as in-cluster workloads with the same identity, RBAC, and mesh path as the things they're testing. The CI runner stops being a privileged outsider. It becomes a trigger.
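To make "same identity" concrete: Kubernetes mounts a pod's service account credentials at `/var/run/secrets/kubernetes.io/serviceaccount/` by default. The sketch below illustrates reading that workload-bound identity from inside a pod. It is not Testkube's implementation, and `read_workload_identity` is a name I made up for it.

```python
from pathlib import Path

# Default mount point for a pod's service account credentials.
SA_DIR = Path("/var/run/secrets/kubernetes.io/serviceaccount")


def read_workload_identity(sa_dir=SA_DIR):
    """Return the pod's namespace and token, or None when not running in a pod."""
    sa_dir = Path(sa_dir)
    token_file = sa_dir / "token"
    namespace_file = sa_dir / "namespace"
    if not (token_file.is_file() and namespace_file.is_file()):
        # Typical for a CI runner: no mounted identity, so it needs a kubeconfig.
        return None
    return {
        "namespace": namespace_file.read_text().strip(),
        "token": token_file.read_text().strip(),
    }


if __name__ == "__main__":
    identity = read_workload_identity()
    if identity is None:
        print("outside the cluster: no workload identity available")
    else:
        print(f"running as a workload in namespace {identity['namespace']}")
```

On a CI runner this returns `None`, which is the whole point: the runner has to be handed a credential, while an in-cluster test already has the scoped one the platform issued.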
Most platform teams already know this. They haven't fixed it because the fix means changing where tests run, and nobody has budget for that conversation yet.
#Kubernetes #PlatformEngineering #DevOps #DevSecOps
Watched a senior engineer run a Kubernetes regression suite from inside Cursor yesterday.
They typed: "Run the smoke tests for the checkout service against staging." Got back a list of failed assertions, the full pod logs, and a one-paragraph root-cause summary. Eight seconds.
No tab switching. Nothing pasted into a CI config. No pipeline spin-up.
This is the part of testing nobody is talking about yet. AI dev tools are pulling cluster context, test results, and failure artifacts into the IDE through MCP.
The IDE becomes the interface for both code and the cluster.
Two things shifted in their workflow. The time to write and validate a fix dropped because the feedback loop was tight. The pressure to "just push it and see" went away because the local result was actually accurate.
CI still runs the merge gate. But the iteration loop moved out of the pipeline and into the editor, sitting on top of real cluster state.
If your developers are pasting kubectl output into ChatGPT to debug failing tests, you're already paying for this workflow. You just don't own it.
Is the IDE pulling test context for your team yet, or are devs still bouncing across five tabs?
#kubernetes #devtools #platformengineering #aitools
The expensive part of a flaky test isn't the rerun.
It's the decision you make six months later, when a test fails and you ship anyway because "it's probably just flaky."
I sat with a platform team last week walking through their CI history. 42% of their Q1 failed builds were marked "rerun, passed." Nobody investigated.
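That number is easy to compute if your CI exports build outcomes. A rough sketch, assuming each failed build carries a resolution label. The labels and the `rerun_passed_share` helper are hypothetical, not any CI system's schema.

```python
from collections import Counter


def rerun_passed_share(resolutions):
    """Fraction of failed builds whose resolution was 'rerun, passed'."""
    if not resolutions:
        return 0.0
    counts = Counter(resolutions)
    return counts["rerun, passed"] / len(resolutions)


if __name__ == "__main__":
    # Illustrative data shaped to match the 42% figure, not the team's real history.
    failed_builds = ["rerun, passed"] * 42 + ["fixed"] * 40 + ["reverted"] * 18
    share = rerun_passed_share(failed_builds)
    print(f"{share:.0%} of failed builds were rerun until green")
```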
Then a real failure slipped through. P1 incident, 6 hours of recovery. The test had been failing for nine days. Everyone assumed it was the same flake.
That's the actual cost. Not the wasted compute. Not even the engineer time. It's the slow erosion of trust in your test signal, which means every failure becomes ambiguous, and every ambiguous failure trends toward "ship it."
Most teams treat flakes as a hygiene problem. Add retries. Quarantine the worst offenders. Move on.
But the underlying issue is usually environment. Tests passing on a CI runner with mocks, no service mesh, no network policies. Then failing intermittently in the actual cluster, because of conditions the CI never sees.
You can't fix a trust problem with retries.
#Kubernetes #DevOps #PlatformEngineering #TestAutomation
Spent the last few weeks mapping what good looks like for K8s testing across enterprise platform teams.
Most teams running 30+ clusters in production are still at level 2 of 5.
The L2 pattern looks like:
Regression cycle measured in days
CI runner fleet costs more than the apps it tests
Nobody can answer "what changed since the last green build" without digging through six dashboards
Security finds out about CI's cluster credentials during the next audit, not before
L4 looks different. Tests run inside the cluster on real identity, real routing, real mesh paths. Regression cycles drop from days to hours. CI compute spend drops 40%, and the platform team stops being the bottleneck on every release.
The technical work to get to L4 is well understood. The real blocker is funding. Engineering leaders consistently underestimate the annual cost of staying at L2 by 5x to 10x.
Built a 3-min K8s Testing Maturity diagnostic. 10 questions, no email, scores your team L1-L5 with a real dollar figure on the gap. Comment "level" or DM and I'll send it.
#Kubernetes #PlatformEngineering #DevOps #Testing
Shift-left was the right answer to the wrong question.
It told us when to test. It never told us where.
Earlier feedback was a real win. Faster signal saved real bugs. But the architecture problem stayed put. Every test still ran outside the system it was meant to validate. Outside the cluster, far from the RBAC and NetworkPolicies and sidecars the app actually depends on.
You can ship a unit test in 200ms and still be six layers of architecture away from production. The test passes, the release ships, and something breaks at 3:14am that no environment outside the cluster was ever going to catch.
The next move is location, not timing. Test inside the system you're shipping.
That means service identity bound to the actual workload. Mesh routing that resolves the way it does in prod. Topology pulled from live cluster state, not a YAML file pretending to be one.
In two years I think we look back on in-cluster testing the way we look back on containers. Obvious in hindsight. Late to most teams.
#kubernetes #devops #platformengineering #sre
The most expensive item in your engineering budget isn't on any line of it. It's the time your team spends triaging flaky tests, chasing environment access, and writing postmortems for failures CI couldn't catch. It never appears in the budget, but it eats real engineering hours.
Three calls in the last two weeks, same pattern. Engineering leaders can quote DORA metrics. Ask what the testing gap is costing in dollars and the room goes quiet.
Usually six figures. For a 1,000+ engineer org with K8s in production, mid-six figures is the floor.
Most teams sit between L2 (CI-bound) and L3 (Hybrid). They feel like L4. The dollar number on the gap is what makes the conversation real with finance.
If you're trying to make the K8s testing case to a CFO, you need both. Where you sit on the curve, and what the gap is worth.
Built a 3-minute diagnostic that does both. 10 questions, no email. Comment "level" or DM me and I'll send it.
#PlatformEngineering #DevOps #Kubernetes
Watched a team last week celebrate a "clean canary rollout" that took down checkout 90 minutes later.
Dashboards green. Canary metrics fine. Traffic ramp perfect.
Then prod started returning 500s on a code path nobody had run in cluster.
Canaries tell you something is breaking in production. They don't tell you whether the deploy is safe to ship.
If your release strategy is "watch the metrics during the ramp," you're running an incident drill. Hoping to catch the failure before customers do.
The deeper issue: nobody validated the new code path against the real environment before it shipped. CI ran unit tests in a sandbox. The integration suite never touched the actual cluster, so mesh routing, RBAC, network policies, and sidecars were all unknowns at deploy time.
Canaries are valuable. They were never built to be a testing strategy.
What's the worst production failure your canary missed?
#Kubernetes #SRE #DevOps #PlatformEngineering #CloudNative