Your AI agent passed 41 unit tests. The eval suite was green. Your teammate merged and went home.
By Monday it was quietly denying refunds it should approve — in ~1 of 20 real conversations. No error. No alarm. Nothing "failed."
Why green CI lies about agents 🧵
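Quick back-of-envelope on why a green run can miss a 1-in-20 failure. This is a sketch under a generous assumption that is not in the thread: that each of the 41 tests behaves like an independent sample of real traffic. Real suites are usually narrower than that, so the true odds of missing it are likely worse. `chance_of_all_green` is a name made up for illustration.

```python
def chance_of_all_green(failure_rate: float, samples: int) -> float:
    """Probability that `samples` independent checks all pass
    when each one fails with probability `failure_rate`."""
    return (1.0 - failure_rate) ** samples


# Assumption: 41 checks, each an independent draw from real conversations.
p_green = chance_of_all_green(failure_rate=1 / 20, samples=41)
print(f"chance the suite is green anyway: {p_green:.1%}")
```

Under that assumption the suite goes fully green roughly 12% of the time even though the bug is live, so a single green run says little about a failure this rare.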
@adxtyahq production is where all your untested assumptions get billed at once. vibe coding the first draft is fine - but a smoke test on the money flow before shipping isnt a guardrail, its just math
@trikcode "your users find the bugs you never could" belongs on a wall. the fix isnt more AI, its the unglamorous part: a test pinning the behaviour you think you built. if you cant describe it in a test, the agent cant either - youre both just vibing til a user complains
@AnatoliKopadze "my job is to create loops" is the realest framing ive seen. the catch: a loop is only as good as its exit condition. if the agent loops against a flaky or weak test signal, it'll happily "converge" on broken code. good loops need a signal that tells the truth
@burkov the 28-min build is impressive. the other 28 days happen after. CI/CD is easy; the hard part is CI/CD that actually catches a regression instead of just going green. scaffolding got fast, trust still takes time
@KevinNaughtonJr rollback is a perfectly valid fix, dont let anyone shame you. the real question is boring: why did it reach prod green? rollback buys you time, a test that reproduces the bug buys you sleep. ship the rollback, write the test that turns "again" into "never"
@brankopetric00 docker only gives reproducible builds if you pin everything - base image by digest not :latest, lockfiles, fixed versions. half the "works in docker here but not in CI" cases are someone pulling a silently-updated :latest. equal misery everywhere lol
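One way to enforce the "digest, not :latest" rule is a tiny check that fails the build when a base image isnt pinned. Rough sketch only: `unpinned_base_images` is a hypothetical helper, the regex covers plain `FROM` lines (with an optional `--platform` flag) and will miss things like `ARG`-substituted image names, and the digest in the example is a placeholder, not a real image.

```python
import re

# Matches "FROM [--platform=...] <image>" and captures the image reference.
FROM_RE = re.compile(r"^\s*FROM\s+(?:--platform=\S+\s+)?(\S+)", re.IGNORECASE)
ALIAS_RE = re.compile(r"\sAS\s+(\S+)", re.IGNORECASE)


def unpinned_base_images(dockerfile_text: str) -> list[str]:
    """Return base images that are not pinned by sha256 digest."""
    stage_names: set[str] = set()
    unpinned: list[str] = []
    for line in dockerfile_text.splitlines():
        match = FROM_RE.match(line)
        if not match:
            continue
        image = match.group(1)
        is_stage = image in stage_names          # FROM <earlier build stage>
        is_scratch = image.lower() == "scratch"  # empty base, nothing to pin
        if not is_stage and not is_scratch and "@sha256:" not in image:
            unpinned.append(image)
        alias = ALIAS_RE.search(line)
        if alias:
            stage_names.add(alias.group(1))
    return unpinned


example = "\n".join([
    "FROM node:latest AS build",
    "FROM nginx@sha256:" + "0" * 64,  # placeholder digest
    "COPY --from=build /app /usr/share/nginx/html",
])
print(unpinned_base_images(example))
```

Run it as a pre-merge step and exit non-zero when the list is non-empty. It only covers the base image; lockfiles and fixed package versions still need their own checks.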
@0xlelouch_ the flaky test one hits home lol. but the 3hrs of avoidance is kinda rational - fixing flakiness is miserable because you cant reproduce it. dont fix it live, quarantine it and collect failures til the pattern shows. way less soul-crushing
@GergelyOrosz $50 a run sounds steep til you price the alternative: one regression reaching users. dont run full evals every time - gate on a small golden set per-PR (cheap, fast), full suite nightly. same safety, fraction of the bill
@dhh the local-vs-CI time gap is brutal and underrated. usually its not the tests, its everything around them - cold caches, reinstalling deps, zero parallelism. 15min->30s jump is almost always deps cached + parallel, not "we deleted tests." nice result
@catalinmpit right question. almost nobody asks it. the postmortem blames the junior, never the gap in coverage. add the test that wouldve caught it before you close the ticket, or youre just waiting for round two
@trashh_dev your local env is lying to you. CI is the honest one. 9/10 times it's hidden state - a file only on your machine, a service still running, or test order that happens to work locally. run in a fresh container once and watch half of them fall over
#6 is the silent career-killer because it compounds. ignore flaky tests long enough and the whole team stops trusting CI - then a real regression ships and suddenly it's "how did nobody catch this." the bug isnt what gets you fired, the eroded trust is. great list, painfully accurate lol.
@Prathkum exactly. "works on my machine" is a determinism problem, and AI is worse at it than we are - it'll confidently ship code that passes a green suite and dies on the one input nobody tested. closing that gap is judgment, and judgment doesnt autocomplete.
copy package.json + lock FIRST, npm install, THEN copy the rest. that way the install layer only busts when deps change, not on every source edit - docker caches it. shaves the 8m install off every build where you didnt touch dependencies. classic layer-order gotcha, bites everyone once lol.
solid breakdown. the thing the diagrams never show: CD only works if devs actually trust the CI. the moment flaky tests creep in, people start rubber-stamping red and "just re-running" - and that habit ships real bugs. continuous delivery is 20% pipeline, 80% trusting the green checkmark.
@alexxubyte great list. the one component everyone draws but nobody budgets for: the CI/CD pipeline staying trustworthy. a pipeline with flaky tests quietly becomes decoration
"ship garbage faster" - stealing that. ive seen the exact same thing: agent drops 600 lines across 30 files, every diff looks plausible, and theres no test to tell you which 3 of them quietly broke checkout. on a clean codebase the agent has guardrails, on a no-test legacy monolith it has vibes.
the cheapest fix before letting an agent loose: add a smoke test on the 2-3 flows that actually make money (login, checkout, whatever pays the bills). doesnt need to be pretty. it just needs to scream when the agent "refactors" something that worked. a foundation that says no is worth more than a smarter model.
Smart-quarantine = the test still runs and reports, it just doesn't gate the merge — so coverage stays, red stays trustworthy. And we watch the model's own drift (PSI) so it doesn't rot.
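For anyone who hasnt met PSI (population stability index): it compares how a value is distributed now against a baseline, bin by bin, as the sum of (actual − expected) × ln(actual / expected). The sketch below is the generic textbook formula, not Testhide's code. What exactly gets monitored, the bin count, and the `eps` floor for empty bins are all assumptions made for illustration.

```python
import math


def psi(expected: list[float], actual: list[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population stability index between a baseline sample and a new sample.

    Bins are equal-width over the baseline's range; values outside that
    range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width if baseline is constant

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]          # made-up baseline scores
drifted = [min(v + 0.3, 0.99) for v in baseline]  # same scores, shifted up
print(psi(baseline, baseline), psi(baseline, drifted))
```

A commonly quoted rule of thumb treats PSI under 0.1 as stable and over 0.25 as a real shift worth retraining for. Those cutoffs are convention, not something the thread states.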
Not blind rerun. Not raw history. Prediction, in-pipeline, self-hosted.
→ https://t.co/NCx66ayA4v
Tests fail. AI explains why.
Google ran the numbers: 84% of the pass→fail transitions in their CI involved a flaky test, not a real bug.
Flaky tests are the tax almost every CI pays and almost no CI fixes.
What flaky tests actually cost, and how to kill them — a thread 🧵
The better question isn't "re-run or mute?" — it's "is this red signal or noise?"
Testhide answers it with a Flakiness Predictor: gradient-boosted trees over 41 features that score each failure from its history (top signal: recent fail-rate) plus build context.
Real regression → block the PR. Noise → smart-quarantine. 👇
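The routing rule, sketched in a few lines. Everything named here is hypothetical: `Failure`, `flaky_score`, `route`, and the 0.8 cutoff are stand-ins for illustration, not Testhide's actual API or threshold. The point is only the shape: every failure is still reported, but only the ones that look like real regressions block the merge.

```python
from dataclasses import dataclass


@dataclass
class Failure:
    test_id: str
    flaky_score: float  # hypothetical predictor output: 0 = real bug, 1 = noise


def route(failures: list[Failure], threshold: float = 0.8) -> dict:
    """Split failures into merge-blocking and quarantined (reported, non-gating)."""
    blocking = [f.test_id for f in failures if f.flaky_score < threshold]
    quarantined = [f.test_id for f in failures if f.flaky_score >= threshold]
    return {
        "merge_allowed": not blocking,  # quarantined failures never gate
        "blocking": blocking,
        "quarantined": quarantined,     # still surfaced in the report
    }


result = route([
    Failure("test_checkout_total", flaky_score=0.05),
    Failure("test_websocket_reconnect", flaky_score=0.93),
])
print(result)
```

Where the cutoff sits is a trade-off: too low and real regressions get waved through as noise, too high and the flaky reds keep blocking merges.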