AI coding agents often lose context.
I built Debug-Ledger to fix this. Instead of letting AI guess, you set hardcore constraints it must respect.
Total control for developers.
https://t.co/4gI7d5M8UO
#AI#ParseOS#BuildInPublic
a python habit that quietly removed a class of bugs for me. the moment a dict picks up a third key that i keep typing slightly wrong, it becomes a dataclass. `user["emial"]` doesn't fail where you wrote it, it fails three functions later with a keyerror or a silent none, and you go hunting. a dataclass makes the typo fail at the attribute, with autocomplete that stops me making it at all. i'm not converting every dict, short-lived bags of data are fine as dicts. it's the ones that travel, get passed between functions, get keys read in five places. once it's load-bearing, the cost of a typo'd string key is a debugging session, and a dataclass turns that into an editor squiggle. cheap change, the bug just stops existing.
the change i made to claude code that cut my review time the most isn't a hook or a skill. before it edits anything, i ask it to list the files it's about to touch and one line on why each. then i read that list before a single line changes. half the time the list is longer than i expected, it wants to refactor a shared util "while it's there," or edit a config i didn't think was related. i cut those lines, tell it to stay in scope, and only then let it write. reviewing a list of five file names takes ten seconds. reviewing five files of already-written diffs takes twenty minutes, and by then i'm tempted to just accept it. moving the scope decision before the edit instead of after is the whole trick.
the worst agent runs i've had weren't the ones that failed fast. they were the ones that kept going. it would hit the same wall, retry a slightly different way, hit it again, and burn forty steps looking busy while making zero progress, and i'd only notice from the bill. so now i give every agent two things. a budget, in steps or tokens, and an instruction to stop and tell me when it's spent a chunk of it without the task getting closer to done. not "stop on error," errors are fine, it's the quiet churn with no error that costs the most. the report is the real fix. an agent saying "i'm a third through the budget and i'm stuck on the same import" is worth more than one that silently spends the whole thing and hands me a half-answer.
the orchestrator framing matches what i'm seeing too. the part that still needs me isn't permission, it's attribution. when the cheap sub-models do the actual edits, a wrong line is hard to trace back to which one wrote it. are you tagging output by sub-model, or reviewing the merged diff as one blob?
the swe-bench gap is real but it stopped predicting my day-to-day a while ago. what actually changed how much i trust a model wasn't the score, it was the failure mode, whether it fails loud or fails confident. a model two points higher that fails confident costs me more. does frontiercode capture failure style at all, or just pass rate?
the "looks good to me" isn't the bug, it's the symptom. the actual tell is it never re-ran the three tests it broke before saying that. i stopped asking it to review its own work and started making it paste the failing test output first. it can't write "looks good" over a red test it's staring at.
the roadmap tells me what we're building. it never tells me what we already decided not to. so the same "should we add x" question keeps coming back every few weeks, and someone re-argues it from scratch, and sometimes we flip purely because the last person to bring it up was loud. now i keep a second list next to the roadmap, the won't-build list, with one line of why for each. "no team accounts yet, support load before revenue." "no mobile app, the web flow isn't even good yet." when the idea comes back i point at the line instead of reopening the debate. it's not permanent, things move off the list when the why stops being true. but it stops me re-deciding the same thing on mood.
most features i've shipped died slowly instead of getting killed. nobody ever decided to stop, the thing just sat there half-used while we kept patching it. so now before i build anything that takes more than a few days, i write down the one number that would make me delete it. not the success metric, the kill number. "if fewer than x people use this in two weeks, it goes." i write it down before i start, because after i've built it i'm too attached to read the data honestly. the useful part isn't the threshold being right. it's that i've already agreed with myself what failure looks like, so when the data comes in i'm reading it instead of arguing with it. half my roadmap got shorter the week i started doing this.
the unit of work growing is real, but my review didn't scale with it. a 20 minute run i could replay in my head, a 2 hour run i can't. what helped was having it write down each decision it commits to as it goes, so what i review is the decision list, not 4M tokens of transcript. curious whether you review the long runs differently or mostly trust the end state plus checks.
the part that bit me after setup: the thinking level you set quietly sticks to the session and then you forget it's there. i judged fable as mediocre for half a day before noticing that window was on low. i keep the level pinned in a per-repo note now, dull but it works. do you run one default everywhere or change it per task?
useful, matches what i'm seeing. one thing i'd add from daily use: i stopped picking effort by task difficulty and started picking by how expensive a wrong answer is to catch. cheap-to-verify work gets low effort even when it's hard. also, "high goes wild" sounds like variance rather than capability. did spread across runs change in your tests, or just completion time?
a small trick that improved everything i write with ai help. before i act on a plan, send an email that matters, or hand instructions to anyone, i paste the text into a fresh chat and ask it to explain back what i'm asking for, in its own words. wherever it garbles, guesses, or fills in something i never said, that's exactly where my writing was vague. it's not testing the ai, it's testing me. the model can only work with what's actually on the page, same as the person who'll read it. costs thirty seconds. the plans that survive the explain-back unchanged are the ones that go smoothly, and i don't think that's a coincidence.
the first place agent-written frontends break for me is the back button. the flow works perfectly going forward, then i press back and land on a blank screen or a wiped form. the agent keeps everything in component state and nothing in the url, because it built the app in one continuous session and that's the only way it ever experienced it. refresh mid-flow, open the link in a new tab, share a step with someone, same failure. so my first manual test on any generated flow is deliberately boring. press back at every step, refresh in the middle, paste the deep link fresh. two minutes, catches more than another read of the code. the url is part of the interface, the agent treats it like decoration.
how i do coding challenges now without the agent eating the learning. i solve it myself first, ugly and slow, no help. then i hand the same problem to the agent and diff the two solutions line by line. the diff is the actual lesson. sometimes it knew a builtin i didn't. sometimes my version handles an input its clean-looking one drops. either way i learn something specific instead of just reading a polished answer i'd forget by friday. asking the agent first teaches me what it knows. solving first and diffing teaches me what i don't. same problem, same tool, completely different outcome depending on the order.
spent a week convinced my model had degraded in production. it hadn't. the serving path was computing one feature differently than the notebook did. training filled missing values with the median, the api filled them with zero. same model file, different inputs, quietly worse predictions for every row with a gap. nothing errored, nothing logged, the metric just sagged. now i log the exact raw feature vector at serving time, sampled, and diff its distribution against the training set weekly. it's the dullest job in the whole pipeline and it has caught more real problems than any architecture change i made this year. the model is one artifact. the inputs are rebuilt twice, and the second build is the one nobody reviews.
the quietest way i've overfit was never in the training loop. it was me, checking the validation score after every tweak. each peek is a tiny gradient step on the val set, taken by hand. tweak, check, keep what helped, repeat forty times, and the val score is now partly a measure of how hard i searched, not how good the model is. the test set told the real story and it was worse. what i changed: i decide the next three experiments before looking at anything, run all three, then look once. fewer peeks, bigger steps between them, and a held-out set i'm only allowed to touch when something is actually shipping. the model overfits data. i was overfitting the scoreboard.
@prz_chojecki the partial progress column is the one i'd weight hardest. in real work a model that gets 70% of the way and leaves a legible trail beats one that hands back something complete and confidently wrong. that gap never shows in the solved count.
remote work taught me to limit how much i take on in parallel. ai tools almost made me unlearn it. when delegating feels free, you queue five tasks at once, and for a week i did. generation was instant, but every finished task still needed me to check it properly, and i became the bottleneck i used to complain about in managers. five half-reviewed pieces of work, zero shipped. now i cap it at two agent tasks in flight, same as i'd cap my own. the limit isn't what the tools can produce, it's what i can verify with real attention. that's the part of the job that stayed mine, and the discipline looks exactly like the old remote rule: finish and hand off before you pick up the next thing.
the most useful question i ask an ai isn't before the task, it's after. when it says done, i ask what it skipped. what did you leave out, what did you assume, what would you flag if you were reviewing this. the answers are surprisingly honest. it will calmly list the edge case it didn't handle and the assumption it made about my data, things it never mentioned while delivering. i used to find those by hitting them a week later. the done message is written to close the conversation, the skipped list is where the actual state of the work lives. takes thirty seconds and it has changed what i trust on every tool i've tried it with.
changed how i review agent output and it caught more in a week than a month of normal reading. i read the deletions first. additions are where the agent looks impressive, deletions are where it breaks things. it removes a check it didn't understand, an edge case that looked redundant, a config line that mattered on one machine. the green lines get all the attention because they're the new feature. the red lines are where the old knowledge leaves the codebase. so now the first pass is only what got removed and whether i can say why it was safe. if i can't explain a deletion, that's the conversation with the agent, not the new code. second pass is the additions, and by then i actually know what changed.