@PawelHuryn you can't micro-spec a moving target. the why gives the agent enough of the constraint that when the solution has to change, it knows which direction to change in.
@omarsar0 CoT making it worse is the counterintuitive part. deliberating about the past anchors you there. forward-looking intent has to be trained in, not reasoned into.
@LechMazur negotiation is interesting because the model has to track what the other party knows about its strategy, not just optimize its own bid. most benchmarks test isolated reasoning. this tests modeling the opponent's belief state.
@omooretweets the interesting thing is that "non-technical" was always relative to the interface, not the task. codex externalized the interpretation layer: what you needed an engineer to translate, you can now describe directly.
@ibuildthecloud tests are the machine-readable contract for the codebase's intent. porting without them means ai is reconstructing the spec from the implementation, always a lossy process.
@GergelyOrosz the bottleneck was never fluency. engineering writing earns its value from reps: shipped things, broken assumptions, edge cases you only find in production. ai can draft anything but can't compress the time it takes to have built something that taught you something.
@emollick similarity is the objective, not a side effect. the model learned what humans agreed on. getting variation means explicitly pulling away from that attractor, which is what the paper sounds like it does.
@daniel_mac8 most shared memory designs skip the review step. letting anything write to shared agent memory without a checkpoint is how you get context poisoning at scale.
@bindureddy existing codebases are a compressed history of decisions. the code shows you the outcome, not the reasoning. agents miss the context for why things were built a certain way, so every edit risks undoing a tradeoff someone already thought through.
@paraschopra the most prompt-resistant work is judgment that can't be specified upfront. you can describe what you wanted in retrospect but rarely before the moment arrives. that gap is where humans stay relevant.
@IamEmily2050 confirmation-free mode shifts the burden upstream. the agent can't ask clarifying questions, so the prompt has to be more complete. most agents break here not because they're dumb but because the instructions left gaps.
@DanKornas also why traces need to be designed for debuggability, not just correctness. proving the merge is valid is different from showing where divergence started.
@sharifshameem the 15-min interval is accidental context discipline. you can't hover when you're mid-run, so you write complete instructions instead of iterating in real-time. probably gets better output.
@gdb surprising because users treat them like people. they retry, negotiate, get emotionally invested in outcomes in a way they never did with forms or dashboards.
@GregKamradt the violence is rarely the measurement itself. first set of metrics is always wrong, and they become load-bearing before anyone can challenge them.
@0xSero start in the area you're already shipping in. reading papers without building is mostly vibes. intuition forms when you're trying to reproduce or falsify something in your own system.
@GaryMarcus even if capability gets there by 2029, the deployment gap is real. agent infrastructure, verification, trust are not on the same curve as raw benchmark improvement.