@AndrewYNg The role you mention almost in passing, Harness Engineer, is the one I'd bet on. Models and vendors swap out. The harness persists: prompts, tools, skills, evals, subagent orchestration. The harness holds the workflows and it outlives whatever model sits underneath.
Spent months maintaining a model router in OpenClaw. One for ops, one for code, one for writing. This week I deleted most of it. Opus 4.8 clears the bar on all three. The routing was overhead pretending to be sophistication. Sometimes the upgrade is deleting the abstraction.
@KingBootoshi Same. I stopped working with Opus like an interactive UX. My Claw does the planning pass, then I run single-shot console work from the CLI: Opus drives the plan, Codex does the build. Slow model stays off the critical path; fast builder executes against the plan.
We built a version of this internally last week. An AI-writing detector that mines its own shipped output weekly, finds new tells the current rules miss, and promotes patterns after 2 weeks of recurring detection. Evals that update themselves are how you sustain quality as your own voice drifts. Applied at org-level skill scale.
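Rough sketch of the promotion step, in case it helps. Everything here is an assumption for illustration: the names `update_rules`, `streaks`, and `PROMOTE_AFTER_WEEKS` are hypothetical, and "2 weeks of recurring detection" is read as "seen in 2 consecutive weekly scans". Not the internal implementation.

```python
PROMOTE_AFTER_WEEKS = 2  # assumption: 2 consecutive weekly scans


def update_rules(rules, streaks, weekly_candidates):
    """Feed in one weekly scan and promote tells that keep recurring.

    rules: set of already-promoted patterns (mutated in place)
    streaks: dict of pattern -> consecutive weeks seen (mutated in place)
    weekly_candidates: set of tells found this week that current rules miss
    Returns the sorted list of patterns promoted this week.
    """
    # A candidate that did not recur this week loses its streak.
    for pattern in list(streaks):
        if pattern not in weekly_candidates:
            del streaks[pattern]

    promoted = []
    for pattern in weekly_candidates:
        if pattern in rules:
            continue  # already a rule, nothing to promote
        streaks[pattern] = streaks.get(pattern, 0) + 1
        if streaks[pattern] >= PROMOTE_AFTER_WEEKS:
            rules.add(pattern)
            del streaks[pattern]
            promoted.append(pattern)
    return sorted(promoted)
```

Resetting the streak on a missed week keeps one-off phrasing from creeping into the rule set; a looser version could count detections inside a sliding window instead.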
@billxbf Super cool release. I have been using OpenClaw RL, which does something similar, but it's great to have a multi-harness approach. Will give it a shot to compare: https://t.co/16XCHvPDdu
We’re moving from “agentic workflows with evals” to “environments where agents can practice.”
Tools, fake databases, real constraints, verifiable rewards.
Basically video game-like simulations for the messy systems and workflows that plague real enterprises.
This is how agents start learning actual work instead of just becoming workflow automation++.
https://t.co/CGwrBppumI
Regardless of whether this matches the hype, it shines a spotlight on Subquadratic Sparse Attention, and more startups and indie hackers (like me) will experiment with it. Interesting paper from the team as well: https://t.co/rQDsd2Ygtf
my first take, and a good lesson on good research epistemics here: what can we infer from ~82% SWE-Bench?
it’s possible that (1) they trained a new model, from scratch, that is unlike a regular transformer
but i’ve never heard of this company before, and checking their funding round they’ve only raised ~30M, so it’s unlikely they could afford to train an Opus/GPT-5/Kimi 2.6 level coding model from scratch right now
so this tells us that (2) they need to bootstrap off of an existing pretrained model, likely RL too, to get that performance!
this tells us they’ve taken a vanilla Transformer and modified the attention mechanism, likely finetuning/midtraining in a subquadratic attention method
it’s quite possible it doesn’t really work and that there’s some degeneracy to the method, or it’s just plain fake
but if it’s not, you could expect that given how long it takes to do weight surgery on big models (bigger changes to a pretrained model == longer mid training to recover performance), it’s a lightweight change
i’d lean towards something mostly leveraging existing attention key/value projections, like a fancy version of deepseek’s sparse attention paper, but it could also be some unique test-time KV compression, which would come with its own downsides
@steipete Wow! Half of those could be their own vibrant GitHub projects. I built a local version of askoracle myself but I'm going to swap to yours now. How do you handle task management for all this? Do you use Linear + Symphony? Multiple projects?
@thsottiaux /wiki shortcut: turn any repo into a living, source-linked project wiki. Architecture, setup, data flows, APIs, diagrams, and “how to change X” guides generated from the codebase. Similar vibe to Devin’s DeepWiki, but native in Codex.
This is the way. A PM agent with Linear that pulls context, plus a dev agent and a QA agent, gives way better results than a single agent having to load all the context and run all the tasks.
@cgeorgiaw Congrats on the launch. Can I pitch a challenge, and what does standing one up look like? The one I'd love to see: pre-symptomatic Parkinson's detection from voice. Biomarkers predict PD 3-5 years early. Consented speech corpus + biomarker model + years-to-symptom leaderboard.
@steipete and team continue to amaze. Reported a Bedrock streaming bug, fix shipped the next day. Just tested: Opus 4.7 streaming + extended thinking on AWS Bedrock, running clean as my main agent. Huge unlock for enterprise AWS users. This is shipping at AI speed!
interesting thing about minimax 2.5 is it's a smaller model
considering it's very usable it's a great candidate for home labs
also would love to see inference providers try to max out its tokens/s, they can probably do something crazy
@openclaw feature request: cron sessions persist indefinitely after completion — sessions from 20hrs ago still sitting at 200K tokens. would love a cron.sessionArchiveAfterMinutes config (like subagent archive). also model registry has opus 4.6 at 195K ctx but it's 1M now. love the project otherwise 🦞
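For the cron ask, a minimal sketch of the selection logic I have in mind. To be clear, this is hypothetical: `cron.sessionArchiveAfterMinutes` is the config I'm proposing, not something that exists today, and the session dict shape (`id`, `completed_at`) and the `sessions_to_archive` helper are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def sessions_to_archive(sessions, archive_after_minutes, now=None):
    """Return ids of sessions that finished more than N minutes ago.

    sessions: list of dicts with "id" and "completed_at" (timezone-aware
              datetime, or None while the session is still running).
    archive_after_minutes: the proposed cron.sessionArchiveAfterMinutes value.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(minutes=archive_after_minutes)
    return [
        s["id"]
        for s in sessions
        # Never archive a session that has not completed yet.
        if s.get("completed_at") is not None and s["completed_at"] <= cutoff
    ]
```

With a 60-minute setting, a cron session that completed 20 hours ago would be picked up on the next sweep, while running sessions and ones that just finished stay put.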