this is a cool workflow, but I'm here thinking, how's it different from an agent orchestrating through subagents, or for example claude agent teams that has communication between the agents?
Main benefit I see is you can continue off any of these threads and it's easier to read through each thread. But likely missing something here.
if you want an easier step before looping everything, recommend similar to Thomas here. Right now I'm using one orchestrator thread per project, which has collected a lot of context on my product goals and thinking.
The orchestrator delegates to other threads, similar to subagents, the Codex app has thread creation, read thread, worktree options, etc. Then you can decide on setting a small loop from your kicked off task, or just manually calling orchestrator to check after.
A lot easier to start off without burning your tokens. Plus I find the orchestrator flow saves a lot of time repeating myself on previous context.
The tokenization of everything continues.
With 173 new tokenized stocks and ETFs live across Ethereum, Solana and BNB Chain, Ondo Global Markets now brings 430+ of the world's most in-demand assets to users everywhere.
Powered by LayerZero.
https://t.co/KTlorK3oy0
$260B+ across 830+ OFTs on 170+ chains.
From memecoins to tokenized treasuries to state-issued stable tokens, builders of all shapes use LayerZero, configured exactly to suit their needs. Read more about the Builder Spectrum at the link below.
https://t.co/77sMqimExT
The world's most anticipated IPO, live onchain from day 1.
Starting today, Ondo brings SpaceX exposure to Ondo Global Markets across Ethereum, Solana, and BNB Chain.
Powered by LayerZero.
https://t.co/LEQ09ZwfM4
building a computer use mcp that works with any agent. Give your claude, devin, grok and any other agent computer use.
computer use helps with a lot, some faves:
- work across your computer in the background
- agent self verification and feedback loop
- handle annoying tasks on computer
- help parents fix their computers
Link below. Initial version and will likely break
was wondering why FrontierCode's results were so different from DeepSWE from @datacurve, plus a lot of the vibes on X seem to agree with DeepSWE on gpt 5.5 performing better than opus 4.8.
Turns out the benchmarks are grading differently. Asked codex and claude, and both agree that these benchmarks aren't directly comparable. FrontierCode judges more production readiness, and whether this code would be mergeable. DeepSWE is more judging if the agent can solve long horizon repo tasks with behavioral correctness.
FrontierCode might be better at judging enterprise code, and DeepSWE seems to be more focused on "will this work". Hopefully someone can do a deep dive comparison, maybe @theo?
Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.
Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?
Grok Build is improving crazy fast.
Wanted to check out the latest updates since I see their team shipping updates multiple times per week, props to them, and it already has most of the features I expect out of cli agents. Plugins, marketplace, btw, etc.
Also, with @elonmusk tweeting out that the 1.5T model is going through RL, it'll be very exciting to see how Grok performs after. Plus they seem to be the least compute constrained.
I think with automations (/loop), /goal, and computer use that'll cover most of the features devs look for in cli agents, other than harness perf.
Demoing some of my fave devx improvements, wish all cli agents had: /theme and double click to navigate to your last message. I know these are small features, but I think it shows the quality and devx details.
auto review -> approve for me.
I've switched from full yolo mode to auto approval awhile ago, and haven't noticed an impact on devx. Plus this gives me extra peace of mind.
Would be cool if @OpenAIDevs gave a breakdown of how it works behind the scenes, what triggers a permission check, how is that evaluated?
if companies continue to have AI spend problems I think model routing will be an important layer AI labs work on.
So far I only know @cursor_ai and @FactoryAI doing model routing, maybe @cognition?
Yes claude and codex let's you spin up agents/subagents with different thinking levels and models. But it's not the same, I'm thinking an enterprise option for "auto" thinking level, and companies can get a default model routing, or optionally customize their own model layer.
Switch between gpt auto switches between low, med, and even mini. Claude switches between opus, sonnet, haiku, high, med, etc.