Side-by-side benchmarks beating @OpenAI Codex computer-use using their own models! 👀
Round 1: Clip a YouTube video from our channel and upload it to TikTok
✅ https://t.co/AAsaZBioXq + GPT 5.4 + our computer-use-kit: Successfully uploads a clip with subtitles and a hook after 16 minutes (and works on iPhone/Android)
❌ Codex + GPT 5.4: Gets the clip format wrong 3 times, asks for human intervention, and finally fails after 21 minutes.
Codex actually does try iPhone mirroring and CapCut, which is very cool and kudos to the team, but it ultimately fails after burning credits.
@sama this is not easy to do, but happy to help you guys integrate our computer-use-kit. 😀
I co-founded the first startup approved by OpenAI to sell GPT-3 for automation in August 2021 (Cheatlayer), months before https://t.co/aW1Cd9vOWN, so I've been working on this for a long time.
We automate Mac/Windows/Linux/Chrome/Android/iPhone + @daytonaio sandboxes and @browserbase cloud browsers out of the box.
We also just shipped automated benchmarks, so we're building the most comprehensive computer-use benchmark for long-running tasks on the planet.
@bldinsilence_ @paulg Yes, basically, but that alone already exists, and current solutions don't employ a cache in the way you'd want them to. You need a few more tricks, like a model-based caching policy, and you need to define the harness so that it executes the code at each step. We'll be publishing more Monday.
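To make the caching idea concrete, here is a minimal sketch, not our actual implementation. Every name in it (`SolutionCache`, `should_reuse`, `generate`) is hypothetical. The caching policy is a plain function standing in for a model call, and each "step" is just a Python callable that the harness runs in order.

```python
from typing import Callable, Dict, List

# Hypothetical: one step of cached "code" maps a state dict to a new state dict.
Step = Callable[[dict], dict]


class SolutionCache:
    """Reuse previously generated steps when a policy says a cached task matches."""

    def __init__(self, should_reuse: Callable[[str, str], bool]):
        # should_reuse(new_task, cached_task) stands in for a model-based policy.
        self._store: Dict[str, List[Step]] = {}
        self._should_reuse = should_reuse
        self.hits = 0
        self.misses = 0

    def solve(self, task: str, generate: Callable[[str], List[Step]], state: dict) -> dict:
        steps = None
        for cached_task, cached_steps in self._store.items():
            if self._should_reuse(task, cached_task):
                steps = cached_steps
                self.hits += 1
                break
        if steps is None:
            # Cache miss: pay for generation once, then store the result.
            self.misses += 1
            steps = generate(task)
            self._store[task] = steps
        # The harness executes the code step by step, hit or miss.
        for step in steps:
            state = step(state)
        return state


def same_task(new_task: str, cached_task: str) -> bool:
    # Assumption: a trivial stand-in policy; a real one would ask a model.
    return new_task.strip().lower() == cached_task.strip().lower()
```

On a repeated task, `generate` is never called again; only the cached steps are executed, which is where the token savings would come from.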
@Something_xev Hi, no. I believe this to be an abusive way to get someone's attention, so I use an exponential-backoff rate-limiting policy for anyone who tags me and does stupid shit like this. Basically, every time you ask me I will take longer to respond, but I felt bad, so I will respond here.
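For the curious, a per-user exponential backoff can be sketched roughly like this. The class name, the 60-second base delay, the doubling factor, and the one-day cap are all illustrative assumptions, not the numbers I actually use.

```python
from collections import defaultdict


class MentionBackoff:
    """Per-user exponential backoff: each new ask waits longer than the last."""

    def __init__(self, base_delay_s: float = 60.0, factor: float = 2.0,
                 max_delay_s: float = 86400.0):
        # All defaults are assumed values for illustration only.
        self.base_delay_s = base_delay_s
        self.factor = factor
        self.max_delay_s = max_delay_s
        self._counts = defaultdict(int)  # asks seen so far, per user

    def next_delay(self, user: str) -> float:
        """Return the wait in seconds before replying to this user's latest ask."""
        n = self._counts[user]
        self._counts[user] += 1
        # Delay grows as base * factor**n and is capped at max_delay_s.
        return min(self.base_delay_s * self.factor ** n, self.max_delay_s)
```

The cap keeps the delay from growing without bound, and each user has an independent counter, so one persistent tagger does not slow down replies to anyone else.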
india's first female founder residency starts in less than 2 days.
cohort 0 · june 7-13 · bangalore
8 of the country's most exceptional builders.
get ready for @nova_residency :)
@QuantYang @paulg Basically, people have solved booking flights and solving Rubik's cubes thousands to millions of times over now, and we don't need to waste tokens regenerating those solutions. Most repeated use cases in the world follow this principle.
@dunkhippo33 Yes!
Monday we're launching the first and only agent that can automate her own apps without integrations.
Meaning she can automate Uber, Doordash, Starbucks, and tasks that are impossible with Poke, Siri, etc.
So it solves the mirrored phone.
https://t.co/pOmaQ0xLgA
@yuris Monday we're launching the first and only agent that can automate her own apps without integrations.
Meaning she can automate Uber, Doordash, Starbucks, and tasks that are impossible with Poke, Siri, etc.
@mil000 Yeah, but some people here make like $1000+/hour, so opportunity-cost-wise it's actually worth it to shave minutes off thinking. ROI for saving 1 minute at that level is $16+ ($1000 / 60 min).
We're coordinating with multiple big KOLs for an epic launch next week, so stay tuned!
If we don't get at least a million views and go viral, I'll shave my head for our final demo day at Launch 😂