@yunyu_l Neat benchmark. Two questions:
1. The accuracy is from the best of three runs. Do you define a run as the whole twelve-month span, or is each month considered a run?
2. How different does this look if accuracy is reported as an average of three? Was there substantial variance?
@snowmaker The top-scoring approach uses images only. If you consider other approaches that use the accessibility tree, it's essentially the same as the previous SOTA.
@emollick And if you subscribe to the beliefs of some folks that work at the big labs, then literally all work _except_ for research towards AGI is a waste of time.
@emollick A truly general agent would reshape the entire economy, full stop. Most folks (not just in the AI space) that continue onwards with building are implicitly betting that they still have quite a long time before that happens.
@levelsio The "no-bs" part doesn't seem to take into account the fact that they do bullshit all the time. Saying this as a former Tesla owner who was promised the ability to summon my car from anywhere way back in 2019.
With incremental model updates, you appear to receive more intelligence for just changing a few characters around in a string. The reality is that you almost always need to tweak your prompts to accommodate new idiosyncrasies.
With LLMs, there is truly no excuse for leaving out delightful little flourishes in your product. Fun animations, twee bits of whimsy, a splash of pizzazz — all achievable now with barely any effort. Have fun out there, folks.
@emollick Yeah, it makes sense, just kind of funny that he essentially invalidates the efforts of folks at OpenAI working on the GPT Store, SearchGPT, the macOS client, etc.