agentic UI QA system for mobile apps: computer-use loops that verify behavior on real Android and iOS flows, running against emulators
Cross-platform mobile UI testing is still hard for agents. Open to collaborating on stronger testing layers for tools like Devin.
inspector won hacker's choice + top 3 in devtools at uc berkeley AI hackathon
it's an agentic ui qa system: computer-use loops for verification, emulator harnesses for macos, android, and desktop apps
cross-platform ui testing is still hard for agents. open to collab on stronger testing layers for tools like devin.
stage demo below
#AI #AgenticSystems #QA #ComputerUse
Devin tests its work before you review the PR.
You review and approve the test plan.
Then get back a screen recording with a visual checklist of step-by-step QA.
heard david holz talk about building hard problems for the future to solve, but personally, I liked his idea of a frontier lab building a model on anime for fun and remaking all Madhouse anime like opm s2, ngnl
i hooked my whoop to my work calendar to find which coworker gives me the most stress
thanks to fable, I reverse engineered whoop to pull per-minute heart rate, and matched spikes with cal events and attendees
I now have a leaderboard and I think about it daily.
few info masked for obvious reasons ;)
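A rough sketch of how the spike matching could work, assuming the heart rate is already exported as per-minute `(timestamp, bpm)` samples and the calendar as a list of events with attendees. The function name `stress_leaderboard`, the 30-minute baseline window, and the input shapes are all assumptions for illustration, not the actual setup and not WHOOP's API.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def stress_leaderboard(heart_rate, events, baseline_window=timedelta(minutes=30)):
    """Rank attendees by average heart-rate lift during meetings they attended.

    heart_rate: list of (datetime, bpm) samples, roughly one per minute.
    events: list of dicts with 'start', 'end' (datetimes) and 'attendees' (list of str).
    The lift is mean bpm during the event minus mean bpm in the window just before it.
    """
    lifts = defaultdict(list)
    for ev in events:
        during = [bpm for t, bpm in heart_rate if ev["start"] <= t < ev["end"]]
        before = [
            bpm
            for t, bpm in heart_rate
            if ev["start"] - baseline_window <= t < ev["start"]
        ]
        if not during or not before:
            continue  # no samples to compare, skip this event
        lift = mean(during) - mean(before)
        for person in ev["attendees"]:
            lifts[person].append(lift)
    # Highest average lift first
    return sorted(
        ((person, mean(values)) for person, values in lifts.items()),
        key=lambda item: item[1],
        reverse=True,
    )
```

A meeting-level lift is attributed equally to everyone in the room, so a person who only shows up in large meetings will look noisier than someone from recurring 1:1s.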
1/ The current wave of "Fable 5 one-shot a full Minecraft clone" videos is the latest AI demo slop
YES, it is visually sick seeing the code synthesis of the world gen, meshing, lighting, physics, etc. But as a test of agentic capabilities? This is just noise. Here's why this framing is broken and what actual evals look like, esp. when full repos of Minecraft clones are already public
2/ Minecraft is heavily documented, and training data is saturated with it. Open source references like Minosoft have been public for years. https://t.co/KmjDug5Rwd
when you tell a model "recreate minecraft", the model is just recombining patterns and public code, not demonstrating robust agent behaviors. using an actual open client as a reference would be the more serious move if you want protocol-accurate agents or bots.
3/ This setup is one-shot code generation, not agentic behavior in any meaningful sense. Real agentic systems require:
- Long-horizon planning and decomp.
- Live perception-action loops with feedback
- Error detection and recovery
- Tool use or skill accumulation over time (continual learning)
- Generalization under shift (new worlds, tasks, perturbations)
A single forward pass that spits out a self-contained web app (even a solid modular one with ~15 source files handling noise, chunk meshing + AO/lighting, player physics/raycasting, biomes/caves/ores, procedural audio, etc.) doesn't test any of that.
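To make the contrast concrete, here is a toy sketch of a fixed one-shot plan versus a perception-action loop with error recovery. The environment `FlakyCounterEnv` and both runner functions are hypothetical and deliberately tiny; they only illustrate the loop structure, not any real agent framework.

```python
class FlakyCounterEnv:
    """Hypothetical toy environment: drive a counter to `target`.

    Every `fail_every`-th action raises instead of taking effect.
    """

    def __init__(self, target, fail_every=3):
        self.target = target
        self.value = 0
        self.calls = 0
        self.fail_every = fail_every

    def observe(self):
        return self.value

    def act(self, action):
        self.calls += 1
        if self.calls % self.fail_every == 0:
            raise RuntimeError("action failed")
        self.value += 1 if action == "inc" else -1


def run_one_shot(env):
    """Emit a fixed plan up front and never look at the environment again."""
    plan = ["inc"] * env.target
    for action in plan:
        try:
            env.act(action)
        except RuntimeError:
            pass  # no feedback loop, so the failure goes unnoticed
    return env.observe() == env.target


def run_agent_loop(env, max_steps=50):
    """Observe, decide, act, and recover from failed actions until done."""
    trace = []
    for step in range(max_steps):
        obs = env.observe()  # perception
        if obs == env.target:
            trace.append({"step": step, "obs": obs, "event": "done"})
            return True, trace
        action = "inc" if obs < env.target else "dec"  # re-plan from the observation
        try:
            env.act(action)
            trace.append({"step": step, "obs": obs, "action": action, "event": "ok"})
        except RuntimeError as err:
            # error detection: record it and retry on the next iteration
            trace.append({"step": step, "obs": obs, "action": action, "event": f"error: {err}"})
    return False, trace
```

With `target=4`, the one-shot plan silently loses its third action and ends one short, while the loop notices the unchanged observation and retries.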
4/ If the goal is to showcase real model progress on complex/ambitious coding or agentic tasks, the bar should be higher than just a video showing the demo:
- Release the full output and show the traces
- What was the model "thinking"? Any explicit planning, self-critique, decomposition steps, or reasoning before/during generation? (ReAct-style, chain-of-thought, or whatever scaffolding was used.)
- Link runnable artifacts (not just a video).
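As one possible shape for released traces, here is a minimal sketch that stores each step as a JSON line with the model's thought, action, and observation. The `TraceStep` fields and the helper names are assumptions for illustration, not an established trace format.

```python
import io
import json
from dataclasses import asdict, dataclass


@dataclass
class TraceStep:
    """One step of a hypothetical agent trace."""

    step: int
    thought: str  # planning or self-critique text, if any
    action: str  # tool call or code edit that was issued
    observation: str  # what came back from the environment


def write_trace(steps, fp):
    """Write one JSON object per line so traces can be diffed and streamed."""
    for s in steps:
        fp.write(json.dumps(asdict(s)) + "\n")


def read_trace(fp):
    """Load a JSONL trace back into TraceStep objects, skipping blank lines."""
    return [TraceStep(**json.loads(line)) for line in fp if line.strip()]


# Usage: round-trip a two-step trace through an in-memory file.
steps = [
    TraceStep(0, "need terrain noise first", "write noise.js", "file created"),
    TraceStep(1, "chunks render black", "fix lighting pass", "tests pass"),
]
buffer = io.StringIO()
write_trace(steps, buffer)
buffer.seek(0)
loaded = read_trace(buffer)
```

Line-per-step output keeps long runs easy to grep and compare across ablations.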
5/ "Show your traces, bro" isn't gatekeeping. it's the minimum for this space. Demos without process are entertainment and engagement farming. Traces, ablations, reproducible harnesses, and comparisons are how we actually learn where models improved and where they still hallucinate protocols, physics, or recovery strategies.
Let's point that capability at rigorous, open, long-horizon agent benchmarks instead of another round of saturated Minecraft re-creations.
Claude Fable 5 (max), Minecraft in HTML test
Really, really good result, it even added background music
I don't know the exact cost, but it was around $30
this benchmark dropped the same day as Claude's Fable 5. there was so much hype around Fable 5 that the benchmark kind of faded into the noise. Opus 4.7 needs up to 49 hours just to run the hardest tier of Agents' Last Exam and still only passes 2.6% of the tasks.
this new benchmark was built from 1,490 real professional workflows across 13 industry clusters and 55 subfields. over 300 domain experts contributed actual projects from their own work and tasks that normally take humans days or weeks to complete. Agents get full computer access (GUI + CLI) and are scored objectively on whether they deliver the correct final output.
Introducing Agents' Last Exam (ALE)
Built by 300+ domain experts from 100+ institutions
Covering 55 industry domains
Claude Opus 4.8 has 0.0% pass rate on the hardest subset
Glad to have contributed to this benchmark
there's a lot of research on benchmarking/evals: cheatsheet icl, data contamination, and agentic tool use retrieving answers. all of these abstract away thinking and planning in models. a guide doesn't remove planning, but it serves as a point of reference for the model to rely on. pathfinding in multi-screen areas is also affected, since the model can reference guide maps instead of learning the layout itself. when a model has access to a guide it doesn't need to explore or build an internal map of the environment, it just looks up where to go. unknown games provide a more useful way to understand how agents plan in a game environment. i think planning within these envs is the biggest part. if the computer use agent were given the ability to move toward all parts of the env, it could plan and understand the limits of the env. without a guide the agent is forced to actually learn the space rather than retrieve a known path. I have a lot of pokemon guides and a lot of them have images. even easter eggs that are unknown to agents could change the way the agent plans, since it has no reference point to fall back on. if the agent never saw that part of the game in training, it has to genuinely reason about it, which is exactly the kind of planning we should be testing for.
Thoughts?
https://t.co/8606pdt3U8
https://t.co/vosOUv54hv
@TTrimoreau you will never be able to understand your full stack, or won't have the time commitment to understand anything, and will likely have a lot of hallucination issues
depending on your "unicorn". there was a guy selling peptides who was doing 9 figures in arr.