wlu @wlu314 - Twitter Profile

agentic UI QA system for mobile apps: computer-use loops that verify behavior on real Android and iOS flows, running against emulators. Cross-platform mobile UI testing is still hard for agents. Open to collaborating on stronger testing layers for tools like Devin.

wlu

@wlu314

3 days ago

inspector won hacker's choice + top 3 in devtools at uc berkeley AI hackathon
it's an agentic ui qa system: computer-use loops for verification, emulator harnesses for macos, android, and desktop apps
cross-platform ui testing is still hard for agents. open to collab on stronger testing layers for tools like devin.
stage demo below 👇
#AI #AgenticSystems #QA #ComputerUse

Cognition @cognition

5 days ago

Devin tests its work before you review the PR. You review and approve the test plan. Then get back a screen recording with a visual checklist of step-by-step QA.

wlu

@wlu314

4 days ago

holy fyck that sounds fun

Hiroo Onoda @OnodaCapital

4 days ago

bro wut

wlu

@wlu314

4 days ago

@Oriku175 holy ui

wlu

@wlu314

4 days ago

so many ambiguous posts about loop engineering but no specificity. genuinely so annoying, can we stop gatekeeping

wlu

@wlu314

4 days ago

@iyanmoonyang any advice on 2)? feel like I send out so many DMs but nothing comes out of it

wlu

@wlu314

7 days ago

heard david holz talk about building hard problems for the future to solve, but personally, I liked his idea of a frontier lab building a model on anime for fun and remaking all Madhouse anime like opm s2, ngnl

wlu

@wlu314

18 days ago

lmfao i need this to cortisol maxx please send this over

Pankaj

@the2ndfloorguy

19 days ago

i hooked my whoop to my work calendar to find which coworker gives me the most stress 🚨
thanks to fable, I reverse engineered whoop to pull per minute heart rate, and matched spikes with cal events and attendees
I now have a leaderboard and I think about it daily.
few info masked for obvious reasons ;)

wlu

@wlu314

18 days ago

@aleabitoreddit do you think the broader market is selling to free up liquidity for the anthropic, oAI and spacex IPOs

wlu

@wlu314

19 days ago

1/ The current wave of "Fable 5 one-shot a full Minecraft clone" videos is the latest AI demo slop.
YES, it is visually sick seeing the code synthesis of the world gen., meshing, lighting, physics, etc. But as a test of agentic capabilities? This is just noise.
Here's why this framing is broken and what actual evals look like, esp. when there's a full repo of a minecraft clone.

2/ Minecraft is so heavily documented that training data is prob. saturated with it. Open source references like Minosoft have been public for years. https://t.co/KmjDug5Rwd
When you tell a model "recreate minecraft", the model is just recombining patterns and public code, not demonstrating robust agent behaviors. Using an actual open client as a reference would be the more serious move if you want protocol-accurate agents or bots.

3/ This setup is one-shot code generation, not agentic behavior in any meaningful sense. Real agentic systems require:
- Long-horizon planning and decomp.
- Live perception-action loops with feedback
- Error detection and recovery
- Tool use or skill accumulation over time (continual learning)
- Generalization under shift (new worlds, tasks, perturbations)
A single forward pass that spits out a self-contained web app (even a solid modular one with ~15 source files handling noise, chunk meshing + AO/lighting, player physics/raycasting, biomes/caves/ores, procedural audio, etc.) doesn't test any of that.

4/ If the goal is to showcase real model progress on complex/ambitious coding or agentic tasks, the bar should be higher than just a video showing the demo:
- Release the full output and show the traces
- What was the model "thinking"? Any explicit planning, self-critique, decomposition steps, or reasoning before/during generation? (ReAct-style, chain-of-thought, or whatever scaffolding was used.)
- Link runnable artifacts (not just a video).

5/ "Show your traces, bro" isn't gatekeeping. It's the minimum for this space.
Demos without process are entertainment and engagement farming. Traces, ablations, reproducible harnesses, and comparisons are how we actually learn where models improved and where they still hallucinate protocols, physics, or recovery strategies. Let's point that capability at rigorous, open, long-horizon agent benchmarks instead of another round of saturated Minecraft re-creations.
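
To make the contrast in 3/ and 4/ concrete, here is a minimal Python sketch of a perception-action loop with error detection, bounded retries, and a recorded trace. Everything in it is hypothetical and for illustration only: `ToyEnv`, its `flaky_calls` parameter, and `run_agent` are made-up names, and the "policy" is a single hard-coded action rather than a model.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnv:
    """Hypothetical environment: reach `goal` by incrementing a counter."""
    goal: int = 3
    state: int = 0
    # Call indices (1-based) on which an action silently fails (assumption).
    flaky_calls: set = field(default_factory=lambda: {2})
    calls: int = 0

    def observe(self) -> int:
        return self.state

    def act(self, action: str) -> None:
        self.calls += 1
        if self.calls in self.flaky_calls:
            return  # simulated silent failure: the state does not change
        if action == "inc":
            self.state += 1


def run_agent(env, max_steps=10, max_retries=2):
    """Observe, act, verify the effect, retry on failure, and log every attempt."""
    trace = []
    for step in range(max_steps):
        obs = env.observe()
        if obs >= env.goal:
            return True, trace
        action = "inc"  # trivial stand-in for a real policy
        for attempt in range(max_retries + 1):
            env.act(action)
            new_obs = env.observe()
            ok = new_obs == obs + 1  # error detection: did the action take effect?
            trace.append({"step": step, "attempt": attempt, "action": action,
                          "before": obs, "after": new_obs, "ok": ok})
            if ok:
                break
        else:
            return False, trace  # retries exhausted: report failure with the trace
    return env.observe() >= env.goal, trace
```

Calling `run_agent(ToyEnv())` returns a success flag plus the per-attempt trace, including the failed attempt and its retry. That trace is the kind of artifact a one-shot generation demo does not produce.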

Angel 🌼

@Angaisb_

19 days ago

Claude Fable 5 (max), Minecraft in HTML test Really, really good result, it even added background music I don’t know the exact cost, but it was around $30

wlu

@wlu314

19 days ago

this benchmark dropped the same day as Claude's Fable 5. there was so much hype around fable that the benchmark kind of faded into the noise. Opus 4.7 needs up to 49 hours just to run the hardest tier of Agents’ Last Exam and still only passes 2.6% of the tasks. the benchmark was built from 1,490 real professional workflows across 13 industry clusters and 55 subfields. over 300 domain experts contributed actual projects from their own work, tasks that normally take humans days or weeks to complete. Agents get full computer access (GUI + CLI) and are scored objectively on whether they deliver the correct final output.

Zengyi Qin

@qinzytech

19 days ago

Introducing Agents’ Last Exam (ALE) Built by 300+ domain experts from 100+ institutions Covering 55 industry domains Claude Opus 4.8 has 0.0% pass rate on the hardest subset Glad to have contributed to this benchmark

wlu

@wlu314

19 days ago

there’s a lot of research on benchmarking/evals: cheatsheet icl, data contamination, and agentic tool use retrieving answers. this abstracts away thinking and planning in models. it doesn’t remove it, but a guide serves as a point of reference for the model to rely on.
pathfinding in multi-screen areas is also affected by this, since the model can reference guide maps instead of learning the layout itself. when a model has access to a guide it doesn’t need to explore or build an internal map of the environment, it just looks up where to go.
unknown games provide a more useful way to understand how agents plan in game environments. i think that planning within these envs is the biggest part. if the computer use agent was given the ability to move toward all parts of the env., then it could plan and understand the limits of the env. without a guide the agent is forced to actually learn the space rather than retrieve a known path.
I have a lot of pokemon guides and a lot of them have images. Even Easter eggs that are unknown to agents could change the way the agent plans, since it has no reference point to fall back on. if the agent never saw that part of the game in training it has to genuinely reason about it, which is exactly the kind of planning we should be testing for. Thoughts? https://t.co/8606pdt3U8 https://t.co/vosOUv54hv
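
As a rough illustration of "build an internal map instead of looking up a guide", here is a toy Python sketch, not taken from any of the linked work. The grid, `probe`, `explore`, and `plan` are all hypothetical names: the agent only learns about a cell's neighbours by probing it, builds its own adjacency map, and then plans over what it learned. For brevity the exploration expands a frontier queue rather than simulating the agent physically walking between cells.

```python
from collections import deque

# Hypothetical map: '#' is a wall, everything else is walkable.
GRID = [
    "S.#",
    ".##",
    "..G",
]


def probe(pos):
    """Local perception only: the walkable neighbours of `pos`."""
    r, c = pos
    out = []
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(GRID) and 0 <= nc < len(GRID[0]) and GRID[nr][nc] != "#":
            out.append((nr, nc))
    return out


def explore(start):
    """Build an internal map (adjacency dict) from local probes, with no guide."""
    known = {}
    frontier = deque([start])
    while frontier:
        pos = frontier.popleft()
        if pos in known:
            continue
        known[pos] = probe(pos)
        frontier.extend(n for n in known[pos] if n not in known)
    return known


def plan(known, start, goal):
    """Breadth-first search over the learned map; returns a shortest path or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        pos = queue.popleft()
        if pos == goal:
            path = []
            while pos is not None:
                path.append(pos)
                pos = parents[pos]
            return path[::-1]
        for n in known.get(pos, []):
            if n not in parents:
                parents[n] = pos
                queue.append(n)
    return None
```

With a guide, the agent would skip `explore` and be handed the path directly. Measuring how much probing an agent needs before `plan` succeeds is one hypothetical way to separate exploration from retrieval.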

wlu

@wlu314

19 days ago

@ChaseBrowe32432 @ItsBrain4Brain not yet**

wlu

@wlu314

19 days ago

@ChaseBrowe32432 @ItsBrain4Brain https://t.co/4na9E3QtnP even a. zhang doesn't think that we can one shot video games tho

wlu

@wlu314

19 days ago

@ChaseBrowe32432 @ItsBrain4Brain but i would like to see it too

wlu

@wlu314

19 days ago

@ChaseBrowe32432 @ItsBrain4Brain are you talking about the computer use agents completing pokemon or other 2D games

wlu

@wlu314

19 days ago

@TTrimoreau you will never be able to understand your full stack, or won't have the time commitment to understand anything, and will likely have a lot of hallucination issues depending on your "unicorn". there was a guy selling peptides who was doing 9 figures in arr.

wlu

@wlu314

19 days ago

@NanouuSymeon haskell > rust
