Introducing WhateverAI: the Execution Layer for AI Agents
AI can think, plan, and reason.
But it still can't act across real apps, real devices, or real workflows.
We fix this.
We give agents a cloud environment where they can see, click, type, navigate, and execute.
5 things cloud phones can do that emulators can't
1/ run the real play store, the real app store, with the real drm. emulators get banned in 10 minutes.
2/ pass device attestation. safetynet, play integrity, device check: all green. emulators fail every time.
3/ receive real sms and otp. needed for any login flow that matters.
4/ get real gps, real imei, real carrier. needed for ride-share, delivery, banking.
5/ scale to hundreds of devices without melting your laptop. try that on an emulator.
every investor asks "why games?" as if games are a detour.
games are not a detour. games are the hardest real-time mobile automation problem you can pick. if your agent plays whiteout survival, your agent can also:
- run a vendor portal
- triage mobile tickets
- operate apps that don't have apis
we're not building a game bot. we're building an execution layer that happens to pass the hardest test.
you don't have to wait for the agent to finish.
we stream thinking deltas over sse, 10-second windows. you watch it reason in real time.
debugging an agent used to mean reading the logs after it crashed. now you can catch it mid-thought.
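for the curious, here's roughly what reading that stream could look like on the client side. a minimal sketch in python: the sse framing (data: lines, blank line ends an event) is the standard format, but the json payload shape and the "delta" field are assumptions for illustration, not our documented wire format.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete server-sent event."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # a blank line closes the current event
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):
            # comment or keep-alive line, nothing to deliver
            continue
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # the sse format drops one optional space after the colon
            buffer.append(value[1:] if value.startswith(" ") else value)
    # an event with no closing blank line is incomplete, so it is dropped


# hypothetical stream contents; the "delta" field is an assumed payload shape
sample = [
    ": keep-alive\n",
    'data: {"delta": "looking for the login button"}\n',
    "\n",
    'data: {"delta": "tapping it"}\n',
    "\n",
]
deltas = [json.loads(payload)["delta"] for payload in iter_sse_data(sample)]
```

any iterator of text lines from a streaming http response can stand in for `sample`.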
mobile games are the hardest environment for an agent. timed events, random popups, gacha dialogs, daily quests, alliance pings. the ui changes hourly.
if your agent can run a f2p gacha grinder for 168 hours without a human touching it, it can run a sales pipeline.
why? because both are long-horizon tasks with fuzzy goals and adversarial uis. games just run the clock faster.
this is phase iii of our roadmap. foundation (ship the exec layer) → vision (make it see) → skills (make it outlast you).
in progress: a 7x24h whiteout survival run. that's 168 hours, fully unmanned. we'll post the logs.
if this works, every "mobile agent" demo you've seen is a toy in comparison. stay tuned.
our agent: "tapped the login button"
also our agent: "there was no login button, i tapped the logo which also worked"
vision models have main character energy
regression testing mobile apps is soul-crushing. manual sweeps. ancient emulators. flaky appium scripts.
here's what it looks like with whateverai: agent receives the test plan in plain english. picks a device. runs the flow. screenshots the diffs.
if the flow breaks, the agent doesn't just fail. it tells you WHY in natural language. "tapped 'checkout', expected payment screen, got a 500 toast."
the qa team doesn't write selectors. they write intent. the agent handles the rest.
support ticket:
"the app crashes when i try to upload a photo on android 13, samsung s22"
old world: support engineer spends 2 hours reproducing
new world: agent spawns a cloud phone, matches the device, reproduces the bug, captures the stack, files the jira
hours → minutes
language support isn't a checkbox
whateverai ships with chinese + english out of the box. not because we translated a string file. because the vision model reads both natively.
that means your agent works on wechat, xiaohongshu, taobao, meituan, the same way it works on instagram, amazon, doordash.
most "mobile agents" die at the gfw. ours ships past it.
5 reasons dom-based automation is a dead end
1/ the web was the last environment where dom-scraping worked. mobile is not the web.
2/ reason one: no dom. native apps don't expose a tree. you're staring at pixels.
3/ reason two: dynamic layouts. one carrier update and your xpath is garbage.
4/ reason three: captchas, webviews, and hybrid UIs break parsers in new ways every week.
5/ reason four: accessibility trees lie. half the labels are wrong or missing.
6/ reason five: humans don't use the dom. if your agent is smarter than a human, it shouldn't need training wheels the human doesn't have.
dom-scraping agents: break when the button moves 4px
vision-native agents: "there's a blue button that says continue. i'll tap it."
the future looks at screens the way humans do.
one agent key.
hundreds of devices.
serial execution per key (AGENT_BUSY protects you from race conditions).
parallel across keys.
you can run an entire fleet from one codebase. each device gets its own vision context, its own plan, its own outcome.
this is how you scale agents past "cool demo" into actual work.
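a rough sketch of that pattern in python: one worker per key, goals run one at a time inside a key, keys run side by side. everything here is a stand-in. `FakeBackend`, `AgentBusy`, and the `execute` signature are hypothetical, written only to show the shape; the one thing taken from above is that a second in-flight call on the same key gets AGENT_BUSY.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class AgentBusy(Exception):
    """Hypothetical client-side error for an AGENT_BUSY response."""


class FakeBackend:
    """In-memory stand-in for the service: one in-flight task per key."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._busy: set[str] = set()

    def execute(self, agent_key: str, goal: str) -> str:
        with self._lock:
            if agent_key in self._busy:
                raise AgentBusy(agent_key)
            self._busy.add(agent_key)
        try:
            time.sleep(0.01)  # pretend the phone is working
            return f"{agent_key}: {goal}"
        finally:
            with self._lock:
                self._busy.discard(agent_key)


def run_serial(backend: FakeBackend, agent_key: str, goals: list[str]) -> list[str]:
    # one key, one goal at a time, so AGENT_BUSY never fires on this path
    return [backend.execute(agent_key, goal) for goal in goals]


def run_fleet(backend: FakeBackend, plan: dict[str, list[str]]) -> dict[str, list[str]]:
    # one worker thread per key, so different keys run in parallel
    with ThreadPoolExecutor(max_workers=max(1, len(plan))) as pool:
        futures = {
            key: pool.submit(run_serial, backend, key, goals)
            for key, goals in plan.items()
        }
        return {key: future.result() for key, future in futures.items()}


plan = {
    "key-a": ["open settings", "check version"],
    "key-b": ["open inbox"],
}
results = run_fleet(FakeBackend(), plan)
```

the design choice worth noting: the client never has to coordinate between devices. it only has to keep each key's queue in order.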
why we collapsed 12 tools into 3
1/ every mcp server out there: 40 tools, 200 parameters, 10k token descriptions. agents drown in their own toolbelt.
2/ we shipped whateverai with 12 tools. tap. swipe. scroll. type. read_screen. blah blah. agents spent more tokens picking tools than solving problems.
3/ so we cut. 12 → 3. execute. execute_and_wait. task_result. that's it. the agent plans, the phone acts.
4/ the unlock: the tool isn't the action. the goal is the action. let the agent reason at the goal level, let the execution layer figure out the clicks.
5/ fewer tools = better planning = faster agents. ship less, do more.
last month: 12 fine-grained ui tools
this month: 3 tools
cloudphone_execute
cloudphone_execute_and_wait
cloudphone_task_result
your agent doesn't need a swiss army knife.
it needs a hammer that works.
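one way to picture how the three tools relate: execute_and_wait behaves like execute followed by polling task_result until the task stops running. the python below is a sketch under that assumption. the callables, the "status" field, and the polling defaults are all hypothetical, not the actual tool schemas.

```python
import itertools
import time
from typing import Callable


def execute_and_wait(
    execute: Callable[[str], str],
    task_result: Callable[[str], dict],
    goal: str,
    poll_seconds: float = 1.0,
    timeout_seconds: float = 60.0,
) -> dict:
    """Submit a goal, then poll for its result until it finishes or times out."""
    task_id = execute(goal)
    deadline = time.monotonic() + timeout_seconds
    while True:
        result = task_result(task_id)
        if result["status"] != "running":
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still running after {timeout_seconds}s")
        time.sleep(poll_seconds)


def make_fake_phone(polls_until_done: int):
    """Build stand-in execute/task_result callables for a dry run."""
    calls = itertools.count(1)

    def execute(goal: str) -> str:
        return "task-1"

    def task_result(task_id: str) -> dict:
        if next(calls) < polls_until_done:
            return {"status": "running"}
        return {"status": "done", "task_id": task_id}

    return execute, task_result


fake_execute, fake_task_result = make_fake_phone(polls_until_done=3)
outcome = execute_and_wait(
    fake_execute,
    fake_task_result,
    "open settings and turn on dark mode",
    poll_seconds=0.0,
)
```

the agent only ever passes a goal in and gets an outcome back. the taps and swipes stay on the other side of the call.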
. @AnthropicAI accidentally shipped 500K lines of Claude Code source to npm.
no credentials leaked. no customer data exposed.
but it's a reminder: when agents operate autonomously, governance tooling isn't a nice-to-have.
isolated sandboxes. encrypted execution. hardware-bound credentials.
this is why we built WhateverAI the way we did.